<<

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Running head: THE IMPACT OF INCONGRUENCE ON THE ROOT 1

1 The impact of incongruence and exogenous fragments on estimates of the eukaryote

2 root

3

1 2 4 Caesar Al Jewari and Sandra L. Baldauf

5

6 Program in Systematic Biology, Department of Organismal Biology, Uppsala University,

7 Uppsala, Sweden 75236

8

1 9 Email: [email protected]

2 10 Email: [email protected]

11

12 ORCIDs:

13 Caesar Al Jewari: 0000-0002-0868-0384

14 Sandra L. Baldauf: 0000-0003-4485-6671

15

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 2

16 Abstract

17 Phylogenomics uses multiple genetic loci to reconstruct evolutionary trees, under the

18 stipulation that all combined loci share a common phylogenetic history, i.e., they are

19 congruent. Congruence is primarily evaluated via single-gene trees, but these trees invariably

20 lack sufficient signal to resolve deep nodes making it difficult to assess congruence at these

21 levels. Two methods were developed to systematically assess congruence in multi-locus data.

22 Protocol 1 uses gene jackknifing to measure deviation from a central mean to identify

23 taxon-specific incongruencies in the form of persistent outliers. Protocol_2 assesses

24 congruence at the sub-gene level using a sliding window. Both protocols were tested on a

25 controversial data set of 76 mitochondrial proteins previously used in various combinations

26 to assess the eukaryote root. Protocol_1 showed a concentration of outliers in under-sampled

27 taxa, including the pivotal taxon Discoba. Further analysis of Discoba using Protocol_2

28 detected a surprising number of apparently exogenous gene fragments, some of which

29 overlap with Protocol_1 outliers and others that do not. Phylogenetic analyses of the full data

30 using the static LG-gamma evolutionary model support a neozoan-excavate root for

31 (Discoba sister), which rises to 99-100% bootstrap support with data masked

32 according to either Protocol_1 or Protocol_2. In contrast, site-heterogeneous (mixture)

33 models perform inconsistently with these data, yielding all three possible roots depending on

34 presence/absence/type of masking and/or extent of missing data. The neozoan-excavate root

35 places (including and fungi) and Diaphoretickes (including ) as

36 more closely related to each other than either is to Discoba (Jakobida, Heterolobosea, and

37 ), regardless of the presence/absence of additional taxa.

38

39 Keywords

40 phylogenomics, congruence, eToL, eukaryote root, Discoba, mitochondrial proteins.

41

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 3

42 Introduction

43 Over the past twenty years, large multigene (phylogenomic) datasets have confidently

44 and consistently identified the broad outlines of the eukaryote tree of life (eToL) (Baldauf et

45 al. 2000; Bapteste et al. 2002; Brown et al. 2018; Burki et al. 2016, 2020; Strassert et al.

46 2019). As a result, nearly all known eukaryotes can be assigned to one of three supergroups:

47 Amorphea, Diaphoretickes and . Amorphea includes animals (), fungi

48 ( or Nucletmycea), amoebozoan () and several less well

49 characterized lineages. Diaphoretickes encompasses nearly all other well-known eukaryotes

50 including land plants and nearly all of the algae interspersed with numerous

51 non-photosynthetic lineages such as , and apicomplexan parasites.

52 Excavata is by far the least well known and most poorly understood of the supergroups and is

53 thought to consist mainly of Discoba, Metamonada, and, possibly Malawimonadida (Adl et

54 al. 2018). Within Excavata only the monophyly of Discoba (Jakobida, Euglenozoa and

55 Heterolobosea) is well-established (Kamikawa et al. 2014). Discobans are largely free-living

56 and have functional, if molecularly diverse mitochondria, while all known are

57 micro- or anaerobic and lack mitochondrial DNA (mtDNA) and nearly all mitochondrial

58 proteins (Hjort et al. 2010).

59 One of the most important outstanding questions in eukaryote phylogeny is the

60 position of the root. Since eukaryote genomes are mosaics with roughly half of their of

61 clearly archaeal or bacterial origin (Brueckner & Martin 2020; Cotton & McInerney 2010),

62 eToL can be rooted with either bacterial or archaeal homologs. Of these, archaeal genes trace

63 to the eukaryote first common ancestor (FECA), while bacterial genes trace to the origin of

64 endosymbiotic organelles. Since mitochondria arose after FECA but before the eukaryote last

65 common ancestor (LECA; Gabaldón 2018), this makes bacteria a potentially closer outgroup

66 to eukaryotes than archaea. Nonetheless, rooting eToL with bacteria is complicated by the

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 4

67 facts that only ~10% of mitochondrial proteins (mitoPs) are shared across eukaryotes (Gray

68 2012), and these proteins trace to different groups of bacteria, although mostly alpha- and

69 gamma-proteobacteria (aP_bac and gP_bac, respectively) (Kurland & Andersson 2000). In

70 addition, Discoba are the only excavates with mitoPs meaning that an eToL rooted with

71 bacteria can include only one representative of the diverse and not necessarily monophyletic

72 Excavata (Karnkowska et al. 2016). Thus, an eToL based on mitoPs is only one step in

73 defining the eukaryote root, albeit an important one.

74 Previous attempts to recover the eToL root with mitoP data are mainly the work of

75 Derelle and Lang 2012 (DL2012), He et al. 2014 (H2014) and Derelle et al. 2015 (D2015).

76 These studies used partially overlapping subsets of mitoPs and similar taxa but reached

77 different conclusions. DL2012 and D2015 recovered the more traditional unikont- root

78 (Amorphea sister), while H2014 recovered a novel neozoan-excavate root (Discoba sister).

79 These studies differed mainly in outgroup selection (aP_bac only - DL2012/D2015, versus

80 mainly aP - and gP_bac - H2014) and inclusion (DL2012/D2015) or exclusion (H2014) of

81 Malawimonads. In addition, the two studies differ in phylogenetic method, with DL2012 and

82 D2015 using site-heterogeneous evolutionary models and removal of fast evolving site

83 (FSR), while H2014 used site-homogeneous models with no FSR and a site heterogeneous

84 model with a covarion process. Both studies also reported evidence of substantial

85 incongruence in the other data sets.

86 Congruence is a core premise of phylogenomics, i.e., the assumption that the loci to

87 be combined for phylogenetic analysis share the same underlying evolutionary history

88 (Huelsenbeck et al. 1996; Roger et al. 2012). If this premise is met, then increasing data

89 should amplify a weak common signal (Kupczok et al. 2010; Philippe et al. 2011; Wägele et

90 al. 2009). Thus, assessing congruence across a multi-gene dataset is critical. This is generally

91 done by evaluating single gene trees (SGTs) by either visual or automated inspection.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 5

92 However, SGTs generally lack sufficient phylogenetic signal to resolve the deepest branches

93 in a tree, making it particularly difficult to assess incongruence at deep levels. Variation in

94 length and degree of conservation among loci also make it challenging to apply a uniform

95 standard evaluation process for SGTs. Thus, evaluating SGTs individually becomes a

96 laborious and troublesome process for large numbers of loci.

97 The primary source of molecular incongruence is when genes arising by ancient

98 duplication (out-paralogs) or by horizontal transfer (xenologs) are mistaken for strictly

99 vertically inherited genes (orthologs), since only orthologs directly track organismal

100 phylogeny. Strategies for tackling the congruency problem can be classified into two general

101 categories: 1) detecting persistent outliers by measuring the deviation from a global reference

102 point or 2) clustering genetic loci into subsets based on a compatibility score. Both

103 approaches require a metric to measure the compatibility scores, whether between loci

104 (partitions) or between partitions and a global reference point. This metric is usually based on

105 tree dissimilarity (distance), tree likelihood, or a combination of the two (Smith et al. 2020).

106 The most widely cited methods for automated testing of congruence are Conclustador

107 (Leigh et al. 2011) and Phylo-MCOA (De Vienne et al. 2012). Conclustador uses a

108 distribution of bootstrap or posterior trees for each locus. It then measures the distance

109 between each pair of loci by counting the number of shared bipartitions in their tree set, after

110 removing non-shared taxa. Phylo-MCOA (De Vienne et al. 2012) uses multiple co-inertia

111 analysis to identify outliers at two levels: complete outliers (entire loci or species) and cell by

112 cell outliers (specific loci for specific species). Missing data are filled in with average values

113 from across the data. The method uses only a single tree for each locus, so the statistical

114 significance of topological differences is not taken into account.

115 We have developed two algorithmic approaches to assess incongruence in multi-gene

116 data. Protocol_1 is a combinatorial method using gene jackknifing to rank deviation from a

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 6

117 central mean. The ranking is measured by quantifying topological inconsistencies in data

118 subsamples and extracting sequence-specific deviation. The problem of missing data is

119 overcome by requiring each jackknife sample to include all taxa. The second method uses a

120 sliding window to assess inconsistent signal at the sub-gene level using a pre-selected set of

121 target taxa to narrow the search space. Both methods are applied to a set of core mitoPs

122 previously used to test the eukaryote root.

123

124 Materials and Methods

125 Data sources

126 Sequences were obtained primarily from GenBank (Benson et al. 2013) and iMicrobe

127 (Youens-Clark et al. 2019). Additional sources were used for Amoebozoa (Kang et al. 2017;

128 provided by the author), Dictyostelia (Fey et al. 2013), and three Discoba (Acrasis kona,

129 Andalucia godoyi, Seculamonas ecuadorienses; Fu et al. 2014; He et al. 2014). Taxa were

130 selected to give broad and balanced taxonomic coverage of the main eukaryotic groups and

131 the main bacterial outgroups, with a preference for organisms with higher sequence coverage.

132 The complete data set includes 208 organisms (Supplementary Table S1).

133

134 Data assembly

135 Sequences were identified by BLASTp, and all hits below an e-value of 1e-5 were

136 retrieved. The BLAST queries consisted of all 75 proteins used in various combinations in

137 three previous studies (Derelle et al. 2015; Derelle & Lang 2012; He et al. 2014). Homologs

138 for each protein were searched for separately using probes from three different organisms,

139 which together represent the three major divisions of eukaryotes - Naegleria gruberi

140 (Excavata), Capsaspora owczarzaki (Amorphea) and Selaginella moellendorffii

141 (Diaphoretickes). The retrieved protein sequences were aligned using Mafft 7.450, with

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 7

142 different algorithms depending on the number of sequences (Katoh & Standley 2013);

143 L-INS-i --maxiterate 1000 was used for alignments of ≥250 sequences and

144 FFT-NS-i --maxiterate 1000 otherwise. The resulting alignments were trimmed using the

145 gappyout option in trimAl 1.4 (Capella-Gutiérrez et al. 2009). Visual inspection identified 21

146 alignments in which automated trimming either removed too many columns or left too many

147 gaps to be useful for sliding window analysis (Protocol_2, below); these alignments were

148 instead trimmed by hand for Protocol_2 analyses only.

149 SGT analyses used individual protein alignments, which were iteratively screened for

150 potential horizontal gene transfer (HGT) and out-paralogs by multiple rounds of phylogenetic

151 analysis using FastTree (-gamma -lg -spr 6 -mlacc 2 -slownni -slow; Price et al. 2010). This

152 was followed by a final more stringent phylogenetic assessment using RAxML-NG with the

153 LG+GAMMA model and standard bootstrapping (Kozlov et al. 2019). The screening resulted

154 in seven proteins being rejected because they either showed strong support for major conflicts

155 with established eukaryote phylogeny (Adl et al. 2018), indicative of HGT or deep paralogy,

156 or lacked representation of one of the three major eukaryote groups. A further eight

157 alignments were found to consist of two strongly separated eukaryote-universal paralogs

158 (100% bootstrap support), and these were each split into two separate alignments. The result

159 is a final dataset of 76 proteins and 180 taxa.

160

161 Congruence tests – Protocol_1

162 Protocol_1 assesses data outliers through random subsampling of partitions

163 (jackknifing; Fig. 1). The three mains steps are 1) subsample assembly, 2) tree inference, and

164 3) tree evaluation. Step 1 generates the subsamples and an inclusion table that tracks the

165 partition composition of each subsample (Fig. 1a). The subsamples are created at random

166 without replacement with two requirements: i) each sequence must be present in an equal

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 8

167 number of subsamples and ii) each subsample must contain at least one sequence for every

168 taxon. The first requirement is achieved by randomly splitting the dataset into subsets with

169 equal numbers of partitions. This is repeated until reaching the desired number of subsets

170 while satisfying the equal inclusion frequency requirement. The equal inclusion requirement

171 means that the chosen number of subsamples must be divisible by the total number of

172 partitions. The size of subsample, in turn, depends on the taxon-wise ratio of missing data,

173 i.e., the ratio of missing versus available markers for each taxon. If one or more taxa have a

Tree #1 Tree #2 Tree #3 Tree #4

174 high ratio (roughly, >50% missing), it may be necessary to either increase the subsample size

Tree #5 Tree #6 Tree #7 Tree #8

175Protocol or to1 (Geneexclude jackknifing) the highly-incomplete taxon in order to satisfy the second requirement. Tree #1 Tree #2 Tree #3 Tree #4 TreeTree #9 #1 TreeTree #10 #2 TreeTree #11 #3 TreeTree #12 #4 AL JEWARI, C., & BALDAUF, S. L. Tree #1 Tree #2 Tree #3 Tree #4 1

a) b) Tree #5 Tree #6 Tree #7 c) Tree #8 Tree #1 Tree #2 Tree #3 Tree #4 TreeTree #13 #5 TreeTree #14 #6 TreeTree #15 #7 TreeTree #16 #8 1 Tree #5 Tree #6 Tree #7 Tree #8 z d)

Tree #1Tree #9 Tree #2Tree #10 Tree #3Tree #11 Tree #4Tree #12 input data for n sequences Tree #1 Tree #2 Tree #3 Tree #4 Tree #1 Tree #2 Tree #5Tree #3 Tree #6Tree #4 TreeTree #9 #7 TreeTree #10 #8 Tree #11 Tree #12 Tree #9 Tree #10 Tree #11 Tree #12 1 n

Tree #1 Tree #2 Tree #3 Tree #4 Tree #5 Tree #6 Tree #7 Tree #8 Tree #13 Tree #5Tree #14 Tree #6Tree #15 Tree #7Tree #16 Tree #8 Tree #5 Tree #6 Tree #9Tree #7 Tree #10Tree #8 TreeTree #13 #11 TreeTree #14 #12 Tree #15 Tree #16 Tree #13 Tree #14 Tree #15 Tree #16 s distance matrices

Tree #5 Tree #6 Tree #7 Tree #8 Tree #9 Tree #10 Tree #11 Tree #12 subsamples Tree #9 Tree #10 Tree #11 Tree #12 Tree #9 Tree #10 Tree #13Tree #11 Tree #14Tree #12 Tree #15 Tree #16

Tree #9 Tree #10 Tree #11 Tree #12 Tree #13 Tree #14 Tree #15 Tree #16 taxaTaxa h) Tree #13 Tree #14 g)Tree #15 Tree #16 Tree #13 Tree #14 f) Tree #15 Tree #16 e) deviation matrices 1 1 1 Tree #13 Tree #14 Tree #15 Tree #16 Genes genes s s s ingrouIngroupp Outgroupingroup ingroupIngroup outgroupOutgroup Ingroupingroup oOutgrouputgroup 1

2 Figure 1. Protocol_1: Estimating phylogenetic incongruence scores in multigene data by random sub-sampling (jackknifing). The protocol 3 consists of 8 main steps. a) Random balanced subsampling of ’n’ sequences generates ’s’ subsamples, with sequence distribution across 4 subsamples recorded in an inclusion table. b) Maximum likelihood trees are estimated for each subsample, and c) each tree is converted into a 5 nodal distance matrix. d) Tree deviations are estimated with reference to the inner mean across the ’z’ axis. e) Matrices are split into ingroup 6 and outgroup to be assessed separately, and f) a sum of deviations is calculated for each taxon by summing across columns. g) Taxon-gene 7 values are scaled by multiplying by the mean of the lowest 10% values across all subsamples. h) The inclusion table from step 1 is used to 8 convert scaled values from step 7 into a single scoring matrix. In an optional final step, the scoring matrix may be converted into a heat map 9 using additional scripts. 176

177 Step 2 of Protocol_1, the only computationally demanding step, generates a maximum

178 likelihood (ML) tree for each subsample by ML analysis using RAxML-NG 0.9 and the LG

179 substitution model (Fig. 1b). Unlike steps 1 and 3, which can be executed on a desktop

180 computer, step 2 is better suited for multicore parallel processing. Step 3 uses the output trees

181 from step 2 to calculate sequence-specific outlier scores. The script then masks sequences

182 with the top 10% highest scores, and outputs the resulting masked (trimmed) concatenated

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 9

183 dataset (Fig 1c-h). Outlier scores are calculated by first converting each tree to a pair-wise

184 nodal distance matrix (Fig 1c). Deviation values are then calculated for each taxon-pair for

185 each distance matrix from the calculated truncated mean at 35% for all taxon-pair distances

186 across all matrices according to equation 1:

�,, = (�,, − �,) (1)

187 where si,j,n is the distance between the taxon pair (i,j) at the nth subsample, and ui,j is the

188 truncated mean for the taxon pair (i,j) for the subsamples (Fig. 1d). Deviation scores are then

189 rescaled within each matrix by squaring each deviation value and dividing it by the mean of

190 all deviation values in that matrix. This gives higher weight to deviations in trees with fewer

191 but larger errors.

192 The deviation matrices are then split into ingroup and outgroup submatrices and

193 analysed separately (Fig. 1e). This is a precautionary measure, otherwise optional, to avoid

194 interference from the high phylogenetic noise often present in or arising from the outgroup,

195 which may overwhelm a weaker ingroup signal. The outgroup matrix also includes a single

196 ingroup taxon, specifically the one with the lowest standard deviation as calculated from the

197 ingroup submatrix. This taxon serves as a stable reference point for calculating instability

198 scores for the outgroup taxa.

199 Finally, cumulative taxon-specific deviation scores are calculated for each subsample

200 by summing the values in their respective deviation matrix (column wise or row wise). The

201 sum of scores of each sequence-taxon data point are then rescaled by multiplying them by the

202 mean of the lowest 10% points across all the data points for that sequence-taxon (Fig. 1g).

203 The inclusion table from step 1 (Fig. 1a) is then used to transfer the sum of deviation scores

204 from the subsamples to the corresponding proteins. Finally, the original data set is modified

205 by masking the sequences that rank in the top 10% deviation scores, followed by masking

206 any taxon or partition with more than 50% missing data and outputting the final filtered

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 10

207 dataset. To help visualize the results, the full set of deviation scores across the data set can be

208 converted into a heatmap (Fig. 1h, script available on request).

209

210 Congruence tests – Protocol_2

211 Protocol_2 uses a sliding window to evaluate orthology at the sub-gene/protein level.

212 The method requires designating a monophyletic group of interest (test taxon) and a

213 phylogenetically conservative and well-resolved taxon set (core taxa) to provide a stable

214 background against which to evaluate the test group data (Fig. 2). The stable group and test

215 group are defined a priori, for example based on results from Protocol_1. Data are prepared

216 as above, by alignment and trimming of individual orthologous sequence sets. Alignments Tree #1 Tree #2 Tree #3 Tree #4

217 are Protocolthen inspected 2 visually to identify windows missing all data from either the test or core Tree #1 Tree #2 Tree #3 Tree #5 Tree #4 Tree #6 Tree #7 Tree #8

218 taxon groups, and these windows are then deleted for Protocol_2. 10 Tree #1 Tree #2 Tree #3 Tree #4 Tree #5 Tree #6 Tree #7 TreeTree #9 #1 Tree #8 TreeTree #10 #2 TreeTree #11 #3 TreeTree #12 #4 Tree #1 Tree #2 Tree #3 Tree #4

a) b) Tree #1 TreeTree #2c) #5 TreeTree #3 #6 TreeTree #4 #7 Tree #8 Tree #9 Tree #10 Tree #11 TreeTree #13 #5 Tree #12 TreeTree #14 #6 TreeTree #15 #7 TreeTree #16 #8 TreeTree #1#5 Tree #2#6 Tree #7#3 Tree #8#4 chops protein x (unstable taxa only) test data sets trees Tree #1 Tree #2 Tree #3 Tree #4 Tree #1 TreeTree #2 #1 TreeTree #3 #2 TreeTree #4 #3 Tree #4 Tree #1 Tree #2 Tree #5Tree #3 TreeTree #6Tree #9 #4 TreeTree #7 #10 TreeTree #8 #11 Tree #12 Tree #13 Tree #14 Tree #15 Tree #9 Tree #16 Tree #10 Tree #11 Tree #12 TreeTree #5#9 TreeTree #10#6 TreeTree #11 #7 TreeTree #12 #8 stable groups

Tree Tree#5 #5 Tree Tree#6 #6 Tree #7Tree #7 Tree #8Tree #8 (core taxa) Tree #13 Tree Tree#5 #14 Tree Tree#6 #15 Tree Tree#7 #16 Tree #8 i TreeTree #5 #1 TreeTree #6 #2 Tree #9TreeTree #7 #3 Tree #10TreeTree #8 #4 TreeTree #13 #11 TreeTree #14 #12 Tree #15 Tree #16 Tree #13#9 TreeTree #10#14 Tree #15#11 Tree #16#12 i unstable window 1 ii window 3 groups chop chop 1 chop 2 Tree Tree#9 #9 TreeTree Tree#10 #9 #10 TreeTree #11Tree #10 #11 TreeTree #12Tree #11 #12 Tree #12 (test taxa) iii i n + 1 TreeTree #9 #5 TreeTree #10 #6 Tree #13TreeTree #11 #7 Tree #14TreeTree #12 #8 Tree #15 Tree #16 Tree #13 Tree #14 Tree #15 Tree #16 window 2 window n ii

ii Tree Tree#13 #13 TreeTree Tree#14 #13 #14 TreeTree #15Tree #14 #15 TreeTree #16Tree #15 #16 Tree #16 TreeTree #13 #9 TreeTree #14 #10 TreeTree #15 #11 TreeTree #16 #12

iii iii

11 Tree #13 Tree #14 Tree #15 Tree #16

12 Figure 2. Protocol_2: Sliding window assessment of congruence for protein fragments. The example shown is for testing fragment 13 incongruence in three taxa (“test taxa”), which can be single taxa or clades of related taxa. A taxonomically broad set of stable taxa (“core 14 taxa”, selected e.g., based on Protocol_1) provide a stable background against which to test congruence. a) Each protein is broken into 2-5 15 equal size fragments or “chops”, depending on protein size according to equation 1. Adjacent chops are then combined pairwise to create 16 semi-overlapping windows (50% overlap). In this example, the protein is divided into 5 chops which combine to give 4 overlapping windows. b) 17 Alignments for the core taxa are assembled separately for each window for each of the three test taxa (labelled i-iii) resulting in 12 alignments 18 for a 4-window protein. c) Phylogenetic trees are constructed for each window for each test taxon set using four method-model combinations: 19 RAxML LG, RAxML LG+gamma, IQTREE LG, IQTREE LG+gamma. A taxon-specific window is flagged if its sequence occurs nested within 20 any and the same core taxon clade in all four trees. Chops that are flagged for both underlying windows in which it occurs are scored as 21 potentially incongruent chops (PICs). Scripts are available from https://github.com/caeu/. 219

220 The protocol has three main steps: 1) sliding window data assembly, 2) tree inference,

221 and 3) tree evaluation (Fig. 2). Similar to protocol 1, the pipeline implementation assumes

222 running steps one and three on a local computer, while step two requires a multicore system.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 11

223 Step one first divides each trimmed alignment into 3-10 equal size fragments (chops), with

224 the size and number of chops based on alignment length according to the equation:

(�����ℎ)! − (�ℎ������)! (��� − ���) + 1 number of chops = ⋅ (2) (�������)! − (�ℎ������)! �����ℎ

225 where length is alignment length, and shortest and longest are the shortest and longest

226 alignments in the data set (chop size may vary +/–1 for alignments not precisely divisible by

227 the number of chops). The chops are then combined sequentially in pairs to generate a sliding

228 window across the alignment, with 50% overlap between each chop pair (Fig. 2a). Any

229 window with less than 20 gap-free columns for any major subdivision of the test group is

230 then omitted as these are unlikely to give useful information. A map of the chops’ structure is

231 generated and saved to record which chops are in which data set and to keep track of each set

232 of sliding windows.

233 Chop pairs are analyzed separately for each division of test taxa by phylogenetic

234 analysis in combination with full alignments for the core taxa (Fig. 2b). Four ML trees are

235 constructed for each sliding window using four different method-model combinations:

236 RAxML-NG and IQ-TREE with LG and LG+Gamma models (Fig. 2c). Chops for which the

237 test taxa are nested within one and the same division of the stable group in all four trees and

238 in two contiguous windows are marked as violating orthology (potentially incongruent chops

239 or “PICs”). Chops at the ends of an alignment or neighboring a missing data window are

240 evaluated based only on the single window in which they occur. In a final step, all PICs are

241 masked from the full data set, and the remaining data are concatenated to produce the final

242 Protocol_2 output. Alternatively, Protocol_2 can be limited to alignments with a minimum of

243 one sequence for each of the test subgroups. This restriction can be applied before and/or

244 after implementing Protocol_2 analysis.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 12

245 All calculations for both protocols are implemented in R v3.6.3 (R Core Team 2020).

246 The import and transformation of trees and sequences use the Ape package v5.0 (Paradis &

247 Schliep 2019). The R package Rfast is used to speed up parts of the calculations (Tsagris &

248 Papadakis 2018). All scripts are available from https://github.com/caeu/eukRoot.

249

250 Multiprotein Phylogeny

251 The final mitoP supermatrix was analysed with five different method-model

252 combinations: 1) RAxML NG with the LG+Gamma model and standard bootstraps (Kozlov

253 et al. 2019), 2) IQ-TREE with the LG+Gamma model and ultra-fast-bootstraps (UFB)

254 corrected with the “bnni” option (Hoang et al. 2018), 3,4) IQ-TREE with the C60+LG+G+F

255 model and PMSF bootstraps and also with UFB+bnni (Wang et al. 2018), and 5)

256 PhyloBayes-MP 1.8 (Lartillot et al. 2013). For IQ-TREE C60 analyses, the ML tree and the

257 UFB bootstraps were inferred first using the model, and then the ML tree was used to

258 generate the site-specific frequency profile for the PMSF bootstraps. PhyloBayes was run

259 with two independent chains for 10,000 generations, and a consensus tree and posterior

260 probabilities were calculated from both chains once they had converged.

261 All analyses other than tree reconstruction were implemented as R Pipeline scripts

262 and executed on a local computer. Tree reconstruction and bootstrapping were run with the

263 computational resources of Uppsala Multidisciplinary Center for Advanced Computational

264 Science (UPPMAX).

265

266 Results

267 Data filtration Protocol_1

268 Two complementary methods have been developed to screen phylogenomic data for

269 potentially problematic components. The first method screens for incongruent taxon-specific

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 13

270 loci by identifying and ranking high influence outliers using a whole gene/protein resampling

271 approach (Protocol_1; Fig. 1). The second method uses a sliding window protocol to screen

272 for potentially incongruent sequence fragments (Protocol_2). The two protocols were applied

273 to a data set of 76 mitochondrial proteins (mitoPs) used in various combinations in three

274 previous studies examining the position of the eukaryote root, abbreviated here as D2015,

275 DL2012, and H2014 (Derelle et al. 2015; Derelle & Lang 2012; He et al. 2014).

276 For Protocol_1 (Fig. 1), initial trials with the 76 mitoP proteins indicated that 14 was

277 the minimum number sufficient to avoid creating subsets with complete missing data for any

278 taxon, while being large enough to produce a stable distribution of well-resolved trees against

279 which to detect aberrations. Each protein was then sampled 364 times, creating a total of

280 1976 subsamples, which is the closest number to the 2000 sample target that is divisible by

281 76 (total number of proteins). The resulting phylogenies were used to produce dataset-wide

282 influential outlier scores, displayed here as a heatmap (Fig. 3), after which the sequences in

283 the top 10% of outlier scores were then masked. Since masking increases missing data, the

284 revised data set was trimmed column wise and row wise for excess missing data using a 50%

285 cutoff (Fig. 4). i.e., any protein or taxon with greater than 50% missing entries was removed

286 (Fig. 4, black marks along x and y axes). Altogether Protocol_1 reduces the mitoP data from

287 180 taxa and 76 proteins to 156 taxa and 66 proteins in preparation for phylogenetic analysis

288 (below).

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 14

10.0 7.5

Genes 5.0 2.5

Hafnia_sp_

Danio_rerio_ Togula_jolla_ Cafeteria_sp_ Rosalina_sp_ Ammonia_sp_ Lotharella_sp_ Bodo_saltans_ Acrasis_kona_ Volvox_carteri_ Apis_mellifera_ Salinivibrio_sp_ Hydra_vulgaris_ Opitutus_terrae_ Guillardia_theta_ Blastocystis_sp_Chromera_velia_ Albugo_candida_ Ochromonas_sp_ Laccaria_bicolor_Ustilago_maydis_ Erythrobacter_sp_ Streptomyces_sp_ Phytophthora_sp_ Diplochlamys_sp_ Amorphus_coralli_ Naegleria_gruberi_ Corallococcus_sp_ Ananas_comosus_ Fistulifera_solaris_ Polarella_glacialis_ Andalucia_godoyi_ Gonium_pectorale_ Oxytricha_trifallax_ Stentor_coeruleus_ Exaiptasia_pallida_ Chondrus_crispus_ Besnoitia_besnoiti_ Neobodo_designis_ Kordiimonas_lacus_ Rhodella_maculata_ Chlorella_variabilis_ Palpitomonas_bilix_ Filamoeba_nolandi_ Acanthoecalike_sp_ Fluviicola_taffensis_ Oryza_brachyantha_ Aplanochytrium_sp_ Perkinsus_marinus_ Neospora_caninum_Toxoplasma_gondii_ Pacificimonas_flava_ Sphingomonas_taxi_ Cystoisospora_suis_ Trypanosoma_grayi_ Angomonas_deanei_ Priapulus_caudatus_ Cystobacter_fuscus_ Stylonychia_lemnae_ Bigelowiella_natans_Reticulomyxa_filosa_ Trypanosoma_vivax_ Rhodomonas_salina_ Strigomonas_culicis_ Trichosphaerium_sp_ Mortierella_elongata_ Salpingoeca_rosetta_ Caulobacter_henricii_ Gonapodya_prolifera_ Aliifodinibius_roseus_ Porphyra_umbilicalis_ Scrippsiella_Hangoei_ Trypanosoma_brucei_ Aspergillus_nidulans_ Monosiga_brevicollis_ Longilinea_arvoryzae_ Archangium_gephyra_ Galdieria_sulphuraria_ Aphanomyces_astaci_ Blastocystis_hominis_ Hyalangium_minutum_Myxococcus_xanthus_ Idiomarina_salinarum_ Rhodosorus_marinus_ Proteomonas_sulcata_ Trypanosoma_rangeli_ Moniliophthora_roreri_ Catenaria_anguillulae_ Gracilariopsis_chorda_ Aspergillus_aculeatus_ Trichoplax_adhaerens_ Arsenicibacter_rosenii_ Rhizamoeba_saxonica_ Chitinophaga_eiseniae_Chthoniobacter_flavus_ Stigmatella_aurantiaca_ Klebsormidium_nitens_ Physcomitrella_patens_ Phaeospirillum_fulvum_ Oceanospirillum_maris_ Vibrio_nigripulchritudo_ Vitrella_brassicaformis_ Paludisphaera_borealis_ Hemiselmis_andersenii_ Ectocarpus_siliculosus_ Echinamoeba_exudans_ Ardenticatena_maritima_ Alteromonas_australica_ Hammondia_hammondi_ Paramecium_tetraurelia_ Amphidinium_massartii_ Nematostella_vectensis_ Azospirillum_brasilense_ Marchantia_polymorpha_ Chlorarachnion_reptans_ Leishmania_braziliensis_ Eutreptiella_gymnastica_ Allomyces_macrogynus_ Branchiostoma_floridae_ Capsaspora_owczarzaki_ Cafeteria_roenbergensis_ Rhizophagus_irregularis_ Protostelium_nocturnum_ Leptomonas_pyrrhocoris_ Caenispirillum_salinarum_ Percolomonas_cosmoWS_ Spizellomyces_punctatus_ Compsopogon_coeruleus_ Selaginella_moellendorffii_ Tetrahymena_thermophila_ Dictyostelium_purpureum_ Rhodospirillum_centenum_ Cyanidioschyzon_merolae_ Elphidium_margaritaceum_ Percolomonas_cosmoAE1_ Clastostelium_recurvatum_Vermamoeba_vermiformis_Acanthamoeba_castellanii_ Stegodyphus_mimosarum_ Marinobacterium_lutimaris_ Spongospora_subterranea_ Cryptodifflugia_operculata_ Hydrocarboniphaga_effusa_ Cryptomonas_paramecium_ Thalassiosira_pseudonana_Nannochloropsis_gaditana_ Partenskyella_glossopodia_ Nitrospirillum_amazonense_ Plasmodiophora_brassicae_ Polysphondylium_pallidum_Cavostelium_apophysatum_ Skermanella_stibiiresistens_ Thalassomonas_actiniarum_ Coccomyxa_subellipsoidea_Monoraphidium_neglectum_ Aplanochytrium_stocchinoi_ Eutreptiella_gymnasticalike_ Aestuariibacter_aggregatus_ Oceanicoccus_sagamiensis_ Insolitispirillum_peregrinum_ Lacimicrobium_alkaliphilum_ Chlamydomonas_reinhardtii_ Acytostelium_subglobosum_Dictyostelium_discoideumB_ Rhizoclosmatium_globosum_ Marinospirillum_alkaliphilum_ Schizochytrium_aggregatum_ Seculamonas_ecuadoriensis_ Asticcacaulis_biprosthecium_ Altererythrobacter_atlanticus_ Pseudoalteromonas_atlantica_ Oceanibacterium_hippocampi_Novosphingobium_barchaimii_ Marinobacterium_rhizophilum_ Coraliomargarita_akajimensis_ Methylacidiphilum_infernorum_

Auxenochlorella_protothecoides_ Batrachochytrium_dendrobatidis_

Pyrinomonas_methylaliphatogenes_ Chloracidobacterium_thermophilum_ Taxa Fungi Plantae Euglenids Holozoa Amoebae

Red Algae

Heterolobosea Bacteria Stramenopile

proteobacteria - 22 ⍺

23 Figure 3. Protocol_1 derived deviation scores for 76 mitochondrial proteins from diverse eukaryotes. The deviation scores for 76 proteins (y- 24 axis) for 180 taxa (x-axis) were determined using the persistent outlier detection method described in Figure 1 (Protocol_1). Deviation score 25 values are indicated by cell color intensity, with darker colors indicating higher deviation according to the scale on the right. Blue flags along the 26 x-axis indicate taxa used in Protocol_2 analyses. Species names are color coded to indicate major taxonomic groups (kingdoms / 289 27 superkingdoms / domains).

missing top 10%

Genes rest 9 0%

Hafnia_sp_

Danio_rerio_ Togula_jolla_ Cafeteria_sp_ Rosalina_sp_ Ammonia_sp_ Lotharella_sp_ Bodo_saltans_ Acrasis_kona_ Volvox_carteri_ Apis_mellifera_ Salinivibrio_sp_ Hydra_vulgaris_ Opitutus_terrae_ Guillardia_theta_ Blastocystis_sp_Chromera_velia_ Albugo_candida_ Ochromonas_sp_ Laccaria_bicolor_Ustilago_maydis_ Erythrobacter_sp_ Streptomyces_sp_ Phytophthora_sp_ Diplochlamys_sp_ Amorphus_coralli_ Naegleria_gruberi_ Corallococcus_sp_ Ananas_comosus_ Fistulifera_solaris_ Polarella_glacialis_ Andalucia_godoyi_ Gonium_pectorale_ Oxytricha_trifallax_ Stentor_coeruleus_ Exaiptasia_pallida_ Chondrus_crispus_ Besnoitia_besnoiti_ Neobodo_designis_ Kordiimonas_lacus_ Rhodella_maculata_ Chlorella_variabilis_ Palpitomonas_bilix_ Filamoeba_nolandi_ Acanthoecalike_sp_ Fluviicola_taffensis_ Oryza_brachyantha_ Aplanochytrium_sp_ Perkinsus_marinus_ Neospora_caninum_Toxoplasma_gondii_ Pacificimonas_flava_ Sphingomonas_taxi_ Cystoisospora_suis_ Trypanosoma_grayi_ Angomonas_deanei_ Priapulus_caudatus_ Cystobacter_fuscus_ Stylonychia_lemnae_ Bigelowiella_natans_Reticulomyxa_filosa_ Trypanosoma_vivax_ Rhodomonas_salina_ Strigomonas_culicis_ Trichosphaerium_sp_ Mortierella_elongata_ Salpingoeca_rosetta_ Caulobacter_henricii_ Gonapodya_prolifera_ Aliifodinibius_roseus_ Porphyra_umbilicalis_ Scrippsiella_Hangoei_ Trypanosoma_brucei_ Aspergillus_nidulans_ Monosiga_brevicollis_ Longilinea_arvoryzae_ Archangium_gephyra_ Galdieria_sulphuraria_ Aphanomyces_astaci_ Blastocystis_hominis_ Hyalangium_minutum_Myxococcus_xanthus_ Idiomarina_salinarum_ Rhodosorus_marinus_ Proteomonas_sulcata_ Trypanosoma_rangeli_ Moniliophthora_roreri_ Catenaria_anguillulae_ Gracilariopsis_chorda_ Aspergillus_aculeatus_ Trichoplax_adhaerens_ Arsenicibacter_rosenii_ Rhizamoeba_saxonica_ Chitinophaga_eiseniae_Chthoniobacter_flavus_ Stigmatella_aurantiaca_ Klebsormidium_nitens_ Physcomitrella_patens_ Phaeospirillum_fulvum_ Oceanospirillum_maris_ Vibrio_nigripulchritudo_ Vitrella_brassicaformis_ Paludisphaera_borealis_ Hemiselmis_andersenii_ Ectocarpus_siliculosus_ Echinamoeba_exudans_ Ardenticatena_maritima_ Alteromonas_australica_ Hammondia_hammondi_ Paramecium_tetraurelia_ Amphidinium_massartii_ Nematostella_vectensis_ Azospirillum_brasilense_ Marchantia_polymorpha_ Chlorarachnion_reptans_ Leishmania_braziliensis_ Eutreptiella_gymnastica_ Allomyces_macrogynus_ Branchiostoma_floridae_ Capsaspora_owczarzaki_ Cafeteria_roenbergensis_ Rhizophagus_irregularis_ Protostelium_nocturnum_ Leptomonas_pyrrhocoris_ Caenispirillum_salinarum_ Percolomonas_cosmoWS_ Spizellomyces_punctatus_ Compsopogon_coeruleus_ Selaginella_moellendorffii_ Tetrahymena_thermophila_ Dictyostelium_purpureum_ Rhodospirillum_centenum_ Cyanidioschyzon_merolae_ Elphidium_margaritaceum_ Percolomonas_cosmoAE1_ Clastostelium_recurvatum_Vermamoeba_vermiformis_Acanthamoeba_castellanii_ Stegodyphus_mimosarum_ Marinobacterium_lutimaris_ Spongospora_subterranea_ Cryptodifflugia_operculata_ Hydrocarboniphaga_effusa_ Cryptomonas_paramecium_ Thalassiosira_pseudonana_Nannochloropsis_gaditana_ Partenskyella_glossopodia_ Nitrospirillum_amazonense_ Plasmodiophora_brassicae_ Polysphondylium_pallidum_Cavostelium_apophysatum_ Skermanella_stibiiresistens_ Thalassomonas_actiniarum_ Coccomyxa_subellipsoidea_Monoraphidium_neglectum_ Aplanochytrium_stocchinoi_ Eutreptiella_gymnasticalike_ Aestuariibacter_aggregatus_ Oceanicoccus_sagamiensis_ Insolitispirillum_peregrinum_ Lacimicrobium_alkaliphilum_ Chlamydomonas_reinhardtii_ Acytostelium_subglobosum_Dictyostelium_discoideumB_ Rhizoclosmatium_globosum_ Marinospirillum_alkaliphilum_ Schizochytrium_aggregatum_ Seculamonas_ecuadoriensis_ Asticcacaulis_biprosthecium_ Altererythrobacter_atlanticus_ Pseudoalteromonas_atlantica_ Oceanibacterium_hippocampi_Novosphingobium_barchaimii_ Marinobacterium_rhizophilum_ Coraliomargarita_akajimensis_ Methylacidiphilum_infernorum_

Auxenochlorella_protothecoides_ Batrachochytrium_dendrobatidis_

Pyrinomonas_methylaliphatogenes_ Chloracidobacterium_thermophilum_ Taxa Jakobids Rhizaria Fungi Plantae Alveolate Euglenids Cryptista Holozoa Amoebae

Red Algae

Heterolobosea Bacteria

Green Algae Stramenopile proteobacteria - 28 ⍺

29 Figure 4. Distribution of missing data and high deviation scores across all mitoP proteins and taxa following Protocol_1 analysis. 30 Presence/absence of sequence data is indicated for 76 mitoP proteins (y-axis) for 180 taxa (x-axis) after Protocol_1 (outlier detection) masking. 31 Cell colors indicate protein presence (blue), absence (white), and whether or not the sequence is among the top 10% ranked outliers identified 32 by Protocol_1 (red). Black flags along the y- and x-axis indicate proteins and taxa, respectively, with more than 50% missing data. Species 33 names are color coded to indicate major taxonomic groups (kingdoms / superkingdoms / domains). 34 290

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 15

291 Inspection of the Protocol_1 deviation heatmap shows strong and consistent

292 phylogenetic signal across Amorphea, Plants () and most Bacteria for most or

293 all of the mito proteins (Fig. 3). Most of the Stramenopiles are also highly consistent, with the

294 exception of the two Blastocystis species, which reflects the fact that these have non-respiring

295 mitochondria (Stechmann et al. 2008). In contrast, strong outliers are seen for many proteins

296 in Alevolata, Rhizaria, Cryptista and Discoba. This is particularly striking for Amphidinium

297 massaratii and Scrippsiella hangoel (Dinoflagelata) in Alveolata, Rosalina sp. and Elphidium

298 margaritaceum () and Partenskyella glossopodia () in the Rhizaria, and

299 most of the Crpytista. As a result, two of the three major divisions of Diaphoretickes, and

300 most, if not all, of the Cryptista have substantial outlier content distributed widely if unevenly

301 throughout these data (Fig. 3). Most importantly here, the supergroup Discoba shows

302 consistently high deviation scores compared to the rest of the data, with over 60% of Discoba

303 sequences showing deviation scores ranking in the upper quartile. Thus, while Amorphea is

304 nearly completely outlier-free and even Diaphoretickes has largely outlier-free major

305 divisions, particularly Stramenopiles and Plants (red algae, green algae and land plants), no

306 major division of Discoba is largely free of outliers (Fig. 3).

307 While the excess of outliers in Discoba may be due to some unusual characteristic of

308 the taxon, Discoba is also distinguished as being is by far the least well sampled of the three

309 major divisions of eukaryotes examined here. Thus, it is likely that the consistency seen here

310 in groups such as animals, plants and fungi reflects the fact that they are molecularly

311 well-sampled and well-studied. This makes it easier to identify and avoid outlier taxa while

312 still maintaining the broad taxonomic representation needed to recover deep phylogenetic

313 branches. In fact, there is a strong correlation between breadth of taxonomic sampling in

314 GenBank and Protocol_1 deviation scores (Fig. S1). Unfortunately, avoiding outlier taxa is

315 not possible at present for Discoba, where all currently available genomic data must be used

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 16

316 to achieve any semblance of taxonomic breadth. This includes using molecularly highly

317 divergent taxa such as the kinetoplastid parasites (e.g. Trypanosoma spp.; Simpson 2006).

318

319 Data filtration – Protocol_2

320 The widely distributed phylogenetic outlier signal in Discoba, the only major

321 eukaryote division with no clearly identifiable stable core taxa (Fig. 3), was further explored

322 using the sliding window method of Protocol_2 (Fig. 2). A set of 70 stable core taxa was

323 selected based on Protocol_1 results as indicated by blue arrows along the x-axis in Figure 3.

324 These are taxa that show minimal phylogenetic variation (low deviation scores; Fig. 3) and

325 low levels of missing data (Fig. 4). For these analyses, the three major divisions of Discoba -

326 Euglenozoa, Heterolobosea and Jakobida - were tested separately against the stable core

327 taxon set.

328 Following a combination of automated and manual trimming, the 76 mitoP

329 alignments vary in length from 106 to 979 positions. Using equation 2, this gives 3-10 equal

330 length chops per protein or a total of 475 chops. Combining these chops sequentially in pairs,

331 creates 399 consecutive sliding windows with 50% overlap, (except for terminal windows)

332 After removing chops missing entirely for specific subgroups, the final tally is 298, 303 and

333 357 non-missing data chops for Jakobida, Euglenozoa and Heterolobosea, respectively.

334 Protocol_2 analyses of these data identify 341 potentially incongruent chops (PICs)

335 distributed across Discoba, of which 135 are found in Jakobida, 99 in Euglenozoa and 107 in

336 Heterolobosea (pale pink; Fig. 5). These PICs seem to be largely subgroup specific; very few

337 are universal across Discoba (Fig. 5). All proteins identified as outliers by Protocol_1

338 include at least one PIC in one of the taxa for which the protein is identified as an outlier,

339 although not necessarily all of the taxa for which the protein is an outlier (dark pink; Fig. 5).

340 However, many more PICs are found in sequences not identified as outliers.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 17

P2_Seculamonas_ecuadoriensis Seculamonas_ecuadoriensis P2_Andalucia_godoyi Andalucia_godoyi P2_Percolomonas_cosmoWS Percolomonas_cosmoWS P2_Percolomonas_cosmoAE1 Percolomonas_cosmoAE1 P2_Naegleria_gruberi Naegleria_gruberi P2_Acrasis_kona Acrasis_kona P2_Trypanosoma_vivax Trypanosoma_vivax P2_Trypanosoma_rangeli Trypanosoma_rangeli P2_Trypanosoma_grayi Trypanosoma_grayi P2_Trypanosoma_brucei Trypanosoma_brucei P2_Strigomonas_culicis Strigomonas_culicis P2_Neobodo_designis Neobodo_designis P2_Leptomonas_pyrrhocoris Leptomonas_pyrrhocoris P2_Leishmania_braziliensis Leishmania_braziliensis P2_Eutreptiella_gymnasticalike Eutreptiella_gymnasticalike P2_Eutreptiella_gymnastica Eutreptiella_gymnastica P2_Bodo_saltans Bodo_saltans P2_Angomonas_deanei Angomonas_deanei 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

P2_Seculamonas_ecuadoriensis Seculamonas_ecuadoriensis P2_Andalucia_godoyi Andalucia_godoyi Missing Data Umodified Mask P1 Mask P2 P2_Percolomonas_cosmoWS Percolomonas_cosmoWS P2_Percolomonas_cosmoAE1 Percolomonas_cosmoAE1 P2_Naegleria_gruberi Naegleria_gruberi P2_Acrasis_kona Acrasis_kona P2_Trypanosoma_vivax Trypanosoma_vivax P2_Trypanosoma_rangeli Trypanosoma_rangeli P2_Trypanosoma_grayi Trypanosoma_grayi P2_Trypanosoma_brucei Trypanosoma_brucei P2_Strigomonas_culicis Strigomonas_culicis P2_Neobodo_designis Neobodo_designis P2_Leptomonas_pyrrhocoris Leptomonas_pyrrhocoris P2_Leishmania_braziliensis Leishmania_braziliensis P2_Eutreptiella_gymnasticalike Eutreptiella_gymnasticalike P2_Eutreptiella_gymnastica Eutreptiella_gymnastica P2_Bodo_saltans Bodo_saltans P2_Angomonas_deanei Angomonas_deanei

37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76

Missing Data Umodified Mask P1 Mask P2 35

36 Figure 5. Potentially incongruent fragments of discobid mitoP proteins identified by Protocol_2. The 76 mitoP proteins were divided into 399 37 equal-size fragments (chops) and analyzed separately for the three major divisions of Discoba by assembling into 50% overlapping-windows 38 followed by phylogenetic analysis (Protocol_2). Chops consistently appearing in non-canonical phylogenetic position were tagged as potentially 39 incongruent chops (PICs). Cell colors indicate whether the chop is present (blue), absent (white), designated as an outlier by Protocol_1 (red) 40 or as a PIC by Protocol_2 (pink). Chops are delineated by light grey lines and proteins by heavy black lines. Numbers along the x-axis indicate 41 the protein partition number. Discobid taxon names are shown on the y-axis and color coded according to : Jakobida (pink), 42 Heterolobosea (red), Euglenozoa (orange). 341

342 For example, the full-length alignment for protein 40 (mtHsp70) reconstructs a

343 monophyletic Discoba, albeit weakly and with an unresolved position among eukaryotes

344 (Fig. S2 “full”), and a very similar topology is reconstructed by sliding windows 1 and 2 (Fig.

345 S2 “w1-2”). In contrast, windows 3, 4, and 5 place the Heterolobosea with the

346 (green algae and land plants; Fig. S2 “w3-5”). This relationship is particularly strongly

347 supported with the central window, window 4, which gives 88% bootstrap support for this

348 topology (Fig. S2 “w4”). Together, these data suggest that the 2 chops that make up window

349 4, which also correspond to the 5’ and 3’ halves of windows 3 and 5, respectively, have a

350 different origin from the rest of the protein in some or all of the Heterolobosea. Importantly,

351 this anomaly is not apparent in analyses including all Discoba and the full alignment (Fig. S2

352 “full”), and as a result the protein does not register as an outlier for Heterolobosea under

353 Protocol_1 (Fig. 5).

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 18

354 A similar phenomenon is seen with protein 03 (mtHSP60). Phylogenetic analysis of

355 the full alignment places the Heterolobosea with the Euglenozoa, weakly grouped with

356 Amorphea (Fig. S3a). This is essentially compatible with the unresolved position of

357 Heterolobosea when analysed in the absence of other Discoba (Fig. S3b). However, when the

358 Heterolobosea are represented by only chop 3 of the same protein, they appear nested within

359 bacteria with very high support (96% bootstrap; Fig. S3c). This strongly suggests that this

360 portion of the gene encoding mtHSP90 has an exogenous origin for some or all of the

361 Heterolobosea.

362 From the Protocol_2 results, it would appear that nearly all discobid mitoP-encoding

363 genes carry some DNA fragments of exogeneous origin. However, it is important to note that

364 nearly half of these are “edge chops” (48.7%), meaning chops at alignment edges or

365 bordering missing data. The relatively small number of character positions in the sliding

366 windows makes Protocol_2 sensitive to stochastic errors, which is generally controlled for by

367 requiring a chop to be flagged in two contiguous windows before designating it as a PIC.

368 However, this is not possible for edge chops, which must be evaluated based only on the

369 single window in which they occur. As a result, both type I and II errors are expected to be

370 higher for edge chops. However, even with the exclusion of edge chops there remains a large

371 number of potentially incongruent fragments in many discobid mitoP genes.

372

373 Protocol_1 vs Protocol_2

374 Most incongruencies strong enough to be detected by both methods described here are

375 likely to be deleted at the SGT stage. Nonetheless, the results from the two protocols show

376 some correspondence. For example, all proteins identified as outliers by Protocol_1 include

377 some PICs for some taxa (Fig. 5). Nonetheless, there is often only a single or a few PICs per

378 outlier, and this is not always for the same taxon. At the same time, many PICs occur in

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 19

379 proteins not identified as outliers by Protocol_1. These include several proteins with multiple

380 PICs that do not appear as Protocol_1 outliers (Fig. 5). Preliminary analyses suggest that such

381 “non-outlier multi-PIC proteins” may occur when full proteins sequences have very weak

382 phylogenetic signal and thus escape detection in a multigene based method such as

383 Protocol_1 (data not shown). Alternatively, such a situation could arise if a gene carried

384 multiple PICs of different origin, such as DNA transfers from different sources, which, if

385 fragment transfers are as common as the results here suggest (Fig. 5), would not be altogether

386 unlikely.

387 Therefore, there are several reasons why it is not expected that all PICs will cause

388 their host protein to register as an outlier. On the other hand, the fact that some outliers carry

389 only one or a few PICs suggests that some small incongruent fragments may have a large

390 influence. Thus, while Protocol_1 outliers are more likely to affect a multigene phylogeny,

391 overlap between the results here suggests that both methods are effective at detecting

392 disruptive phylogenetic signal hidden from SGTs. This is reinforced by subsequent

393 phylogenomic analyses (below).

394

395 Molecular phylogeny

396 Six versions of the mitoP data were assembled for phylogenetic analysis (Fig. 6;

397 Table S1): the full data set of 76 mitoPs for 180 taxa (full data), the full data set masked by

398 Protocol_1 (full, P1 mask), the 70 stable core taxa plus Discoba used as input for Protocol_2

399 before (core_1), the same data after Protocol_2 masking (core_1, P2 mask), and a stricter

400 version of core_1 including only proteins with at least one sequence from each major

401 divisions of Discoba before (core_2) and after (core_2, P2 mask2) Protocol_2 masking.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 20

Homogeneous model Heterogeneous models raxml-ng iqtree iqtree C60+LG+G+F phylobayes LG+G (BP) LG+G (UFB) (UFB) (PMSF) gtr-cat (PP) taxa (loci_sites) full_data 83 91 80 92 N/A 180 (76_31003) full_data, P1 mask 100 100 92 92 N/A 156 (66_27143) core_1 73 75 82 90 96 70 (76_29669) core_1, P2 mask 99 100 47 85 91 70 (76_27803) core_2 93 94 52 76 100 70 (60_24681) core_2, P2 mask 99 100 67 70 90 70 (54_22031)

DSC AMO DIA Amorphea Discoba Amorphea Diaphoretickes Diaphoretickes Discoba Discoba Amorphea Diaphoretickes 43

44 Figure 6: The eukaryote root based on mitoP proteins with and without masking according to Protocol_1 (persistent outliers) or Protocol_2 45 (incongruent fragments). Six different versions of a data set of 76 conservative mitochondrial proteins were subjected to phylogenetic analysis 46 using five different method-model combinations to test the three main alternative eukaryote roots, shown at the bottom of the figure as cartoons 47 and labeled with acronyms: DSC – Discoba sister (neozoan excavate root), AMO – Amorphea sister (unikont-bikont root), and DIA – 48 Diaphoretickes sister. Support values for these roots are shown as a bar plot for six versions of the mitoP data (x-axis) analysed with five 49 different phylogenetic method - model combinations (top y-axis). Methods are standard bootstrap with RAxML-NG and LG+4G (Raxml-NG LG- 50 G) models (Kozlov et al. 2019), ultrafast bootstraps (UFB+bnni) using IQTREE with LG+4G (-bb) optimized with (-bnni) option, PMSF 51 bootstraps and UFB+bnni from IQTREE using the C60 model LG+C60+F+G (Nguyen et al. 2015; Wang et al. 2018), and posterior probabilities 52 from PhyloBayes using CAT+GTR (Laritollot et al 2012). Sizes of the mitoP data sets are shown on the right x-axis including number of taxa, 53 number of proteins (loci) and number of aligned positions (sites). The data sets are as follows: “full_data” - the original mitoP data, “full_data, 54 P1 mask” - full data filtered for outlier sequences using Protocol_1, “core_1” - mitoP sequences for stable core taxa plus Discoba used in 55 Protocol_2, “core_1, P2 mask” - core data filtered for incongruent protein fragments identified by Protocol_2, “core_2” - subset of core_1 data 56 including only proteins with at least one sequence for each subdivision of Discoba (Euglenozoa, Heterolobosea and Jakobida), and “core_2, P2 57 mask” - core_2 data with the same masking as core_1 P2 mask. N/A: Phylobayes analysis were conducted only for the masked and unmasked 58 core_1 and core_2 data due to computational limitations. 402

403 These data sets were analyzed using five different combinations of maximum

404 likelihood evolutionary models and bootstrap implementations 1) RAxML-NG with

405 LG+gamma model and standard non-parametric bootstraps (Kozlov et al. 2019), 2) IQTREE

406 with LG+gamma and UFB+bnni, and 3) IQTREE with the LG+C60+F+Gamma site specific

407 (mixture) model and PSMF bootstraps and 4) UFB+bnni (Nguyen et al. 2015; Wang et al.

408 2018), and 5) Bayesian inference with the CAT+GTR mixture model in PhyloBayes

409 (Lartillot et al. 2013) Due to its heavy computational demand, only the four datasets

410 associated with Protocol_2 were analyzed by PhyloBayes.

411 The resulting trees are largely compatible, reproducing all well-established major

412 divisions of eukaryotes with full support (Figs. S4-S7). The main conflicts among the trees

413 are the position of the root, and two contentious branches within Diaphoretickes (see below).

414 For the root, i.e., the branching order among the three main divisions of eukaryotes, all

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 21

415 analyses using homogeneous or “static” models support a grouping of Amorphea and

416 Diaphoretickes to the exclusion of Discoba. Support for this “neozoan-excavate” root reaches

417 99-100% bootstrap support (BP), when the data are masked according to either Protocol_1 or

418 2 (Fig. 6, Fig. S5-S7). In contrast, analyses of these data with mixture models find highly

419 variable support for all possible resolutions of this trichotomy, although IQTree using the

420 C60 model (IQTree_C60) tends to support the neozoan-excavate root with masked data, most

421 notably data filtered according to Protocol_1 (92% BP; Fig. 6, Fig. S5).

422 Phylogenetic analysis using the two site heterogeneous mixture models appear to be

423 particularly sensitive to data set composition. For example, IQTree_C60 shows strongest

424 support for a unikont-bikont root (Amorphea sister or AMO) with the full data set (80-92%

425 BP; Fig. 6), but analyses using Protocol_1 masked data switch completely to equally strong

426 support for a neozoan-excavate root (Discoba sister or DIS) (92% BP; Fig. 6). PhyloBayes

427 CAT-GTR also supports AMO with unfiltered data with a posterior probability (PP) of 0.96

428 (Fig. 6) but then switches to a novel Diaphoretrickes sister (DIA) with masking (core_2, P2

429 mask) (0.90 PP; Fig. 6, Fig. S4).

430 However, the shift in the PhyloBayes CAT-GTR results may not be related to the P2-

431 masking, because the same method finds even higher support for the DIA root with the

432 unfiltered Core_2 data (1.0 PP; Fig. 6). This suggests that it is the composition of core_2 that

433 makes the difference for PhyloBayes CAT-GTR, and not the P2 masking. The only

434 difference between the two datasets is 16 “discobid-poor” proteins, which are present in the

435 core_1 data set and deleted from the core_2 data. These proteins comprise 17.0% of the total

436 data (5051 out of 29669 alignment columns), for which Discoba are missing in 81.3% of the

437 cells. In contrast, the core_2 data set consists only of the remaining 60 proteins for which

438 Discoba are missing in only 15.0% of cells. The result is that the core_2 data set contains

439 95.7% of all the non-gap data for Discoba, while the 16 deleted proteins carry only 4.3 % of

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 22

440 the discobid non-gap data. This, means that PhyloBayes CAT-GTR support for the AMO root

441 is found here only with the highly discobid-depauperate data of core_1.

442 The other two conflicting nodes encountered here both occur within the supergroup

443 Diaphoretickes. These are the branching order among Stramenopiles, Rhizaria and

444 (aka SAR) and the phylogenetic position of Cryptista (Fig. S5-S7). All analyses here tend to

445 support a sister group relationship between Stramenopiles and Rhizaria to the exclusion of

446 Alveolates, particularly after Protocol_1 filtering (88-94% BP; Fig. S5). The only exception

447 to this is IQ-Tree_C60 analysis of the full unfiltered data, which places Stramenopiles and

448 Alveolates as sister taxa (61-86% BP, data not shown). With regard to the Cryptista, all

449 analyses with the full mitoP data reconstruct Cryptista as the sister group to red and green

450 plants, albeit with highly variable support (46-85% BP, data not shown). However, all

451 analyses with Protocol_1 filtered data place Cryptista as the sister to SAR (88-96% BP; Fig.

452 S5). These are important phylogenetic questions that need to be carefully dissected in their

453 own right, probably with more taxa in the case of Cryptista and possibly a more detailed

454 examination with Protocol_2 in the case of SAR. It should be noted that neither node was

455 tested with Protocol_2 filtered data as both Cryptista and Rhizaria failed the Protocol_2

456 stable core taxa criteria (Figs. 3 and 4).

457

458 Discussion

459 Two methods (protocols) have been developed to evaluate conflict and congruence in

460 phylogenomic data. Protocol_1 tests for persistent outliers at the whole gene/protein level

461 using a gene jackknifing approach (Fig. 1), while Protocol_2 assesses incongruent signal at

462 the sub-gene/protein level using a sliding-window and pre-selected core and target taxa (Fig.

463 2). The two methods are independent but complementary (Fig. 5). Phylogenetic analysis of

464 76 conservative mitochondrial proteins using the static LG-gamma model and data masked

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 23

465 according to Protocol_1 or 2 show strong and consistent support for a neozoan-excavate root

466 for eukaryotes (Discoba sister, 99-100% BP; Fig. 6). However, analyses of these data using

467 two site-specific (mixture) models yield mixed results, flipping between alternative roots

468 depending on presence/absence of masking and/or large amounts of missing data (Fig. 6).

469 Phylogenetic analysis of Protocol_2 output also reveals strong evidence of exogenous gene

470 fragments scattered across the mitoPs of Discoba, suggesting HGT from an assortment of

471 donors (Fig. 5; Figs. S2 and S3).

472

473 Two methods for congruence testing

474 The two methods presented here examine congruence at different levels. Protocol_1

475 operates at a more global scale looking for outlier signal using multigene data, while

476 Protocol_2 operates at a more local scale using gene/protein fragments. Protocol_1 ranks

477 sequences according to their tendency to disrupt phylogenetic signal in data subsets large

478 enough to produce well-resolved trees but small enough to be sensitive to deviant signals.

479 This is particularly important for detecting problems in deep phylogenetic branches, for

480 which little signal exists in SGTs. One problem with this approach, is that many proteins

481 simply have weak signal, e.g., if they have few variable sites, and this causes them to rank

482 higher because their signal is easily affected by other proteins. To correct for this, each

483 sequence-taxon data point is multiplied by the mean of the lowest 10% points (lowest

484 deviations of the 364 samples) for that protein-taxon point. This is because in a set of

485 subsamples where protein x is a member, the proportion of deviation contributed by x is

486 expected to be higher in the subsamples with lower deviation values. Therefore, the final

487 score is downweighed by the mean of the lowest 10% values, which also makes it possible

488 for all the data points to be included in calculating the final score.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 24

489 One important concern in a rooted tree is the potential presence of a substantially

490 different evolutionary process in the outgroup. In this case bacteria are expected to have high

491 internal phylogenetic variability of a different type and extent than that of the ingroup

492 (eukaryotes), due to processes such as complex patterns of HGT, which is pervasive in

493 bacteria (Juhas et al. 2009; Ke et al. 2000; Pál et al. 2005). To avoid high outgroup

494 phylogenetic noise from overwhelming an ingroup signal, analyses are conducted separately

495 for ingroup and outgroup. To factor in the changes in distance of the outgroup to the ingroup,

496 the most phylogenetically consistent ingroup taxon is included in the outgroup analysis. In

497 the case of eukaryotes this ingroup taxon is usually a or , as these include many

498 taxa with short phylogenetic branches and low amounts of missing data.

499 Protocol_2 addresses the problem of incongruence at the subgene level (mosaicism).

500 Mosaics can be difficult to detect in multigene data or even in SGTs, which suggests that

501 most such mosaic genes, if they do exist, would probably not have a strong influence on

502 multigene phylogeny. However, there could be cases in which mosaics are influential, for

503 example, if multiple small HGTs were acquired from the same or closely related sources.

504 Such a situation might occur in a cell that hosts symbionts or parasites, both of which are

505 common in microbial eukaryotes, or even simply a microbe having a favored food source. In

506 such cases, multiple small incongruent fragments could add up to a large influence once

507 combined into a multigene data set. Although preliminary analyses of Discobid mitoPs did

508 not identify such a pattern (data not shown), detecting one, if it does exist, would require

509 more detailed analyses using smaller chops and wider taxon sampling from potential donor

510 lineages.

511 Nonetheless, most persistent outliers identified by Protocol_1 are also found to

512 include PICs by Protocol_2 (Fig. 5), and both protocols appear to be effective in masking

513 discordant signal in phylogenomic data (Fig. 6). Thus, one potential advantage of Protocol_2

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 25

514 is that masking of incongruent fragments can be more precise than the usual method of

515 masking whole genes or proteins. As a result, Protocol_2 can potentially be used to

516 ameliorate the predicament of eliminating whole sequences if they look suspicious in SGTs

517 or are judged to be outliers in Protocol_1, thereby leaving more data for concatenation. For

518 example, Protocol_2 masking removes only 1866 sites or ~6% of the mitoP data, while

519 Protocol_1 masking leads to the complete deletion of 10 mitoP proteins or (15% of the data).

520 Protocol_1 performs a similar function to a number of other congruence testing

521 methods such as Conclustador and Phylo-MCOA. However, unlike Protocol_1, Conclustador

522 (Leigh et al. 2011) does not identify which taxa are incongruent for which sequence. The

523 software also is not complete and requires author support, which, to our knowledge is no

524 longer available. By contrast, Phylo-MCOA (De Vienne et al. 2012) can detect taxon-

525 partition specific incongruence. However, the method represents each partition with only a

526 single tree and without considering the statistical significance of the tree. Therefore, the

527 method can be misled by the stochastic errors often encountered in SGT topologies.

528 Of the two methods presented here, Protocol_1 is more general and easier to run. The

529 main challenge for the user is selecting a subsample size large enough to produce meaningful

530 trees, yet small enough that an outlier signal can still manifest. This value also depends on the

531 amount and distribution of missing data, since each subsample must include data from all

532 taxa. The outlier effects identified by this method are also more likely to affect the results of

533 multigene phylogeny as they are identified on the basis of their ability to impact multigene

534 trees. Protocol_2, on the other hand, requires informed decisions in order identify a specific

535 set of target taxa and a phylogenetically stable but taxonomically broad core taxon set. One

536 of the most straightforward ways to do this would be based on Protocol_1 results, as done

537 here, but of course other sources of information could be used.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 26

538 The accuracy of both Protocols is expected to increase with increased subsampling,

539 and, as with any random sampling method, the quantity of runs is more important than

540 optimizing their quality. Hence a fast but reasonably accurate phylogenetic method, such as

541 fast ML with a simple model (e.g. RAxML with static LG) is recommended for constructing

542 the trees, although much simpler methods like distance or unweighted parsimony do not seem

543 to perform well (data not shown). It should be noted that both protocols are not designed to

544 be replacements for data pre-screening with SGTs but rather aim to improve the quality of

545 SGT pre-screened data by addressing its limitations.

546

547 What are PICs

548 Protocol_2 identifies a surprisingly large number of what appear to be incongruent

549 gene fragments in the mitoP proteins of various Discoba (Fig 5), some examples of which are

550 confirmed by molecular phylogeny (e.g., Figs. S2, S3). While many discobid PICs may be

551 false positives, particularly edge PICs, these are still only roughly half of the PICs identified

552 here. Moreover, it is unlikely that all or even most edge PICs are false. Thus, it appears that a

553 substantial number of discobid mitoP-encoding genes are mosaic. Such mosaicism could

554 result from recombination of host genes with paralogs elsewhere in their genomes or with

555 externally derived gene fragments (xenologs). Studies of HGT tend to focus on whole genes,

556 whose acquisition could allow a new host to acquire a novel function, a process that appears

557 to be pervasive in bacteria (Hall et al. 2017). However, there is also the potential that, if the

558 host already has a homolog of the “invading” DNA, the host and exogenous sequences could

559 recombine and create a mosaic gene.

560 Theoretically, there should be considerable potential for random partial gene

561 replacement for the bacterial-like genes of microbial eukaryotes, as many eukaryotic

562 microbes are bacteria predators and/or host bacterial symbionts or parasites. Any of these

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 27

563 could potentially break down and release random genetic material into the host cell (Husnik

564 and McCutcheon 2018). Moreover, the largest class of bacterial-like genes in eukaryotes are

565 genes encoding proteins targeted to organelles of endosymbiotic origin (Ku et al. 2015). Such

566 mosaic gene have been found in plant mitochondrial DNA, albeit from other plants and

567 potentially mediated by viruses (Richardson & Palmer 2007). The results presented here

568 suggest that partial gene replacement is more pervasive in eukaryotes, at least for some taxa

569 and some sets of genes, in this case, nuclear genes encoding bacterial-like mitochondrial

570 proteins of Discoba (Fig. 5).

571

572 Deep phylogeny of eukaryotes based on mitoP data

573 The mitoP data used here have been the subject of three previous studies, abbreviated

574 here as DL2012, D2015 and H2014 (Derelle et al. 2015; Derelle & Lang 2012; He et al.

575 2014). Despite using partly overlapping data these studies found very different results with

576 DL2012 and D2015 supporting the unikont-bikont or Amorphea sister root and H2014 the

577 neozoan-excavate or Discoba sister root. In the current work, all proteins used in the previous

578 analyses were combined and treated equally to build a new dataset, also adding several more

579 recently sequenced taxa. These data were then objectively and systematically screened for

580 persistent outliers (Protocol_1; Fig. 1) and incongruent fragments (Protocol_2: Fig. 2) using

581 no assumption on the position of the root. Trees were then constructed from these data before

582 and after protocol-based masking and before and after removing the most discobid-depleted

583 proteins (Fig. 6).

584 The single consistent result obtained here for the eukaryote root is a neozoan-excavate

585 root, which specifies Amophea+Diaphoreticks (“neozoa”) and Discoba (“excavata”) as two

586 separate major divisions of eukaryotes. This result is fully consistent using the LG+GAMMA

587 model, which finds moderate support for this root with the full unfiltered data set (73-94%

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 28

588 BP; Fig. 6) and full support after either Protocol_1 or Protocol_2 filtering (99-100% BP; Fig.

589 6). In contrast, very inconsistent results were obtained with two mixture model-based

590 methods. Together, these switch among all three possible solutions to what is essentially an

591 unresolved trichotomy depending on presence/absence of filtering and/or large amounts of

592 discobid missing data (Fig. 6). This suggests there is a problem with applying mixture models

593 to these data. This seems particularly true for PhyloBayes CAT-GTR, also the only method-

594 model combination strongly supporting an AMO root in previous studies (DL2012, D2015).

595

596 The cost of complexity

597 Model selection has become something of a holy grail in phylogenetics, although

598 recent studies indicate that better fitting models (as judged by criteria such as AIC or BIC) do

599 not necessary produce more reliable trees with either DNA (Abadi et al 2019) or protein

600 sequences (Spielman 2020). Model complexity and error are direct trade-offs, because each

601 model parameter value is an estimate with an associated error (Huelsenbeck & Crandall

602 1997). Thus, increasingly complex models run the risk of over-fitting, whereby the model fits

603 increasingly well to a smaller subset of the data. An added complication of highly complex

604 models such as C60 and especially CAT-GTR is that they are computationally heavily

605 demanding. Therefore, these models are still largely unexplored in terms of their performance

606 with various problematic data types such as persistent outliers, varying quantities and

607 distributions of missing data (e.g., patterned missing data) or model variation across the tree

608 (heterotachy). In fact, recent work has indicated that an access of categories in mixture

609 models can be problematic (Li et al. bioRxiv).

610 In the case of the mitoP data, the inconsistency of results using mixture models is

611 particularly apparent when contrasted between before and after protocol-based masking. As

612 expected, such masking increases nodal support in analyses using simpler models (Fig. 6).

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 29

613 However, it causes radical shifts in topology and support with both mixture models. For

614 example, IQTree_C60 finds 92% support for an AMO root with unfiltered data, but then

615 switches to 92% support for a DSC root after Protocol_1 filtering. Moreover, with core data

616 (most stable taxa plus Discoba), IQTree C60 finds all three possible roots depending on

617 masking and core group composition. PhyloBayes CAT-GTR, on the other hand may be

618 especially sensitive to missing data as it supports an AMO root with core_1 taxa, with or

619 without masking, but a DIA root with reduced missing data, again with or without masking

620 (Fig. 6).

621

622 Missing taxa

623 Many taxa are not included in these analyses presented here, including newly

624 discovered species that may represent novel major divisions of eukaryotes (Burki et al. 2020).

625 However, none of these taxa are widely sampled or have sufficient genomic data to allow

626 reliable pre-screening for phylogenomics. We chose to exclude these taxa to focus on a single

627 specific question, the branching order among Amorphea, Diaphoretickes and Discoba. Thus,

628 we cannot rule out the possibility that one or more of the excluded lineages represents a

629 closer sister lineage to Amorphea+Diaphoretickes or Discoba or a deeper branching sister

630 group to all three. We also obviously cannot rule out the possibility that one or more such

631 organisms or taxon groups will be discovered in the future, and hopefully they will be.

632 However, no addition of taxa should affect the fundamental relationship described here, that

633 Amorphea and Diaphoretickes are more closely related to each other than either is to

634 Discoba. Careful selection of data and taxa is of paramount importance to accurate

635 phylogenetic reconstruction and removing strong outlier data is crucial to improve tree

636 accuracy (Philippe et al. 2017), particularly for the deepest branches. Moreover, precisely

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 30

637 defining relationships among known major taxa will make it easier to assign novel taxa to

638 their correct position in the tree.

639

640 Conclusions

641 Two new protocols are presented for identifying incongruence in multigene data at

642 the level of whole (Protocol_1) and subgene (Protocol_2) levels. Protocol_1 is quite easy to

643 apply using scripts archived in GitHub (https://github.com/caeu/eukRoot). Protocol_2

644 requires considerably more user expertise including clearly defined hypotheses. However, it

645 can potentially detect mosaic genes, and the results reported here suggest that these are more

646 widespread in eukaryotes than previously known (Bergthorsson et al. 2003; Bock 2010;

647 Mower et al. 2004; Nielsen et al. 1998; Richardson & Palmer 2007). Most importantly,

648 application of these protocols to a data set of bacterial-like mitochondrial proteins and

649 subsequent phylogenetic analyses place the Discoba as a sister clade to Amorphea plus

650 Diaphoretickes, with no consistent support found with these data for any alternative

651 hypothesis. Probably the most important missing taxa here are the remaining “excavates”, the

652 metamonads or “de-mitochondriate excavates”, which, due to their lack of respiring

653 mitochondria also lack nearly all core mitoPs. Thus, the neozoan-excavate root is just one

654 step, albeit an important one, in defining the root of the eukaryote tree.

655

656 Acknowledgement

657 This work was supported by VR grant (Project VR 2017-04351) and Project

658 SNIC2019/3-266 from the Uppsala Multidisciplinary Center for Advanced Computational

659 Science (UPPMAX). The authors thank Lars Arvestad and Mikael Thollesson for helpful

660 comments on the manuscript.

661

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 31

662 Figure legends

663 Figure 1. Protocol_1: Estimating phylogenetic incongruence scores in multigene data by

664 random sub-sampling (jackknifing). The protocol consists of 8 main steps. a) Random

665 balanced subsampling of ’n’ sequences generates ’s’ subsamples, with sequence distribution

666 across subsamples recorded in an inclusion table. b) Maximum likelihood trees are estimated

667 for each subsample, and c) each tree is converted into a nodal distance matrix. d) Tree

668 deviations are estimated with reference to the inner mean across the ’z’ axis. e) Matrices are

669 split into ingroup and outgroup to be assessed separately, and f) a sum of deviations is

670 calculated for each taxon by summing across columns. g) Taxon-gene values are scaled by

671 multiplying by the mean of the lowest 10% values across all subsamples. h) The inclusion

672 table from step 1 is used to convert scaled values from step 7 into a single scoring matrix. In

673 an optional final step, the scoring matrix may be converted into a heatmap using additional

674 scripts. All scripts are available from https://github.com/caeu/, except for the heatmap script

675 which is available upon request.

676

677 Figure 2. Protocol_2: Sliding window assessment of congruence for protein fragments. The

678 example shown is for testing fragment incongruence in three taxa (“test taxa”), which can be

679 single taxa or clades of related taxa. A taxonomically broad set of stable taxa (“core taxa”,

680 selected e.g., based on Protocol_1) provide a stable background against which to test

681 congruence. a) Each protein is broken into 2-5 equal size fragments or “chops”, depending

682 on protein size according to equation 1. Adjacent chops are then combined pairwise to create

683 semi-overlapping windows (50% overlap). In this example, the protein is divided into 5

684 chops which combine to give 4 overlapping windows. b) Alignments for the core taxa are

685 assembled separately for each window for each of the three test taxa (labelled i-iii) resulting

686 in 12 alignments for a 4-window protein. c) Phylogenetic trees are constructed for each

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 32

687 window for each test taxon set using four method-model combinations: RAxML LG,

688 RAxML LG+gamma, IQTREE LG, IQTREE LG+gamma. A taxon-specific window is

689 flagged if its sequence occurs nested within any and the same core taxon clade in all four

690 trees. Chops that are flagged for both underlying windows in which it occurs are scored as

691 potentially incongruent chops (PICs). Scripts are available from https://github.com/caeu/.

692

693 Figure 3. Protocol_1 derived deviation scores for 76 mitochondrial proteins from diverse

694 eukaryotes. The deviation scores for 76 proteins (y-axis) for 180 taxa (x-axis) were

695 determined using the persistent outlier detection method described in Figure 1 (Protocol_1).

696 Deviation score values are indicated by cell color intensity, with darker colors indicating

697 higher deviation according to the scale on the right. Blue flags along the x-axis indicate taxa

698 used in Protocol_2 analyses. Species names are color coded to indicate major taxonomic

699 groups (kingdoms / superkingdoms / domains).

700

701 Figure 4. Distribution of missing data and high deviation scores across all mitoP proteins and

702 taxa following Protocol_1 analysis. Presence/absence of sequence data is indicated for 76

703 mitoP proteins (y-axis) for 180 taxa (x-axis) after Protocol_1 (outlier detection) masking.

704 Cell colors indicate protein presence (blue), absence (white), and whether or not the sequence

705 is among the top 10% ranked outliers identified by Protocol_1 (red). Black flags along the

706 y- and x-axis indicate proteins and taxa, respectively, with more than 50% missing data.

707 Species names are color coded to indicate major taxonomic groups (kingdoms /

708 superkingdoms / domains).

709

710 Figure 5. Potentially incongruent fragments of discobid mitoP proteins identified by

711 Protocol_2. The 76 mitoP proteins were divided into 399 equal-size fragments (chops) and

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 33

712 analyzed separately for the three major divisions of Discoba by assembling into 50%

713 overlapping-windows followed by phylogenetic analysis (Protocol_2). Chops consistently

714 appearing in non-canonical phylogenetic position were tagged as potentially incongruent

715 chops (PICs). Cell colors indicate whether the chop is present (blue), absent (white),

716 designated as an outlier by Protocol_1 (red) or as a PIC by Protocol_2 (pink). Chops are

717 delineated by light grey lines and proteins by heavy black lines. Numbers along the x-axis

718 indicate the protein partition number. Discobid taxon names are shown on the y-axis and

719 color coded according to taxonomy: Jakobida (pink), Heterolobosea (red), Euglenozoa

720 (orange).

721

722 Figure 6: The eukaryote root based on mitoP proteins with and without masking according to

723 Protocol_1 (persistent outliers) or Protocol_2 (incongruent fragments). Six different versions

724 of a data set of 76 conservative mitochondrial proteins were subjected to phylogenetic

725 analysis using five different method-model combinations to test the three main alternative

726 eukaryote roots, shown at the bottom of the figure as cartoons and labeled with acronyms:

727 DSC – Discoba sister (neozoan excavate root), AMO – Amorphea sister (unikont-bikont

728 root), and DIA – Diaphoretickes sister. Support values for these roots are shown as a bar plot

729 for six versions of the mitoP data (x-axis) analysed with five different phylogenetic

730 method - model combinations (top y-axis). Methods are standard bootstrap with RAxML-NG

731 and LG+4G (Raxml-NG LG-G) models (Kozlov et al. 2019), ultrafast bootstraps

732 (UFB+bnni) using IQTREE with LG+4G (-bb) optimized with (-bnni) option, PMSF

733 bootstraps and UFB+bnni from IQTREE using the C60 model LG+C60+F+G (Nguyen et al.

734 2015; Wang et al. 2018), and posterior probabilities from PhyloBayes using CAT+GTR

735 (Laritollot et al 2012). Sizes of the mitoP data sets are shown on the right x-axis including

736 number of taxa, number of proteins (loci) and number of aligned positions (sites). The data

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 34

737 sets are as follows: “full_data” - the original mitoP data, “full_data, P1 mask” - full data

738 filtered for outlier sequences using Protocol_1, “core_1” - mitoP sequences for stable core

739 taxa plus Discoba used in Protocol_2, “core_1, P2 mask” - core data filtered for incongruent

740 protein fragments identified by Protocol_2, “core_2” - subset of core_1 data including only

741 proteins with at least one sequence for each subdivision of Discoba (Euglenozoa,

742 Heterolobosea and Jakobida), and “core_2, P2 mask” - core_2 data with the same masking as

743 core_1 P2 mask. N/A: Phylobayes analysis were conducted only for the masked and

744 unmasked core_1 and core_2 data due to computational limitations.

745

746 References

747 Adl S.M., Bass D., Lane C.E., Lukeš J., Schoch C.L., Smirnov A., Agatha S., Berney C.,

748 Brown M.W., Burki F., Cárdenas P., Čepička I., Chistyakova L., del Campo J.,

749 Dunthorn M., Edvardsen B., Eglit Y., Guillou L., Hampl V., Heiss A.A., Hoppenrath

750 M., James T.Y., Karpov S., Kim E., Kolisko M., Kudryavtsev A., Lahr D.J.G., Lara

751 E., Le Gall L., Lynn D.H., Mann D.G., Massana i Molera R., Mitchell E.A.D.,

752 Morrow C., Park J.S., Pawlowski J.W., Powell M.J., Richter D.J., Rueckert S.,

753 Shadwick L., Shimano S., Spiegel F.W., Torruella i Cortes G., Youssef N.,

754 Zlatogursky V., Zhang Q. 2018. Revisions to the Classification, Nomenclature, and

755 Diversity of Eukaryotes. J. Eukaryot. Microbiol.:jeu.12691.

756 Baldauf S.L., Roger A.J., Wenk-Siefert I., Doolittle W.F. 2000. A -level phylogeny

757 of eukaryotes based on combined protein data. Science. 290:972–977.

758 Bapteste E., Brinkmann H., Lee J.A., Moore D.V., Sensen C.W., Gordon P., Duruflé L.,

759 Gaasterland T., Lopez P., Müller M., Philippe H. 2002. The analysis of 100 genes

760 supports the grouping of three highly divergent amoebae: Dictyostelium, Entamoeba,

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 35

761 and Mastigamoeba. Proceedings of the National Academy of Sciences of the United

762 States of America. 99:1414–1419.

763 Benson D.A., Cavanaugh M., Clark K., Karsch-Mizrachi I., Lipman D.J., Ostell J., Sayers

764 E.W. 2013. GenBank. Nucleic Acids Research. 41:D36–D42.

765 Bergthorsson U., Adams K.L., Thomason B., Palmer J.D. 2003. Widespread horizontal

766 transfer of mitochondrial genes in flowering plants. Nature. 424:197–201.

767 Bock R. 2010. The give-and-take of DNA: horizontal gene transfer in plants. Trends in Plant

768 Science. 15:11–22.

769 Brown M.W., Heiss A.A., Kamikawa R., Inagaki Y., Yabuki A., Tice A.K., Shiratori T.,

770 Ishida K.-I., Hashimoto T., Simpson A.G.B., Roger A.J. 2018. Phylogenomics Places

771 Orphan Protistan Lineages in a Novel Eukaryotic Super-Group. Genome Biology and

772 Evolution. 10:427–433.

773 Brueckner J., Martin W.F. 2020. Bacterial Genes Outnumber Archaeal Genes in Eukaryotic

774 Genomes. Genome Biology and Evolution. 12:282–292.

775 Burki F., Kaplan M., Tikhonenkov D.V., Zlatogursky V., Minh B.Q., Radaykina L.V.,

776 Smirnov A., Mylnikov A.P., Keeling P.J. 2016. Untangling the early diversification of

777 eukaryotes: A phylogenomic study of the evolutionary origins of centrohelida,

778 haptophyta and cryptista. Proceedings of the Royal Society B: Biological Sciences.

779 283:20152802.

780 Burki F., Roger A.J., Brown M.W., Simpson A.G.B. 2020. The New Tree of Eukaryotes.

781 Trends in Ecology and Evolution. 35:43–55.

782 Capella-Gutiérrez S., Silla-Martínez J.M., Gabaldón T. 2009. trimAl: A tool for automated

783 alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 25:1972–

784 1973.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 36

785 Cotton J.A., McInerney J.O. 2010. Eukaryotic genes of archaebacterial origin are more

786 important than the more numerous eubacterial genes, irrespective of function.

787 Proceedings of the National Academy of Sciences of the United States of America.

788 107:17252–17255.

789 De Vienne D.M., Ollier S., Aguileta G. 2012. Phylo-MCOA: A fast and efficient method to

790 detect outlier genes and species in phylogenomics using multiple co-inertia analysis.

791 Molecular Biology and Evolution. 29:1587–1598.

792 Derelle R., Lang B.F. 2012. Rooting the eukaryotic tree with mitochondrial and bacterial

793 proteins. Molecular Biology and Evolution. 29:1277–1289.

794 Derelle R., Torruella G., Klimeš V., Brinkmann H., Kim E., Vlček Č., Lang B.F., Eliáš M.

795 2015. Bacterial proteins pinpoint a single eukaryotic root. Proceedings of the National

796 Academy of Sciences of the United States of America. 112:E693–E699.

797 Fey P., Dodson R.J., Basu S., Chisholm R.L. 2013. One Stop Shop for Everything

798 Dictyostelium: dictyBase and the Dicty Stock Center in 2012. Methods in Molecular

799 Biology. Humana Press, Totowa, NJ. p. 59–92.

800 Fu C.J., Sheikh S., Miao W., Andersson S.G.E., Baldauf S.L. 2014. Missing genes, multiple

801 ORFs, and C-to-U type RNA editing in Acrasis kona (Heterolobosea, Excavata)

802 mitochondrial DNA. Genome Biology and Evolution. 6:2240–2257.

803 Gabaldón T. 2018. Relative timing of mitochondrial endosymbiosis and the “pre-

804 mitochondrial symbioses” hypothesis: RELATIVE TIMING OF MITOCHONDRIAL

805 SYMBIOSIS. IUBMB Life. 70:1188–1196.

806 Gray M.W. 2012. Mitochondrial evolution. Cold Spring Harbor Perspectives in Biology. 4.

807 Hall J.P.J., Brockhurst M.A., Harrison E. 2017. Sampling the mobile gene pool: innovation

808 via horizontal gene transfer in bacteria. Philosophical Transactions of the Royal

809 Society B: Biological Sciences. 372:20160424.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 37

810 He D., Fiz-Palacios O., Fu C.J., Tsai C.C., Baldauf S.L. 2014. An alternative root for the

811 eukaryote tree of life. Current Biology. 24:465–470.

812 He D., Sierra R., Pawlowski J., Baldauf S.L. 2016. Reducing long-branch effects in multi-

813 protein data uncovers a close relationship between Alveolata and Rhizaria. Molecular

814 Phylogenetics and Evolution. 101:1–7.

815 Hjort K., Goldberg A.V., Tsaousis A.D., Hirt R.P., Embley T.M. 2010. Diversity and

816 reductive evolution of mitochondria among microbial eukaryotes. Phil. Trans. R. Soc.

817 B. 365:713–727.

818 Hoang D.T., Chernomor O., von Haeseler A., Minh B.Q., Vinh L.S. 2018. UFBoot2:

819 Improving the Ultrafast Bootstrap Approximation. Molecular biology and evolution.

820 Molecular Biology and Evolution. 35:518–522.

821 Huelsenbeck J.P., Bull J.J., Cunningham C.W. 1996. Combining data in phylogenetic

822 analysis. Trends in Ecology and Evolution. 11:152–158.

823 Huelsenbeck J.P., Crandall K.A. 1997. Phylogeny Estimation and Hypothesis Testing Using

824 Maximum Likelihood. Annual Review of Ecology and Systematics. 28:437–466.

825 Husnik F., McCutcheon J.P. 2018. Functional horizontal gene transfer from bacteria to

826 eukaryotes. Nature Reviews Microbiology. 16:67–79.

827 Juhas M., Van Der Meer J.R., Gaillard M., Harding R.M., Hood D.W., Crook D.W. 2009.

828 Genomic islands: Tools of bacterial horizontal gene transfer and evolution. FEMS

829 Microbiology Reviews. 33:376–393.

830 Kamikawa R., Kolisko M., Nishimura Y., Yabuki A., Brown M.W., Ishikawa S.A., Ishida

831 K.I., Roger A.J., Hashimoto T., Inagaki Y. 2014. Gene content evolution in discobid

832 mitochondria deduced from the phylogenetic position and complete mitochondrial

833 genome of Tsukubamonas globosa. Genome Biology and Evolution. 6:306–315.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 38

834 Kang S., Tice A.K., Spiegel F.W., Silberman J.D., Pánek T., Cepicka I., Kostka M.,

835 Kosakyan A., Alcântara D.M.C., Roger A.J., Shadwick L.L., Smirnov A.,

836 Kudryavtsev A., Lahr D.J.G., Brown M.W. 2017. Between a Pod and a Hard Test:

837 The Deep Evolution of Amoebae. Molecular biology and evolution. 34:2258–2270.

838 Karnkowska A., Vacek V., Zubáčová Z., Treitli S.C., Petrželková R., Eme L., Novák L.,

839 Žárský V., Barlow L.D., Herman E.K., Soukal P., Hroudová M., Doležal P., Stairs

840 C.W., Roger A.J., Eliáš M., Dacks J.B., Vlček Č., Hampl V. 2016. A eukaryote

841 without a mitochondrial organelle. Current Biology. 26:1274–1284.

842 Katoh K., Standley D.M. 2013. MAFFT multiple sequence alignment software version 7:

843 Improvements in performance and usability. Molecular Biology and Evolution.

844 30:772–780.

845 Ke D., Boissinot M., Huletsky A., Picard F.J., Frenette J., Ouellette M., Roy P.H., Bergeron

846 M.G. 2000. Evidence for horizontal gene transfer in evolution of elongation factor Tu

847 in enterococci. Journal of Bacteriology. 182:6913–6920.

848 Kozlov A.M., Darriba D., Flouri T., Morel B., Stamatakis A., Wren J. 2019. RAxML-NG: A

849 fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference.

850 Bioinformatics. 35:4453–4455.

851 Ku C., Nelson-Sathi S., Roettger M., Sousa F.L., Lockhart P.J., Bryant D., Hazkani-Covo E.,

852 McInerney J.O., Landan G., Martin W.F. 2015. Endosymbiotic origin and differential

853 loss of eukaryotic genes. Nature. 524:427–432.

854 Kupczok A., Schmidt H.A., von Haeseler A. 2010. Accuracy of phylogeny reconstruction

855 methods combining overlapping gene data sets. Algorithms for Molecular Biology.

856 5:37.

857 Kurland C.G., Andersson S.G.E. 2000. Origin and Evolution of the Mitochondrial Proteome.

858 Microbiology and Molecular Biology Reviews. 64:786–820.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 39

859 Lartillot N., Rodrigue N., Stubbs D., Richer J. 2013. PhyloBayes MPI: Phylogenetic

860 Reconstruction with Infinite Mixtures of Profiles in a Parallel Environment. Syst Biol.

861 62:611–615.

862 Leigh J.W., Schliep K., Lopez P., Bapteste E. 2011. Let Them Fall Where They May:

863 Congruence Analysis in Massive Phylogenetically Messy Data Sets. Molecular

864 Biology and Evolution. 28:2773–2785.

865 Li Y., Shen X.-X., Evans B., Dunn C.W., Rokas A. 2020. Rooting the animal tree of life.

866 bioRxiv.:2020.10.27.357798.

867 Mower J.P., Stefanović S., Young G.J., Palmer J.D. 2004. Gene transfer from parasitic to

868 host plants. Nature. 432:165–166.

869 Nguyen L.T., Schmidt H.A., Von Haeseler A., Minh B.Q. 2015. IQ-TREE: A fast and

870 effective stochastic algorithm for estimating maximum-likelihood phylogenies.

871 Molecular Biology and Evolution. 32:268–274.

872 Nielsen K.M., Bones A.M., Smalla K., van Elsas J.D. 1998. Horizontal gene transfer from

873 transgenic plants to terrestrial bacteria – a rare event? FEMS Microbiol Rev. 22:79–

874 103.

875 Pál C., Papp B., Lercher M.J. 2005. Adaptive evolution of bacterial metabolic networks by

876 horizontal gene transfer. Nature Genetics. 37:1372–1375.

877 Paradis E., Schliep K. 2019. Ape 5.0: An environment for modern phylogenetics and

878 evolutionary analyses in R. Bioinformatics. 35:526–528.

879 Philippe H., Brinkmann H., Lavrov D.V., Littlewood D.T.J., Manuel M., Wörheide G.,

880 Baurain D. 2011. Resolving difficult phylogenetic questions: Why more sequences

881 are not enough. PLoS Biology. 9:e1000602.

882 Philippe H., De Vienne D.M., Ranwez V., Roure B., Baurain D., Delsuc F. 2017. Pitfalls in

883 supermatrix phylogenomics. European Journal of Taxonomy. 2017:1–25.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. AL JEWARI, C., & BALDAUF, S. L. 40

884 Price M.N., Dehal P.S., Arkin A.P. 2010. FastTree 2 - Approximately maximum-likelihood

885 trees for large alignments. PLoS ONE. 5:e9490.

886 R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna,

887 Austria

888 Richardson A.O., Palmer J.D. 2007. Horizontal gene transfer in plants. J Exp Bot. 58:1–9.

889 Roger A.J., Kolisko M., Simpson A.G. 2012. Phylogenomic Analysis. Evolution of Virulence

890 in Eukaryotic Microbes, First Edition. Edited by L. David Sibley, Barbara J. Howlett,

891 and Joseph Heitman. John Wiley & Sons, Inc.:44–69.

892 Schoch C.L., Ciufo S., Domrachev M., Hotton C.L., Kannan S., Khovanskaya R., Leipe D.,

893 Mcveigh R., O’Neill K., Robbertse B., Sharma S., Soussov V., Sullivan J.P., Sun L.,

894 Turner S., Karsch-Mizrachi I. 2020. NCBI Taxonomy: a comprehensive update on

895 curation, resources and tools. Database : the journal of biological databases and

896 curation. 2020.

897 Simpson A.G.B., Stevens J.R., Lukeš J. 2006. The evolution and diversity of kinetoplastid

898 flagellates. Trends in Parasitology. 22:168–174.

899 Smith S.A., Walker-Hale N., Walker J.F., Brown J.W. 2020. Phylogenetic Conflicts,

900 Combinability, and Deep Phylogenomics in Plants. Systematic Biology. 69:579–592

901 Stechmann A., Hamblin K., Pérez-Brocal V., Gaston D., Richmond G.S., Giezen M. van der,

902 Clark C.G., Roger A.J. 2008. Organelles in Blastocystis that Blur the Distinction

903 between Mitochondria and Hydrogenosomes. Current Biology. 18:580–585.

904 Strassert J.F.H., Jamy M., Mylnikov A.P., Tikhonenkov D.V., Burki F. 2019. New

905 phylogenomic analysis of the enigmatic phylum further resolves the

906 eukaryote tree of life. Molecular Biology and Evolution. 36:757–765.

907 Tsagris M., Papadakis M. 2018. Taking R to its limits: 70+ tips. PeerJ. 6.

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.08.438903; this version posted April 9, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. THE IMPACT OF CONGRUENCE ON THE EUKARYOTE ROOT 41

908 Wägele J.W., Letsch H., Klussmann-Kolb A., Mayer C., Misof B., Wägele H. 2009.

909 Phylogenetic support values are not necessarily informative: The case of the Serialia

910 hypothesis (a mollusk phylogeny). Frontiers in Zoology. 6:12.

911 Wang H.C., Minh B.Q., Susko E., Roger A.J. 2018. Modeling Site Heterogeneity with

912 Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic

913 Estimation. Systematic Biology. 67:216–235.

914 Youens-Clark K., Bomhoff M., Ponsero A.J., Wood-Charlson E.M., Lynch J., Choi I.,

915 Hartman J.H., Hurwitz B.L. 2019. IMicrobe: Tools and data-driven discovery

916 platform for the microbiome sciences. GigaScience. 8.