<<

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 The integrin-mediated adhesome complex, essential to multicellularity, is present in the most 2 recent common ancestor of , fungi, and amoebae. 3 4 Seungho Kang1,2, Alexander K. Tice1,2, Courtney W. Stairs3, Daniel J. G. Lahr4, Robert E. 5 Jones1,2, Matthew W. Brown1,2* 6 7 1. Department of Biological Sciences, Mississippi State University USA 8 2. Institute for Genomics, Biocomputing & Biotechnology, Mississippi State University, USA 9 3. Department of Cell and Molecular Biology, Uppsala University, Sweden. 10 4. Department of Zoology, University of São Paulo, Brazil 11 *Correspondence: Matthew W. Brown, [email protected] 12 13 Keywords: Integrin, , Protein architecture, , reductive evolution

14

15 Abstract: Integrins are transmembrane receptor proteins that activate signal transduction

16 pathways upon extracellular matrix binding. The Integrin Mediated Adhesion Complex (IMAC),

17 mediates various cell physiological process. The IMAC was thought to be an specific

18 machinery until over the last decade these complexes were discovered in , the group

19 containing animals, fungi, and several microbial lineages. Amoebozoa is the eukaryotic

20 supergroup sister to Obazoa. Even though Amoebozoa represents the closest outgroup to Obazoa,

21 little genomic-level data and attention to gene inventories has been given to the supergroup. To

22 examine the evolutionary history of the IMAC, we examine gene inventories of deeply sampled

23 set of 100+ Amoebozoa taxa, including new data from several taxa. From these robust data

24 sampled from the entire breadth of known amoebozoan , we show the presence of an

25 ancestral complex of integrin adhesion proteins that predate the evolution of the Amoebozoa. Our

26 results highlight that many of these proteins appear to have evolved earlier in eukaryote evolution

27 than previously thought. Co-option of an ancient protein complex was key to the emergence of

28 animal type multicellularity. The role of the IMAC in a unicellular context is unknown but must

29 also play a critical role for at least some unicellular organisms.

1 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

30

31 INTRODUCTION

32 Integrins are transmembrane signaling and adhesion heterodimers, made up of a-integrin (ITGA)

33 and b-integrin (ITGB) protein subunits [1]. They are important signal transduction molecules

34 across the cell membrane and associate with a set of intracellular proteins that act on the

35 cytoskeleton, together known as the integrin mediated adhesion complex (IMAC) (figure 1A) [2].

36 The intracellular component of the IMAC includes adhesome proteins that are 1) integrin bound

37 proteins that can bind to actin directly (talin, α-actinin, and filamin); 2) integrin bound proteins

38 that can bind indirectly to the cytoskeleton (integrin linked kinase (ILK), particularly interesting

39 new cysteine-histidine rich protein (PINCH), and Paxillin); and 3) non-integrin binding proteins

40 such as vinculins.

41

42 The IMAC is best known for its critical role in cellular aspects of behavior including development,

43 migration, proliferation, survival and metabolism [1,3,4]. They function by relaying messages

44 through outside/in and inside/out between the extracellular matrix and the internal cytoskeleton[5].

45

46 Since integrin proteins have been demonstrated to be absent in other complex multicellular

47 lineages of (i.e., and fungi) as well as in the , which are the

48 closest protistan relatives of Metazoa, integrins were initially believed to be an animal specific

49 evolutionary innovation. However, over the last decade, investigations into the genomic content

50 of other close protistan relatives of animals have led to the discovery of integrins and other IMAC

51 components outside of Metazoa. Integrins and complete components of the IMAC have been

52 reported outside Metazoa in unicellular , such as Capsaspora owczarzaki[6],

2 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

53 Pigoraptor spp.[7], Syssomonas multiformis[7], limacisporum[8], Creolimax

54 fragrantissima[9], [10], as well in the apusomonad Thecamonas trahens [6],

55 and breviate Pygsuia biforma[11]. Thus, the IMAC complex and other associated proteins have a

56 much more ancient evolutionary origin than previously expected, well outside of the animals, but

57 IMAC remains to be only reported the in lineage called Obazoa, which is composed of

58 Opisthokonta, Breviatea, and [11]. However, are these complexes exclusive to

59 Obazoa, or have they appeared even deeper in evolutionary history?

60

61 Amoebozoa is the eukaryotic supergroup that is sister to Obazoa, altogether grouped in the

62 [12]. Despite the relatively close evolutionary proximity of Amoebozoa to the animals

63 containing lineage, genomic-level data and relevant research into the supergroup remains sparse.

64 The predicted or inferred genomic complement present has yet to be undetermined in the last

65 common ancestor of Amoebozoa. Recent works revealed that in Amoebozoa (

66 discoideum and castellanii) crucial components of the IMAC were missing [13],

67 namely ITGA and ITGB. Was this result simply due to a lack of data or is it an evolutionary trend?

68 Here, we examine the gene repertories across the Amoebozoa supergroup to investigate the

69 evolutionary history of the adhesome associated with the integrin proteins. The absence of the

70 integrin proteins reported previously may simply be due to the great paucity of taxonomically

71 broad genomic data from this , which is a limiting factor in our knowledge of IMAC evolution.

72

73 To if the lack of IMAC is truly absence or rather unobserved due to a lack of data from a broad

74 taxonomic sampling, we took a comparative genomic/transcriptomic approach utilizing data

75 recently used to resolve the tree of Amoebozoa [14]. Here, we have found a number of predicted

3 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

76 IMAC proteins across a wide breadth of Amoebozoa and in some unexamined obazoans

77 ( (Opisthokonta [12]) and Lenisia limosa (Breviatea[15])), including integrins,

78 both ITGA and ITGB (figure 1B). Across the breadth of these taxa, we have identified a nearly

79 complete IMAC. Here we demonstrate that IMAC is not exclusive to Obazoa. Rather it was in the

80 last common ancestor of Amorphea (i.e., Obazoa + Amoebozoa), which from our interpretations

81 of the gene distributions probably had a full set of these components (figure 1C).

82

83 BACKGROUND

84 Molecular details on Integrin proteins

85 The overall architecture of both integrin proteins includes a large extracellular N-terminal ligand-

86 binding head domain and C-terminal stalk (also termed “legs”). Integrin activation is dependent

87 on the binding of a ligand (such as extracellular matrix proteins (collagen, fibronectin, and laminin)

88 [5]and the binding of divalent cations (primarily Ca2+, Mg2+, or Mn2+) on unique but well-

89 conserved cation binding motifs[16–19]. In animals, there are universally five cation binding sites

90 on the ITGA and three on the ITGB paralogs[16,17,19,20]. Intracellular integrin engagement with

91 cytoskeletal proteins in IMAC can regulate cell-signaling pathways that control differentiation,

92 survival, and motility making the integrins one of the most functionally diverse family of cell

93 adhesion molecules [5,21].

94

95 Integrin Alpha (ITGA)

96 Metazoa ITGA

97 In metazoan ITGA, which we refer herein to as “canonical ITGA proteins”, there is a β-propeller

98 made of seven blades (IPR013519)[18], an integrin leg domain (IPR013649)[18], a

4 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

99 transmembrane domain, and a C-terminal cytoplasmic tail amino acid motif (KXGFFXR -

100 IPR018184) [22](numbers are INTERPRO ids derived from INTERPROSCAN version

101 5.27.66 (IPRSCAN)[23]). These proteins are typically 700-800 amino acids in animals. The β-

102 propeller blade motif is composed of an FG-GAP domain (IPR013517), and in the final 3-4 blades

103 there is a cation binding site motif between an “FG” and the “GAP” motifs. Of note, the consensus

104 pattern of the FG-GAP domain is highly variable[20], and not necessarily those amino acids (i.e.,

105 FG and GAP)[20]. The cation binding blades of the β-propeller influences extracellular ligand

106 affinity and ligand specificity[24,25]. The cation motifs follow the consensus pattern of D/E-h-

107 D/N-x-D/N-G-h-x-D/E (h=hydrophobic residue; x=any residue)[16]. In metazoan ITGA, there are

108 three ITGA leg domains (IPR013649) located next to the ITGA head. These ITGA leg

109 (IPR013649) domains are structurally and functionally similar to immunoglobulin (Ig) fold-like

110 beta sandwich[18].

111

112 The classification of an authentic ITGA protein outside Metazoa would be a membrane docked

113 protein with an ITGA head composed of a β-propeller made of several β-propeller blades

114 (IPR013519) (figure 2), some with FG-GAP domains (IPR013517) only, and some with a cation

115 motif flanked by FG and GAP motifs (figure 4). However, we do not discard the possibility of

116 truncated ITGA because they occur in metazoans [26,27] and in the Creolimax

117 fragrantissima [9].

118

119

120 Integrin Beta (ITGB)

121 Metazoan ITGB

5 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

122 In metazoan ITGB, which we refer herein to as “canonical ITGB proteins”, there are seven

123 domains; a plexin-semaphorin-integrin (PSI) (IPR002165), a hybrid domain (described below),

124 three to four EGF modules (IPR013111), an integrin b tail subunit (IPR012896), transmembrane,

125 and signal peptide domains[18]. The metazoan hybrid domain is composed of a unique von

126 Willebrand factor A (known as ITGB “inserted” domain (or b-I) (IPR002369) and a Ig-like domain

127 (IPR013783) upstream of the b-I domain [18] (figure 3). EGF domains make up a cysteine rich

128 stalk motif (CRSM) that corresponds to a b-subunit stalk region[18]. The consensus pattern of the

129 CRSM in Opisthokonta is CXCXXCXC[6,28], in Metazoa there are typically 3-4 CRSMs. The

130 metazoan transmembrane domain consists of GXXXG membrane motif, important for

131 oligomerization of integrin proteins[29].

132

133 In Metazoa, the b-I-domain is made up of three metal ion binding sites (comprising a ca. 200

134 amino-acid region in total)[18]. These are, a metal ion dependent adhesion site (MIDAS), site

135 adjacent to MIDAS (called ADMIDAS), and a synergistic metal ion binding site (SyMBS)[16].

136 These metal ion binding sites are key regulators for all integrin-ligand interactions and require the

137 divalent cation ions such as manganese, calcium, and magnesium ions[16]. Three metal binding

138 sites have a consensus motif of DXSYSX95EX30D (at site 150 in Human ITGB1 (NCBI:P05556.2)

139 (MIDAS), SXXDDX121DX83A (at site 154 in Human ITGB1 (NCBI:P05556.2) (ADMIDAS), and

140 D/EX55NDXPXE (at site 189 in Human ITGB1 (NCBI:P05556.2) (SyMBS)[17]. The MIDAS and

141 SyMBS (the positive regulatory site for integrin activation[16]) motifs are essential for integrin-

142 ligand binding[16,18], but ADMIDAS (the negative regulatory [30] site for integrin inhibition) is

143 not absolutely necessary for integrin activation[30]. The ITGB tail subunit (IPR012896) consists

144 of two conserved motifs responsible for activating the integrin complex: the cytosolic tail motif

6 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

145 (NPXY/F - to scaffold cytoskeletal proteins) located after the transmembrane region, and the

146 ITGA interacting motif (LLXXXHDRKE – a highly electrically charged amino acid motif)

147 downstream (C-terminus) of the NPXY/F motif, which interacts with scaffolding and signaling

148 proteins[31].

149

150 The classification of an authentic ITGB protein outside Metazoa would be a membrane docked

151 protein with an ITGB head composed of a von Willebrand factor A (IPR002369) with three metal

152 ion binding sites, and cysteine rich regions at downstream (N-terminal) (figure 3). Again, we do

153 not discard the possibility of truncated ITGB since these are present in both metazoans [26,27] and

154 Creolimax fragrantissima[9].

155

156 RESULTS

157 Integrin Alpha

158 Amoebozoan ITGA: General observations on ITGA

159 Our results show ITGA homologs are found in 23 out of 113 amoebozoan transcriptomes

160 examined (figure 2). Of those 23 amoebozoans, eight have more than one ITGA paralog (figure

161 2). Amoebozoan ITGAs vary greatly in size ranging from 523-1200 amino acids in length

162 (supplemental table 1). Many of the proteins we identified are predicted to have a diverse number

163 of β-propellers from a few (3-5 b-propeller blades), to canonical (6-7 b-propeller blades), to many

164 (more than 8 b-propeller blades) and cation binding sites (1 to 5 sites). However, all amoebozoan

165 ITGA proteins have at least three β-propeller blade (IPR013517) domains (figure 2), with an FG-

166 GAP domain and FG-GAP/cation binding motif (figure 3A). The positions of the cation binding

167 motif/FG-GAP domains are seemingly distributed randomly (supplemental table 1), and not

7 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

168 necessarily in the last 3-4 blades as in canonical animal ITGA. We noticed the presence of a C-

169 terminal transmembrane anchor (IPR021157) domain in three amoebozoan ITGAs. Animal-like

170 KXGFFXKR cytoplasmic tail motif is absent in all but one amoebozoan (figure 3E; supplemental

171 table 1).

172

173 We find amoebozoan ITGAs are present in all three major clade, but interestingly these proteins

174 are not in eight amoebozoan taxa that have sequences, (Dictyostelium spp.,

175 spp., Heterostelium spp., Speleostelium caveatum, polycephalum, spp.,

176 Acanthamoeba spp., and Protostelium fungiovrum[32]). However very importantly, we were able

177 to confidently find ITGA in balamuthi (figure 2), which has a complete genome

178 available [33]. To confirm the presence of these genes in the genome, we used a BLAST-based

179 search to identify 2 ITGA genes of M. balamuthi to the genome data (CBKX00000000)

180 on NCBI (see Methods) [86]. For ITGA, there are no introns in either paralog 1

181 (CBKX010025823.1) or paralog 2 (CBKX010005624.1).

182

183 We classified ITGA into two representative types based on their predicted IPRSCAN domains,

184 cation motifs, their similarity to metazoans ITGA, and novel domains that are previously

185 unobserved in ITGA proteins.

186

187 Amoebozoa ITGA: Type I ITGA: similar architecture to Metazoa

188 For communication purposes we have classified the amoebozoan ITGA homologs as ‘Type I’ and

189 ‘Type II’ based on their similar or divergent domain complements compared to metazoan ITGA

190 sequences, respectively (figure 2). From the 23 amoebozoan taxa in which we identified an ITGA,

8 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

191 19 taxa have Type I (figure 2). One amoebozoan (Amphizonella sp. 9 paralog 2) has a Type I ITGA

192 with a domain architecture and cation motifs that are virtually identical to head region of metazoan

193 ITGA (figure 2, supplemental table 1). The rest of the amoebozoan Type I ITGAs have canonical

194 beta propellers, but they have varying numbers of cation motifs that differs from the canonical

195 ITGA proteins (supplemental table 1). Most of the cation motifs of all but four amoebozoan Type

196 I ITGAs follow the consensus metazoan amino acid pattern (figure 3A, supplemental table 1).

197 thirteen of the amoebozoan Type I ITGAs have both a predicted signal peptide and transmembrane

198 region (figure 2). The rest of the amoebozoans Type I ITGAs have only a predicted signal peptide

199 region or transmembrane region. Amoebozoa ITGA without transmembrane region may be the

200 result of endogenous soluble integrin. In a few cases we did not observe the signal peptide or the

201 transmembrane region. This may be due to truncation as we are predicting proteins from a

202 transcriptome or the product of shedding [34]. However, in the genome of Mastigamoeba

203 balamuthi all ITGA paralogs lack signal peptides and one lacks a transmembrane region (paralog

204 1).

205

206 Out of 19 Type I amoebozoan ITGA homologs, four ITGAs have an extracellular Ig-fold-like

207 (IPR013783) or cadherin-like (IPR015919) domain next to the ITGA head (figure 2). A cadherin

208 like domain (IPR015919) and an Ig-fold-like domain (IPR013783) are part of an immunoglobulin

209 fold like beta sandwich protein class, which fold in a similar structure[35]. The three amoebozoan

210 ITGAs that contain either an Ig-fold-like or a cadherin-like domain also have canonical b-

211 propellers (6-7 blades) (figure 2), although two of the four Type I ITGA have an extra cation motif

212 (supplemental table 1).

213

9 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

214 Amoebozoan ITGA is a Type II ITGA with novel domains

215 We designate any ITGA homolog with a domain flanking the ITGA head that has not previously

216 been reported in an ITGA protein as a “Type II” ITGA (figure 2). From the 23 amoebozoan taxa

217 in which we identified an ITGA, 12 taxa have Type II (figure 2). Of the 23 Type II amoebozoan

218 ITGAs we discovered, two have a protein kinase domain (IPR000719) (figure 2), five

219 amoebozoans have a proline rich extension motif called “extensins” (PR01217) (numbers are

220 SPRINTS ids derived from PRINTS version 42.0 [36] and INTERPROSCAN version 5.27.66

221 [23](supplemental methods)) (figure 3F; supplemental table 1).

222

223 In one Type II ITGA, one amoebozoan sequence ( citata paralog 1) uniquely has an

224 epidermal growth factor (EGF)-like domain (IPR000742) as its only novel domain located

225 downstream (C-terminal) to the ITGA head (figure 2). Additionally, three amoebozoans uniquely

226 have two epidermal growth factor (EGF) domains (figure 3B), a Thrombospondin repeat-1 (TSP1)

227 (IPR000884) (figure 3C), and a carbohydrate binding domain called a wall stress component

228 (WSC) (IPR002889) (figure 3D) all with cysteine rich motifs located upstream (N-terminal) to the

229 ITGA head (figure 2). We were unable to detect any other protein in the nr database with

230 significant similarity to the N-terminal domain composition of these sequences (EGF-TSP1-WSC)

231 using an e-value cut-off < 1e-10. This suggests that this EGF-TSP1-WSC architecture may be

232 unique to these three amoebozoans.

233

234 Seven of the amoebozoan ITGAs we designated as Type II possess canonical (6-7 blades) b-

235 propellers (figure 2) and a different number of cation motifs (supplemental table 1). The other five

236 amoebozoans Type II ITGAs have short (3-5 blades) b-propellers, and a wide range of number of

10 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

237 cation motifs (supplemental table 1). Four amoebozoan ITGA homologs of this type contained

238 both a signal peptide and a transmembrane domain (figure 2). Eight others had either a signal

239 peptide or a transmembrane region.

240

241 Obazoan ITGA: General observations on ITGA

242 We also found ITGA homologs in several underexplored obazoan transcriptomes: the opisthokont

243 Ministeria vibrans and the breviate Lenisia limosa (figure 2). Obazoan ITGA proteins have at least

244 three β-propeller blade (IPR013517) domains, with an FG-GAP domain (figure 2, supplemental

245 figure 5A) and FG-GAP/cation binding motif domains are seemingly distributed randomly

246 (supplemental table 1). Animal-like KXGFFXKR cytoplasmic tail motifs were observed in most

247 obazoan transcriptomes that were examined in this study (figure 4E, supplemental figure 5C).

248 Among the non-animal obazoans, the seven novel obazoans ITGA have an Ig-fold-like domain leg

249 next to ITGA head (figure 2; supplemental figure 5D).

250

251 Ministeria vibrans has both Type I and Type II paralogs (figure 2). In Type II, paralog 1 and 2

252 have a predicted long (10) ITGA b-propeller, but they do not contain transmembrane domains.

253 The other paralog has a few (3) b-propeller blades (one with FG-GAP/cation binding motif), a

254 cytosolic tail motif (figure 3E), and a transmembrane domain (figure 2). In both paralog 1 and 3,

255 a signal peptide region was not predicted. Ministeria vibrans Type II ITGA uniquely has an

256 epidermal growth factor (EGF) like domain (IPR000742) attached upstream (N-terminal) to the

257 ITGA head (figure 2). All obazoans in our study have a signal peptide or a transmembrane region

258 except for Pigoraptor spp., Sphaeroforma arctica, and Syssomonas multiformis (figure 2), which

259 lack both.

11 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

260

261 Phylogenetic analysis of ITGA

262 The unrooted tree of ITGA recovered all of Amoebozoa ITGA as a clade (figure 2), which is

263 generally sister to most of Obazoa. However, a few opisthokont sequences are nested in this clade.

264 These include a paralog of Capsaspora owczarzaki (ITGA4: XP_004347912), both M. vibrans

265 paralogs, Breviates and the lone Syssomonas multiformis ITGA. The major lineages or sublineages

266 of Amoebozoa were not recovered. Instead amoebozoan ITGA are divided into two clades. One

267 clade consists primarily of tubulinid amoebozoans (+ Ministeria) and the other contains the rest of

268 Amoebozoa.

269 It should be noted that in addition to these taxa, the genome sequence of the very

270 evolutionary distant Stramenopile brown alga (Ectocarpus siliculosus) revealed the presence of

271 three ITGA homologs[37] (supplemental figure 8). However, no other available Stramenopile

272 genome data or transcriptomic data searched thus far (for a list of taxa searched, please see

273 supplemental text) contains integrin homologs in our examinations. Therefore, we inferred an

274 additional ITGA tree with this species (supplemental figure 3). Through BlastP to NCBI-NR we

275 identified several predicted beta propeller proteins in that have a similar architecture to

276 the canonical ITGA. An unrooted tree of beta propeller homologs recovered Archaea as a

277 paraphyletic across our phylogeny (supplemental figure 8), this is further discussed below.

278 We observed more FG-GAP proteins in other taxa such as (e.g. Gloeobacter

279 violaceus; WP_011143221.1, Nostoc punctiforme; ACC84844.1 and Crocosphaera watsonii;

280 WP_007307935.1), Haptophyta (Emiliania huxleyi; XP_005778208.1), Crytophyta (Guillardia

281 theta; XP_005842481.1), and Stramenopiles (e.g. Nannochloropsis spp. CCMP1776: TFJ84058.1,

282 EWM21519 and Cafeteria roenbergensis; KAA0148838.1) on NCBI. These FG-GAP proteins

12 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

283 either do not possess transmembrane and signal peptide regions or possess 7 transmembrane or

284 more. Thus, in our estimation these FG-GAP proteins fail to pass our minimum criteria of being

285 an ITGA.

286

287 Integrin Beta

288 Amoebozoan ITGB: General observations on non-metazoan ITGB

289 We find that ITGB homologs are found in 19 out of 113 amoebozoan transcriptomes examined

290 with three ITGB paralogs in saxonica and two ITGB paralogs in Idionectes vortex

291 (figure 4). We find amoebozoan ITGBs are present in all three major clades of Amoebozoa.

292 However like in ITGA, these proteins are not in four amoebozoans groups that have genome

293 sequences, (the , entamoebids, acanthamoebids, and Protostelium fungiovrum[32]).

294 We were able to find ITGB gene in the genome of the amoebozoan Mastigamoeba balamuthi

295 ATCC 30984 (figure 6). For ITGB, there were five introns in the genomic scaffold

296 CBKX010020319.1 and eight introns in CBKX010020318.1. These introns are canonical M.

297 balamuthi introns and the other genes on the scaffold are mastigamoebid/amoebozoan in nature.

298 We detected a tenascin ORF on ITGB genomic scaffold of CBKX010020318.1 and PA14 ORF on

299 CBKX010020318.1. We inferred a for tenascin and PA14 by searching for

300 homologs in our extensive datasets, both of which clearly show amoebozoan signal (shown as a

301 subset in figure 6). The presence of introns within the genomic scaffold for ITGB clearly

302 demonstrates that this protein is indeed part of the genome. Additionally, the phylogenetic signal

303 of other ORFs on these genomic scaffolds show amoebozoan signal and are not from

304 contamination in the genome project.

13 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

305

306 The size range of ITGB paralogs in Amoebozoa vary widely from 595 to 3296 amino acids

307 (supplemental table 2). In addition to the novel ITGB proteins in Amoebozoa, we also found a

308 previously unreported ITGB homologs in two obazoan transcriptomes, as well in CRuMs, which

309 is a recently described clade closely related to Amorphea[38]. Based on domain and motif

310 architecture from IPRSCAN outputs, we classify non-metazoan ITGBs into two representatives;

311 ITGB that are similar to a canonical ITGB, and unique domain architecture proteins that are ITGB-

312 like.

313

314 Type I ITGB: similar architecture to Metazoa

315 Our first representative type of ITGB has a similar domain architecture to metazoan ITGB (figure

316 4). From the 19 amoebozoan taxa in which we identified an ITGB, 14 taxa have Type I (figure 4).

317 Most amoebozoans in this class appear to have the vWA domain, b-subunit stalk region and the

318 PSI domains except two amoebozoans that lacks PSI domains and one amoebozoans which possess

319 the duplicated vWA domain (figure 4). Amoebozoa have the CRSM pattern in the b-subunit stalk

320 region of DXCGXCGGX(3-6)CXGC (figure 5F) ranging from three to seven times (Supplemental

321 table 2). In of b-tail motifs (IPR012896), the cytosolic tail motif (NPXY/F) is present across

322 Amoebozoa (figure 5G) except one amoebozoan (supplemental table 2) whereas an ITGA

323 interacting motif (LLXXXHDRKE) is absent in Amoebozoa. Twelve amoebozoans in this class

324 possess amino acids with a predicted electrostatic charge where the ITGA interacting motif

325 (LLXXXHDRKE) would be present in metazoan ITGB (figure 4, figure 5H). MIDAS are well

326 conserved especially in amoebozoan b-I domains (figure 5A, B) and SyMBS are well conserved

327 across the amoebozoans (figure 5C, D). However, ADMIDAS motifs are quite variable in

14 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

328 Amoebozoa compared to the metazoan consensus pattern (figure 5B, Supplemental table 2). A

329 signal peptide and a transmembrane domain were predicted in most amoebozoan ITGB in this

330 class and three amoebozoans lacking signal peptides (figure 4).

331

332 The ITGB proteins of non-metazoan obazoans have domain/motif architectures very similar to a

333 canonical metazoan ITGB (figure 4, supplemental 6). At the B-tail motifs (IPR012896) of Obazoa,

334 the cytosolic tail motif (NPXY/F) is present in all obazoans ITGB homologs (supplemental figure

335 6G), but we did not observe the ITGA interacting motif (LLXXXHDRKE) in ITGB of Breviatea.

336 We found a metazoan GXXXG motif in three amoebozoan ITGB (supplemental table 2),

337 suggesting that this motif is not animal specific. There are multiple tandem repeats of the

338 opisthokont CRSM pattern in Obazoa ITGB and both breviate taxa have a clear expansion of b-

339 subunit stalk region as previously reported [6–11] (figure 3). All of the obazoan ITGB MIDAS,

340 ADMIDAS, SyMBS motifs were well-conserved (supplemental figure 6A-D). We also

341 investigated a possible “ ITGB gene” that was recently reported to be present in

342 the transcriptome of Didymoeca costata[39]. The protein domain architecture has a serine protease

343 flanking vWA, PSI domain. However, it is missing the CRSM, NPXY/F (IPR021157),

344 LLXXXHDRKE (IPR014836) and GXXXG motifs. To our surprise Rigifila ramosa (in CRuMs),

345 has a protein that contains all the canonical domains of ITGB such as vWA, amoebozoan CRSM

346 pattern and c-terminal motifs (figure 3) and all typical cation motifs (supplemental table 2). We

347 were unable to find an ITGA homolog in either Rigifila ramosa or Didymoeca costata.

348

349 Type two ITGB: unique architecture compared to Metazoa and solely found in the amoebozoan

350 clade Variosea

15 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

351 Our second representative class of ITGB, which we refer to as Type II, are large proteins (1405 aa

352 to 3641 aa) that have novel domain architectures unobserved before in ITGB, which we refer to as

353 Type II. Of the total 19 amoebozoan taxa that have ITGB five amoebozoans have an ITGB

354 assigned to our Type II class (figure 4). Type II amoebozoan ITGB domain architectures include

355 Laminin G3 (LamG3) (IPR013320), Von Willebrand Type D (vWD) (IPR001846), Ig-like fold

356 (IPR013783), EGF-like (IPR013032) and a cyclic nucleotide binding (IPR000595) domain (figure

357 4. A canonical human ITGB has four EGF domains [16,18] that coincide with the location of Ig

358 regions in amoebozoan ITGB. Three amoebozoan ITGB in this class have an extracellular EGF–

359 like domain at the N-terminus, at the location of the PSI domain (figure 3). LamG3 (CSM) and

360 EGF-like domain are recovered in two amoebozoan ITGB (figure 3). The cation binding motifs

361 MIDAS, ADMIDAS motifs are hypervariable in Type II representatives compared to the metazoan

362 consensus pattern, but SyMBS is well conserved except in Filamoeba nolandi (supplemental table

363 2).

364

365

366 Phylogenetic analysis of ITGB

367 We selected Homo sapiens, Gallus gallus, and Nematostella vectensis as a selective reference

368 group because integrin proteins are highly conserved among Metazoa. The tree of ITGB reveals a

369 three major clade when we rooted on Opisthokonta as an outgroup (figure 3). We observed three

370 clades of type II amoebozoan ITGB, breviates with apusomonads ITGB, and type I amoebozoan

371 with the CRuMs Rigifila ramosa ITGB. Therefore, we do not recover Amoebozoa as a

372 monophyletic group.

16 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

373 Additionally, we investigated the evolutionary history of Sib (Similar to Integrin Beta) proteins.

374 Sib proteins is a cell adhesion molecule that are similar to ITGB, originally discovered in

375 [40]. However, they do not appear to have a canonical ITGB

376 (Supplemental result). Additionally, an ITGB homolog was shown in cyanobacteria, but it

377 contained only one of the ITGB domains[6]. Therefore, we inferred an additional ITGB tree with

378 these SIBA proteins from these species (Didymoeca costata, Dictyostelium discoideum and

379 cyanobacteria) (supplemental figure 2). In this tree, a clade of Type II ITGB is placed between

380 the cyanobacteria Trichodesmium and the choanoflagellate Didymoeca. However, these apparent

381 ITGB homologs do not have obvious ITGB-stalk region or NPXY motifs. While they are

382 recovered well with other ITGB, given the important missing motifs and domains the placement

383 of these in a phylogenetic context alone does not clearly define these as ITGB.

384

385 Scaffold Cytoplasmic proteins

386 In our comparative transcriptome and genome analysis, we observed the following intracellular

387 scaffold cytoskeletal adhesomes proteins: talin, a-actinin, vinculin, PINCH, paxillin, and ILK

388 throughout Amoebozoa. These six adhesome proteins were have been reported to be present in

389 Amoebozoa[6]. Since then the number of the core consensus adhesomes were expanded[21].

390 Filamin, kindlin, and tensin are three consensus adhesome proteins and cross-actin linkers that

391 interact with ITGB[21]. We have observed two novel adhesome proteins (filamin, tensin) that were

392 not known in Amoebozoa (Supplemental figure 1). We suspect these cytoskeleton proteins may

393 be critical, and the absence of these genes in some amoebozoan taxa is probably a result low gene

394 coverage in some transcriptomes. Talin, PINCH, filamin, and Paxillin key players in the integrin

395 adhesome, were not present in seven amoebozoan transcriptomes (figure 1B), however their

17 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

396 absence is also suspected to be a artefact of low gene coverage in these data. Across the

397 amoebozoan clade, we also noticed a large number of what appear to be truncated talin compared

398 to the rest of adhesome proteins (figure 1B).

399

400 ILK is scattered across the Amoebozoa unlike in Apusomonadida and Opisthokonta[6–10]. a-

401 actinin is the only signaling protein that is present in all of the amoebozoan taxa. Consistent with

402 previous reports, we did not detect FAK-cSRC or Parvin in any of Amoebozoa or Breviatea[11].

403 Since the gene coverage for Pygsuia (Breviatea) is high and are available for Lenisia

404 limosa (Breviatea) and Entamoeba spp. (Amoebozoa), we conclude that breviates and Entamoeba

405 spp. likely do not possess vinculin proteins (supplemental figure 1). For Ectocarpus siliculosus,

406 its genome has talin, and a-actinin homologs[37], but it does not contain any other integrin-

407 signaling adhesome protein homologs.

408

409

410

411 DISCUSSION

412 Our findings show that the IMAC is not exclusive to obazoans as previously proposed[6,8–11].

413 We report a nearly full set of IMAC proteins is present throughout Amorphea including four

414 amoebozoan species (Flabellula citata, Rhizamoeba saxonica, Cryptodifflugia operculatum,

415 Mastigamoeba balamuthi, and Nebela sp.), and two previously unsurveyed obazoan taxa Lenisia

416 limosa (Breviatea) and in Ministeria vibrans (Opisthokonta). Our increased taxon sampling that

417 spans the breadth of the known diversity within Amoebozoa permitted discovery of these proteins.

418 While the integrin proteins undoubtedly played a role in the origins of animal

18 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

419 multicellularity[6,41,42], we do not find any evidence of integrins in the few multicellular lineages

420 of Amoebozoa, those which socially aggregate to form emergent fruiting body structures

421 (Copromyxa protea and the model dictyostelids). Interestingly, secondary loss of these proteins

422 seems to be rampant, as the majority of amoebozoan taxa appear to lack integrins or some of the

423 IMAC components, but the caveat is that some of these data are based on incomplete

424 transcriptomic data. Therefore, their absence may be due to the available data. None-the-less, the

425 most parsimonious explanation of the IMAC evolution is that the ancestor of Amoebozoa must

426 have contained them. An analogous scenario of secondary loss of the IMAC plays out in obazoan

427 evolution, in which we see that the closest relative of animals, the Choanoflagellata[39,43,44], as

428 well as the Nucletmycea (also referred to as the , the fungal lineage) are devoid of key

429 IMAC components (namely the integrins)[6,11].

430

431 The functional domains of ITGA in amoebozoans are similar to the canonical ITGA found in

432 animals and their close relatives like the opisthokont filastereans Capsaspora owczarzaki[6],

433 Ministeria vibrans, and Pigoraptor spp.[7]. Given the homologous nature of the functional domain

434 architecture in amoebozoan ITGA, it is likely that they function at the cellular-level in a similar

435 manner to those in animals. Therefore, we suspect the last common ancestor of Amorphea (LCAA)

436 probably already had ITGA within its genome. Of note, there are potential ITGA homologs in

437 Archaea, therefore it is likely ITGA homologs predate the eukaryotic last common ancestor, but

438 have been retained mostly in Amorphea. We also suspect that the ITGA Ig-like domains possessed

439 by some amoebozoans and obazoans are equivalent to metazoan ITGA legs originally thought to

440 be specific to metazoans[6,11]. An ITGA leg domain was probably in the ITGA of the LCAA, and

441 secondarily lost in some taxa.

19 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

442

443 Some novel functional domains in Type II representatives are integrin binding domains that are

444 present in ECM protein such as LamG[45], TSP-1[46], and EGF-like domains[47]. Could the last

445 common ancestor of Type II representatives or LCAA contain these integrin binding domains?

446 Other functional domains in amoebozoan ITGA such as protein kinase, TSP-1 together with WSC

447 and intramolecular chaperone are not found in Obazoa. Domain shuffling is an important

448 mechanism for creating a novel genes in evolution[48]. It has been suggested that many domains

449 in the ECM are found in unrelated proteins[49]. Whether these proteins were parts of the domain

450 shuffling in ITGA of the last common ancestor of Amoebozoa, which resulted in ECM evolution

451 remains to be seen and their function remains elusive. However, based on our results, it is possible

452 that we have predicted novel integrin proteins with catalytic as well as adhesive properties.

453

454 The domains of amoebozoan ITGB are very similar to the canonical metazoan ITGB and we

455 suspect that the last common ancestor of Amoebozoa retained these canonical ITGB structures

456 including the CRSM motif and three metal binding sites. The cytosolic tail motif “GFFKR” is

457 probably specific to Obazoa and there may be no interaction in the c-terminal intracellular region

458 of integrins in Amoebozoa. The presence of transmembrane and the signal peptide regions are

459 sporadic in Amorphea including metazoan, a previous study showed shedding (evolution from

460 membrane docked to a secretory function) of integrins occurs[27]. The amoebozoan ITGB

461 cysteine pattern of CGXCGGXXXXCXGC possesses more amino acids, including glycine,

462 compared to Obazoa (figure 5F). Glycine is a small non-polar amino acid[50] and due to its small

463 size, glycine positions are often highly conserved within proteins[51,52]; because, glycine residues

20 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

464 maintain the overall architecture[53]. It may be interpreted that increased glycine residues could

465 maintain the overall protein structure in amoebozoan ITGB.

466

467 The type II ITGB form an independent clade, and it is a sister to the other amoebozoans and R.

468 ramosa in figure 4. LamG3 and vWD in ITGB have never been observed in Obazoa and the

469 functions of these type II ITGB proteins are unknown. Some unicellular such as breviates,

470 Capsaspora owczarzaki, and apusomonads have a long expansion of the CRSM for ITGB [6,11].

471 We suspect that Breviatea with a long integrin legs form a long integrin heterodimerization

472 complex at the extracellular area of the cell, as both ITGA and ITGB has long legs of nearly equal

473 proportions in Pygsuia biforma (figures 2 and 3)[11].

474

475 Our prediction is that most of , which includes Acanthamoeba (again with several

476 genomes available) and many other sampled transcriptomes in our study, have lost these proteins.

477 However, one representative discosean (Mycamoeba gemmipara) has canonical ITGB homolog

478 (figure 4). Additionally, four other amoebozoan taxa with genome sequences available also appear

479 to have lost ITGB in their evolution. There are amoebozoans in which we have observed only

480 ITGA or ITGB receptors, without their counterpart integrin protein. A study by Li et al. and

481 Schneider et al. showed that homodimerization of ITGA or ITGB might occur by clustering of

482 integrin in a lipid raft[54,55]. Our results show that many of the amoebozoan species singly possess

483 either ITGA or ITGB and yet they possess all the components of signal components of IMAC.

484 These amoebozoan with singular integrins may be capable of initiating the integrin signaling

485 pathway by a similar process.

486

21 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

487 Also, the question remains that the complexity of extra domains of integrin are not species related

488 and their compositions are a single multidomain protein that is present only in Amoebozoa.

489 Phylogenetically these orthologs (i.e., amoebozoan ITGA and ITGB) form novel clades

490 independent to known organisms and most likely are ancestral to the Amoebozoa as a whole. In

491 order to further examine the overall distributions and protein architectures, genomic sequencing

492 will be necessary, and the combination of localization and gene knockout strategies are necessary

493 to understand the function of these ‘multicellularity’ proteins in unicellular taxa.

494

495 Conclusion

496 Our most comprehensive comparative genomic and transcriptomic of Amoebozoa has revealed the

497 presence of a full set of IMAC across the Amoebozoa. To date, this complex was believed to be

498 Obazoa specific, our data shows that these IMAC proteins predates Obazoa. Therefore, the last

499 common ancestor of Amorphea contains the necessary machinery of IMAC, yet it remains

500 unknown if and what functional role they play in these unicellular protists. In future work,

501 localization and functionality studies of these IMAC proteins will be investigated.

502

503 Acknowledgements

504 This project was supported in part by the United States National Science Foundation (NSF)

505 Division of Environmental Biology (DEB) grant 1456054 (http://www.nsf.gov), awarded to MWB.

506 We thank Dr. Andrew J. Roger (Dalhousie University) for advanced access to Mastigamoeba

507 balamuthi transcriptome, which was supported by grant MOP-142349 from the Canadian Institutes

508 of Health Research awarded to A.J. Roger.

509 510 FIGURE LEGENDS

22 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

511 512 Figure 1: A) Repertoire of IMAC in amoebozoan species that have either integrin alpha or beta. 513 Names of IMAC proteins are listed at the bottom and names of amoebozoan taxa are listed on side 514 of the presence/absence heat map. Arrows indicates amoebozoan species that have nearly complete 515 set of IMAC proteins. Dark blue indicates the presence of a canonical IMAC protein, white 516 indicates the absence of an IMAC protein and light blue indicates a truncated form of IMAC 517 protein. B) Phylogenetic distribution of IMAC. Circle, innovation/gain; bar, loss. Abbreviations: 518 vin, vinculin; pax, paxillin; tal, talin; par, parvin; pin, particularly interesting new cysteine- 519 histidine-rich protein; FAK, focal adhesion kinase; ILK, integrin-linked kinase; αA, alpha-actinin; 520 α, α-integrin; β, β-integrin. C) Cartoon schematic of the membrane docked IMAC, which 521 associates with intracellular actin (modified from Brown et al., 2013), color corresponds with the 522 species tree. 523 524 525 526 527 528

23 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

529 530 Figure 2: Domain architecture of ITGA from Amoebozoa. The domain architecture of ITGA for 531 amoebozoan species are shown at the right side and phylogenetic tree of Amoebozoa is shown at 532 the left side of figure. Keys to each domain’s color and shape are represented at the bottom of the 533 figure. Phylogenetic tree of amoebozoan ITGA, 160 sites phylogeny of amoebozoan ITGA rooted 534 with Obazoa ITGA. The tree was built using IQTREE v1.5.5 under the LG+C60+F+G model of 535 protein evolution. Numbers at nodes are ML bootstrap (MLBS) values derived from 1000 ultrafast 536 MLBS replicates. Any values that are below 50% are not shown. The color of branches 537 amoebozoan sequences are illustrated according to Kang et al figure 3[14]. When domains are 538 overlapped because of different software algorithms, these are represented in transparent colors. 539

24 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

ITGA Motif

A.

4

3

bits 2 I L 1 S A D V I V SS I P A N I S T I AL Y VL LS T Y GG K S R I Y G A ST V I N T I VA VK QNN L E N L A V R DC F L S V F A V P I R Q SFT L Y L L P R N K P C Q F F M R E AA A A H M AMV CG NG E V 0 D K I FA 2 3 4 5 6 7 8 9 1 G G N 10 11 12 13 14 15 16 17 18 G19 20 21 B. D D D MEME (no SSC) 18.06.2018 20:29 C. 4 4 3 3

bits 2 V LP T S Q KN S N bits 2 Q T TV T 1 L EL H L G N TD 1 LYS ST SA VE 0 A DD F KQ 2 3 4 5 6 7 8 9 1 A FR 10 11 12 13 14 15 16 0 3 6 2 5 7 4 9 8 MEME (no SSC) 18.06.2018 22:08 1 EFG GP DQ YD 10 13 12 15 11 14 C C GF DG C C CSMEME (no SSC) 07.06.2018C 21:11 D. E.W W 4 4

3 3

bits 2 VN T Q QY bits 2 EKR K 1 Q 1 K EK RL Q F HM Q VP D S S R QY K R Y F Q NE E SEQ PRQL LKNFEPD N M 0 HHEA MAM I I 2 3 4 5 6 7 8 9 1 HA M E GF 0 P E 4 8 5 9 2 6 3 7 T LA PY 1 10 11 12 13 14 15 16 17 18 19 20 10 11 MEME (no SSC) 18.06.2018 22:11 C C MEME (no SSC) 18.06.2018 20:36 NF. N DL AAT 4

3

bits 2 S T SP SP PS L TSP PS S I TS NSP S T T S N 1 L P VN N P R I V T V A P VP TV TQ L V N T P NS P T T V P MP T R T TTS R S RL T R T T L S L N I Q N N Q I S SS I AR P NLAS QT R I RQ QT I P TF T PNPSL I ST TPTS NTP 0 L I AV I FNV N GANT N I EN DPVSDLLATAAPPNRRLALSFNNS 2 3 4 5 6 7 8 9 1 S L S S I S 10 N11 12 13 14 15 16 P17 18 19 20 21 22 23 24 25 26 27 28 29 T30 31 32 33 34 35 36 37 38 39 40 P41 42 43 44 45 T46 47 48 49 T50 540 T P MEME (no SSC) 18.06.2018 21:14 541 Figure 3: Most significantly enriched motifs in 23 amoebozoans ITGA homologs, as obtained by 542 MEME-suite except for (E). (A) Consensus cation motif and FG-GAP motifs found in 23 543 amoebozoans ITGA homologs. (B-D) For figure 3B to 3D, these consensus motifs are exclusively 544 enriched in three amoebozoan ITGA homologs out of 23 amoebozoan ITGA homologs: (B) EGF- 545 like motif (C) Thrombospodin repeat 1 motif (D) Carbohydrate WSC motif. (E) GFFKR motif 546 found in three ITGA homologs from Filamoeba nolandi and two obazoans species, Ministeria 547 vibrans and Pygsuia biforma. (F) Proline-rich extensin motif found in 8 ITGA amoebozoan 548 homologs out of 23 amoebozoan ITGA homologs. 549

25 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

550

26 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

551 552 553 Figure 4: Phylogenetic tree of ITGB homologs with a focus on Amoebozoa. The domain 554 architecture of ITGB homologs are shown to the right and the phylogenetic tree of ITGB. Keys to 555 each of domains color and shape are represented in the box at the bottom of the figure. Tree of 556 amoebozoan ITGB 295amino acids sites phylogeny of amoebozoan ITGB rooted with Obazoa. 557 The tree was built using IQTREE v1.5.5 under the LG+C60+F+G model of protein evolution. 558 Numbers at nodes on the phylogenetic tree are maximum likelihood bootstrap support values 559 derived from 1000 ultrafast bootstrap replicates. Any values that are below 50% are not shown. 560 The color of branches amoebozoan sequences are illustrated according to Kang et al. figure 3[14]. 561 When domains are overlapped because of different software algorithms, these are represented in 562 transparent colors. 563

27 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

ITGB Motif

A. B. 4 4

3 3

bits 2

bits 2 D I L RT V Q T Q S K Q H 1 I F Y V F G T S FV E Y R H I F F N I P V K 1 L K S L T V R L T T A I LV S D V D M T I L V V A V N I Q V VD Q V D M A Q GD MT F A T L N L ASY I Y L SN S L P M R V ML G L R F I L M S H V SL N I L S D I F D L F E LL M K F G R N MM S A F D Y YL I A I H V E QE AM I A I AGD T C SC E T I V K AT L S I L G A M I V V C PTF N K Y I T 0 I AA LF G E 9 4 6 2 5 8 3 7 1 L E 0 12 15 18 21 24 27 30 23 11 14 17 20 25 26 28 29 31 10 13 16 19 22 4 8 5 9 2 6 3 7 P 1 S 10 S MEME (no SSC) 11.01.2018 20:59 11 D D SM P DMEME (no SSC) 24.04.2018 21:59 C. D. 4 4

3 3

bits 2 S bits 2 L Y L Y A 1 G AG NQ E GS L L RV G H D S Q 1 V T I F A P A V G S V VS T T N P K Q SL N S N W F S F P R T M ML R L P T D M Y I AG E R L D V T I S I I RC MY TW F WV N I G L D A Q A Y E L T Y K H Y Y VY I L S A Q L Q D YG F T V K MM S Q YY S WV V N T ES N A C VL P I I I N N A Y Q S A G K QQ T M F M VT I L QF N VSS W R T G D S I F RG NMY Q G A S D E I D S D L L R C Q P S SQ RK AAHD A F C I M A G A G VL V KD HN V T I RMA NA T VK I A T RK N Q AV M H L SC G SHNGHRN I A GC 0 2 3 4 5 6 7 8 9 1 PE E G Y 0 2 3 4 5 6 7 8 9 1 21 20 16 17 18 19 10 11 12 13 14 15 F E 12 13 14 15 16 17 18 G K 10 11 P E. F.

4 4

3 3 bits bits 2 2 S A N V V T LT P 1 I N 1 L E DG S G T G LY V M I S V D DQNK SA E T SR R S R G I L F A A VR I N A R S SL G I V N Q R S A T P V L I W N E F K K T N F K W C L G V G AQ R L D K Y Q Q E HE V M ST P V H Q V M H L DED L DLKGD I SD A NF NV D E N KG L D H A 0 D 2 3 4 5 6 7 8 9 0 1 I 7 3 6 2 5 9 8 4 1 V 21 16 17 18 19 20 12 13 14 15 10 11 13 10 12 15 14 G 11 G G G G.C C C D C C G C CD 4

3

bits 2 K K LDD S T T L 1 K S E K FD QQH EA V T TE T WS G G I K T Q VQN Q S A N VT NS I Y V T EE R V K T M I Y A NH I S E I N R Y A V QHG V T PQG RN Q Q V G KH Q N S F NDASGS H RI DNSFDAKFSPAQL 0 F 2 3 4 1 5 6 7 8 F9 21 16 17 18 19 20 N L 10 11 12 13 14 15 N H. P 4

3

bits 2 V S V L I MV A V F F T I A I V QE 1 V L I CV S K N V V M G K LV A V A GT E L V I A I V A HV L A I L V TG V LF SGS K EM H T L F F G D I A A I AA L L A A A D 0 L VGG T I LA AL I GL ACACAVCAV I I R A I FTQL 4 9 3 7 8 2 6 5 1 S G V 10 19 24 29 33 34 38 14 18 23 28 12 13 17 22 26 27 31 32 36 37 40 41 11 16 21 25 30 I L G I 15 20 A M 35 M 39 KE V A AD MEME (no SSC) 06.04.2018 16:56 564 565 Figure 5: Most significantly enriched motifs in 18 amoebozoans ITGB homologs, as obtained by 566 MEME-suite. (A)-(F) Three divalent cation binding motifs: MIDAS, ADMIDAS and SyMBS are 567 shown. (A) MIDAS (black arrow) and ADMIDAS (red arrow) motifs are shown. (B) A shared 568 amino acid (indicted by arrow) in MIDAS and ADMIDAS motifs are illustrated. (C) SyMBS motif 569 represented by green arrow and a shared amino acid of MIDAS motif indicated by black arrow are 570 shown. These consensus motifs are obtained from type I amoebozoan ITGB homologs sequences 571 only. (D) SyMBS motif indicted by a green arrow is shown. (E-F) Cysteine rich motifs are shown. 572 (E) PSI domain and (F) Cysteine-rich motif stalk is shown. These are recovered from type I 573 amoebozoan ITGB sequences. (G-H) Motifs found at the C-terminal are shown. (G) NPXY motif 574 shown in blue arrow. found in most amoebozoans ITGB except three amoebozoan ITGB homolog. 575 (H) An electrostatic charge motif is shown, this motif is found exclusively in type I ITGB 576 homologs. 577 578 579

28 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

580 581 Figure 6: The presence of ITGB gene in a Mastigamoeba balamuthi genome sequence. 582 Introns/exons boundaries sequences are shown along with corresponding intron size. The genome 583 contig that possess ITGB gene are colored blue and green. The protein domain architecture of 584 ITGB gene are shown along with protein size. The adjacent tenascin gene on the ITGB genome 585 contig are shown left and right along with the inferred phylogenetic tree. 586

587 Supplemental Materials

588 Supplemental Results: Details on the Sib (similar to integrin beta) proteins found in dictyostelids.

589 Supplemental Figure 1: Repertoire of IMAC including filamin and tensin in Amoebozoa.

590 Supplemental Figure 2: Repertoire of IMAC in Amoebozoa species. Supplemental Figure 3:

591 Maximum likelihood tree of ITA homolog and other cell adhesion molecules that contain beta-

592 propellers. Supplemental Figure 4: Maximum likelihood tree of ITB homolog and Sib proteins.

593 Supplemental Figure 6: Most significantly enriched motifs in Obazoan ITA, obtained by MEME-

594 suite. Supplemental Figure 7: Most significantly enriched motifs in Obazoan ITB, obtained by

595 MEME-suite. Supplemental Figure 8: Most significantly enriched motifs in Sib protein, obtained

596 by MEME.

597

598

599 Supplemental Table 1: The table shows the list of Amorphea ITGA proteins characteristic

600 including: length, beta propeller motifs, subcellular location, any extra domains attached to ITGA,

601 c-terminal motif signal peptide, transmembrane regions and type of ITGA. The order of are

29 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

602 listed in accordance to the Figure 2. The subcellular localization is displayed with likelihood value

603 in parentheses. The location of extra domains attached to ITGA are shown as the C-terminal end

604 (CTE) or at the start of N-terminal (NTB). The presence of ITGA interacting motif with ITGB is

605 shown as GFFKR and GXXXG. The location of beta propeller motifs performed by MEME-suite

606 is displayed. The presence of FG-GAP motifs are shown in asterixis. The location of beta motifs

607 that corresponds to IPRSCAN output are highlighted by parentheses. The presence of consensus

608 cation motifs are displayed in number 2 superscripts.

609

610 Supplemental Table 2: The table shows ITGB motifs, subcellular localizations, signal peptide,

611 transmembrane and extra domains attached to ITGB protein. The order of taxons are listed in

612 accordance to the Figure 3. For ITGB motifs of MIDAS, AMIDAS and SyMBS, any non-canonical

613 amino acids were highlighted in red color, cysteine rich motifs are either shown as EGF or PSI

614 domains. The presence of C-terminal motifs are shown in KLXXXD or NPXY/F and GXXXG.

615 The parentheses indicate number of motifs present in ITGB.

616

617 Supplemental Table 3: This table shows Rsem value of protists that have either a complete set of

618 IMAC, integrins with extra domains or unexpected discovery of an integrin protein. The values

619 are shown in transcripts per million (TPM).

620

621 Supplemental Table 4: A). Summary of amoebozoan ITA type I table listing: number of paralogs,

622 Beta propeller, FG-GAP and canonical motifs. The table also shows presence of signal peptide

623 and transmembrane region, c-terminal interacting motif, GXXXG motif and Ig-fold like/Cadherin

624 domain. B). Summary of amoebozoan ITA type II table listing: number of paralogs, Beta

30 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

625 propeller, FG-GAP and canonical motifs. The table also shows presence of signal peptide and

626 transmembrane region, c-terminal interacting motif, GXXXG motif and extra domain attached to

627 ITA.

628

629 REFERENCES

630 1. Hynes, R.O. (2002). Integrins: bidirectional, allosteric signaling machines. Cell 110, 673– 631 687. 632 2. LaFlamme, S.E., and Auer, K.L. (1996). Integrin signaling. Semin. Cancer Biol. 7, 111– 633 118. 634 3. Harburger, D.S., and Calderwood, D.A. (2009). Integrin signalling at a glance. J. Cell Sci. 635 122, 159–163. 636 4. Horton, E.R., Byron, A., Askari, J.A., Ng, D.H.J., Millon-Frémillon, A., Robertson, J., 637 Koper, E.J., Paul, N.R., Warwood, S., Knight, D., et al. (2015). Definition of a consensus integrin 638 adhesome and its dynamics during adhesion complex assembly and disassembly. Nat. Cell Biol. 639 17, 1577. 640 5. LaFlamme, S.E., Mathew-Steiner, S., Singh, N., Colello-Borges, D., and Nieves, B. (2018). 641 Integrin and crosstalk in the regulation of cellular processes. Cell. Mol. Sci. 75, 642 4177–4185. 643 6. Sebé-Pedrós, A., Roger, A.J., Lang, F.B., King, N., and Ruiz-Trillo, I. (2010). Ancient 644 origin of the integrin-mediated adhesion and signaling machinery. Proc. Natl. Acad. Sci. 107, 645 10142. 646 7. Hehenberger, E., Tikhonenkov, D.V., Kolisko, M., del Campo, J., Esaulov, A.S., Mylnikov, 647 A.P., and Keeling, P.J. (2017). Novel Predators Reshape Holozoan Phylogeny and Reveal the 648 Presence of a Two-Component Signaling System in the Ancestor of Animals. Curr. Biol. 27, 2043- 649 2050.e6. 650 8. Grau-Bové, X., Torruella, G., Donachie, S., Suga, H., Leonard, G., Richards, T.A., and 651 Ruiz-Trillo, I. (2017). Dynamics of genomic innovation in the unicellular ancestry of animals. 652 eLife 6, e26036. 653 9. de Mendoza, A., Suga, H., Permanyer, J., Irimia, M., and Ruiz-Trillo, I. (2015). Complex 654 transcriptional regulation and independent evolution of fungal-like traits in a relative of animals. 655 Elife 4, e08904. 656 10. Dudin, O., Ondracka, A., Grau-Bové, X., A.B. Haraldsen, A., Toyoda, A., Suga, H., Bråte, 657 J., and Ruiz-Trillo, I. (2019). A unicellular relative of animals generates an epithelium-like cell 658 layer by actomyosin-dependent cellularization. bioRxiv, 563726. 659 11. Brown, M.W., Sharpe, S.C., Silberman, J.D., Heiss, A.A., Lang, B.F., Simpson, A.G., and 660 Roger, A.J. (2013). Phylogenomics demonstrates that breviate are related to 661 opisthokonts and apusomonads. Proc. R. Soc. Lond. B Biol. Sci. 280, 20131755. 662 12. Adl, S.M., Bass, D., Lane, C.E., Lukeš, J., Schoch, C.L., Smirnov, A., Agatha, S., Berney, 663 C., Brown, M.W., Burki, F., et al. (2018). Revisions to the Classification, Nomenclature, and 664 Diversity of Eukaryotes. J. Eukaryot. Microbiol. 0. Available at: https://doi.org/10.1111/jeu.12691 665 [Accessed October 18, 2018].

31 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

666 13. Cavalier-Smith, T. (2017). Origin of animal multicellularity: precursors, causes, 667 consequences—the choanoflagellate/ transition, neurogenesis and the Cambrian explosion. 668 Philos. Trans. R. Soc. B Biol. Sci. 372. Available at: 669 http://rstb.royalsocietypublishing.org/content/372/1713/20150476.abstract. 670 14. Kang, S., Tice, A.K., Spiegel, F.W., Silberman, J.D., Pánek, T., Čepička, I., Kostka, M., 671 Kosakyan, A., Alcântara, D., and Roger, A.J. (2017). Between a pod and a hard test: the deep 672 evolution of amoebae. Mol. Biol. Evol. 673 15. Hamann, E., Gruber-Vodicka, H., Kleiner, M., Tegetmeyer, H.E., Riedel, D., Littmann, S., 674 Chen, J., Milucka, J., Viehweger, B., Becker, K.W., et al. (2016). Environmental Breviatea 675 harbour mutualistic Arcobacter epibionts. Nature 534, 254. 676 16. Zhang, K., and Chen, J. (2012). The regulation of integrin function by divalent cations. 677 Cell Adhes. Migr. 6, 20–29. 678 17. Xia, W., and Springer, T.A. (2014). Metal ion and ligand binding of integrin α5β1. Proc. 679 Natl. Acad. Sci. 111, 17863–17868. 680 18. Campbell, I.D., and Humphries, M.J. (2011). Integrin structure, activation, and interactions. 681 Cold Spring Harb. Perspect. Biol. 3, a004994. 682 19. Lee, J.-O., Rieu, P., Arnaout, M.A., and Liddington, R. Crystal structure of the A domain 683 from the a subunit of integrin CR3 (CD11 b/CD18). Cell 80, 631–638. 684 20. Xiong, J.-P., Stehle, T., Diefenbach, B., Zhang, R., Dunker, R., Scott, D.L., Joachimiak, 685 A., Goodman, S.L., and Arnaout, M.A. (2001). Crystal structure of the extracellular segment of 686 integrin αVβ3. Science 294, 339–345. 687 21. Humphries, J.D., Chastney, M.R., Askari, J.A., and Humphries, M.J. (2019). Signal 688 transduction via integrin adhesion complexes. Cell Archit. 56, 14–21. 689 22. Srichai, M.B., and Zent, R. (2010). Integrin structure and function. In Cell-Extracellular 690 Matrix Interactions in Cancer (Springer), pp. 19–41. 691 23. Finn, R.D., Attwood, T.K., Babbitt, P.C., Bateman, A., Bork, P., Bridge, A.J., Chang, H.- 692 Y., Dosztányi, Z., El-Gebali, S., Fraser, M., et al. (2017). InterPro in 2017—beyond protein family 693 and domain annotations. Nucleic Acids Res. 45, D190–D199. 694 24. Oxvig, C., and Springer, T.A. (1998). Experimental support for a beta-propeller domain in 695 integrin alpha-subunits and a calcium binding site on its lower surface. Proc. Natl. Acad. Sci. U. 696 S. A. 95, 4870–4875. 697 25. Humphries, M.J., Symonds, E.J.H., and Mould, A.P. (2003). Mapping functional residues 698 onto integrin crystal structures. Curr. Opin. Struct. Biol. 13, 236–243. 699 26. Jin, R., Trikha, M., Cai, Y., Grignon, D., and Honn, K.V. (2007). A naturally occurring 700 truncated β3 integrin in tumor cells: Native anti-integrin involved in tumor cell motility. Cancer 701 Biol. Ther. 6, 1559–1568. 702 27. Gomez, I.G., Tang, J., Wilson, C.L., Yan, W., Heinecke, J.W., Harlan, J.M., and Raines, 703 E.W. (2012). Metalloproteinase-mediated Shedding of Integrin β2 promotes macrophage efflux 704 from inflammatory sites. J. Biol. Chem. 287, 4581–4589. 705 28. Tan, S.-M., Walters, S.E., Mathew, E.C., Robinson, M.K., Drbal, K., Shaw, J.M., and Law, 706 S.K.A. (2001). Defining the repeating elements in the cysteine-rich region (CRR) of the CD18 707 integrin β2 subunit. FEBS Lett. 505, 27–30. 708 29. Russ, W.P., and Engelman, D.M. (2000). The GxxxG motif: A framework for 709 transmembrane helix-helix association11Edited by G. von Heijne. J. Mol. Biol. 296, 911–919.

32 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

710 30. Mould, A.P., Barton, S.J., Askari, J.A., Craig, S.E., and Humphries, M.J. (2003). Role of 711 ADMIDAS cation-binding site in ligand recognition by integrin α5β1. J. Biol. Chem. 278, 51622– 712 51629. 713 31. Calderwood, D.A., Fujioka, Y., de Pereda, J.M., García-Alvarez, B., Nakamoto, T., 714 Margolis, B., McGlade, C.J., Liddington, R.C., and Ginsberg, M.H. (2003). Integrin β cytoplasmic 715 domain interactions with phosphotyrosine-binding domains: A structural prototype for diversity 716 in integrin signaling. Proc. Natl. Acad. Sci. 100, 2272. 717 32. Hillmann, F., Forbes, G., Novohradská, S., Ferling, I., Riege, K., Groth, M., Westermann, 718 M., Marz, M., Spaller, T., Winckler, T., et al. (2018). Multiple Roots of Fruiting Body Formation 719 in Amoebozoa. Genome Biol. Evol. 10, 591–606. 720 33. Nývltová, E., Šuták, R., Harant, K., Šedinová, M., Hrdý, I., Pačes, J., Vlček, Č., and 721 Tachezy, J. (2013). NIF-type iron-sulfur cluster assembly system is duplicated and distributed in 722 the mitochondria and cytosol of Mastigamoeba balamuthi. Proc Natl Acad Sci U A 110, 7371– 723 7376. 724 34. Metcalf, J.A., Funkhouser-Jones, L.J., Brileya, K., Reysenbach, A.-L., and Bordenstein, 725 S.R. (2014). Antibacterial gene transfer across the tree of life. Elife 3, e04266. 726 35. Clarke, J., Cota, E., Fowler, S.B., and Hamill, S.J. (1999). Folding studies of 727 immunoglobulin-like β-sandwich proteins suggest that they share a common folding pathway. 728 Structure 7, 1145–1153. 729 36. Attwood, T.K., Bradley, P., Flower, D.R., Gaulton, A., Maudling, N., Mitchell, A.L., 730 Moulton, G., Nordle, A., Paine, K., Taylor, P., et al. (2003). PRINTS and its automatic supplement, 731 prePRINTS. Nucleic Acids Res. 31, 400–402. 732 37. Cock, J.M., Sterck, L., Rouzé, P., Scornet, D., Allen, A.E., Amoutzias, G., Anthouard, V., 733 Artiguenave, F., Aury, J.-M., Badger, J.H., et al. (2010). The Ectocarpus genome and the 734 independent evolution of multicellularity in brown . Nature 465, 617. 735 38. Brown, M.W., Heiss, A.A., Kamikawa, R., Inagaki, Y., Yabuki, A., Tice, A.K., Shiratori, 736 T., Ishida, K.-I., Hashimoto, T., Simpson, A.G.B., et al. (2018). Phylogenomics Places Orphan 737 Protistan Lineages in a Novel Eukaryotic Super-Group. Genome Biol. Evol. 10, 427–433. 738 39. Richter, D., Fozouni, P., Eisen, M., and King, N. (2017). The ancestral animal genetic 739 toolkit revealed by diverse choanoflagellate transcriptomes. bioRxiv. Available at: 740 http://biorxiv.org/content/early/2017/12/26/211789.abstract. 741 40. Cornillon, S., Gebbie, L., Benghezal, M., Nair, P., Keller, S., Wehrle-Haller, B., Charette, 742 S.J., Brückert, F., Letourneur, F., and Cosson, P. (2006). An adhesion molecule in free-living 743 Dictyostelium amoebae with integrin β features. EMBO Rep. 7, 617–621. 744 41. Suga, H., Chen, Z., de Mendoza, A., Sebé-Pedrós, A., Brown, M.W., Kramer, E., Carr, M., 745 Kerner, P., Vervoort, M., Sánchez-Pons, N., et al. (2013). The Capsaspora genome reveals a 746 complex unicellular prehistory of animals. 4, 2325. 747 42. Brunet, T., and King, N. (2017). The Origin of Animal Multicellularity and Cell 748 Differentiation. Dev. Cell 43, 124–140. 749 43. King, N., Westbrook, M.J., Young, S.L., Kuo, A., Abedin, M., Chapman, J., Fairclough, 750 S., Hellsten, U., Isogai, Y., and Letunic, I. (2008). The genome of the choanoflagellate Monosiga 751 brevicollis and the origin of metazoans. Nature 451, 783–788. 752 44. Fairclough, S.R., Chen, Z., Kramer, E., Zeng, Q., Young, S., Robertson, H.M., Begovic, 753 E., Richter, D.J., Russ, C., and Westbrook, M.J. (2013). Premetazoan genome evolution and the 754 regulation of cell differentiation in the choanoflagellate Salpingoeca rosetta. Genome Biol. 14, 755 R15.

33 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

756 45. Pulido, D., Hussain, S.-A., and Hohenester, E. (2017). Crystal Structure of the 757 Heterotrimeric Integrin-Binding Region of Laminin-111. Struct. Lond. Engl. 1993 25, 530–535. 758 46. Calzada, M.J., Annis, D.S., Zeng, B., Marcinkiewicz, C., Banas, B., Lawler, J., Mosher, 759 D.F., and Roberts, D.D. (2004). Identification of novel β1 integrin binding sites in the type 1 and 760 type 2 repeats of thrombospondin-1. J. Biol. Chem. 279, 41734–41743. 761 47. Hynes, R.O. (2012). The evolution of metazoan extracellular matrix. J. Cell Biol. 196, 671. 762 48. Babushok, D.V., Ohshima, K., Ostertag, E.M., Chen, X., Wang, Y., Mandal, P.K., Okada, 763 N., Abrams, C.S., and Kazazian, H.H., Jr (2007). A novel testis ubiquitin-binding protein gene 764 arose by exon shuffling in hominoids. Genome Res. 17, 1129–1138. 765 49. Adams, J.C. (2013). Extracellular Matrix Evolution: An Overview. In Evolution of 766 Extracellular Matrix, F. W. Keeley and R. P. Mecham, eds. (Berlin, Heidelberg: Springer Berlin 767 Heidelberg), pp. 1–25. Available at: https://doi.org/10.1007/978-3-642-36002-2_1. 768 50. Lodish, H., Darnell, J.E., Berk, A., Kaiser, C.A., Krieger, M., Scott, M.P., Bretscher, A., 769 Ploegh, H., and Matsudaira, P. (2008). Molecular cell biology (Macmillan). 770 51. Creighton, T.E. (1993). Proteins: structures and molecular properties (Macmillan). 771 52. Branden, C.I., and Tooze, J. (2012). Introduction to protein structure (Garland Science). 772 53. Jacob, J., Duclohier, H., and Cafiso, D.S. (1999). The Role of Proline and Glycine in 773 Determining the Backbone Flexibility of a Channel-Forming Peptide. Biophys. J. 76, 1367–1376. 774 54. Li, R., Mitra, N., Gratkowski, H., Vilaire, G., Litvinov, R., Nagasami, C., Weisel, J.W., 775 Lear, J.D., DeGrado, W.F., and Bennett, J.S. (2003). Activation of integrin αIIbß3 by modulation 776 of transmembrane helix associations. Science 300, 795–798. 777 55. Schneider, D., and Engelman, D.M. (2004). Involvement of transmembrane domain 778 interactions in signal transduction by α/β integrins. J. Biol. Chem. 279, 9840–9846. 779 56. Clarke, M., Lohan, A.J., Liu, B., Lagkouvardos, I., Roy, S., Zafar, N., Bertelli, C., Schilde, 780 C., Kianianmomeni, A., and Burglin, T.R. (2013). Genome of Acanthamoeba castellanii highlights 781 extensive lateral gene transfer and early evolution of tyrosine kinase signaling. Genome Biol 14. 782 57. Tice, A.K., Shadwick, L.L., Fiore-Donno, A.M., Geisen, S., Kang, S., Schuler, G.A., 783 Spiegel, F.W., Wilkinson, K.A., Bonkowski, M., Dumack, K., et al. (2016). Expansion of the 784 molecular and morphological diversity of (Centramoebida, Amoebozoa) and 785 identification of a novel life cycle type within the group. Biol. Direct 11, 69. 786 58. Urushihara, H., Kuwayama, H., Fukuhara, K., Itoh, T., Kagoshima, H., Shin-I, T., Toyoda, 787 A., Ohishi, K., Taniguchi, T., Noguchi, H., et al. (2015). Comparative genome and transcriptome 788 analyses of the social amoeba Acytostelium subglobosum that accomplishes multicellular 789 development without germ-soma differentiation. BMC Genomics 16, 80–80. 790 59. Lahr, D.J.G., Kosakyan, A., Lara, E., Mitchell, E.A.D., Morais, L., Porfirio-Sousa, A.L., 791 Ribeiro, G.M., Tice, A.K., Pánek, T., Kang, S., et al. (2019). Phylogenomics and Morphological 792 Reconstruction of Highlight Diversity of Microbial Eukaryotes in 793 the Neoproterozoic. Curr. Biol. 29, 991-1001.e3. 794 60. Tekle, Y.I., Anderson, O.R., Katz, L.A., Maurer-Alcalá, X.X., Romero, M.A.C., and 795 Molestina, R. (2016). Phylogenomics of ‘Discosea’: A new molecular phylogenetic perspective 796 on Amoebozoa with flat body forms. Mol Phylogenet Evol 99, 144–154. 797 61. Heidel, A.J., and Glöckner, G. (2008). Mitochondrial Genome Evolution in the Social 798 Amoebae. Mol. Biol. Evol. 25, 1440–1450. 799 62. Eichinger, L., Pachebat, J.A., Glockner, G., Rajandream, M.A., Sucgang, R., Berriman, M., 800 Song, J., Olsen, R., Szafranski, K., Xu, Q., et al. (2005). The genome of the social amoeba 801 Dictyostelium discoideum. Nature 435, 43–57.

34 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

802 63. Gonzales, C.M., Spencer, T.D., Pendley, S.S., and Welker, D.L. (1999). Dgp1 and Dfp1 803 Are Closely Related Plasmids in theDictyosteliumDdp2 Plasmid Family. Plasmid 41, 89–96. 804 64. Sucgang, R., Kuo, A., Tian, X., Salerno, W., Parikh, A., Feasley, C.L., Dalin, E., Tu, H., 805 Huang, E., Barry, K., et al. (2011). Comparative genomics of the social amoebae Dictyostelium 806 discoideum and . Genome Biol. 12, R20–R20. 807 65. Loftus, B., Anderson, I., Davies, R., Alsmark, U.C.M., Samuelson, J., Amedeo, P., 808 Roncaglia, P., Berriman, M., Hirt, R.P., Mann, B.J., et al. (2005). The genome of the 809 parasite . Nature 433, 865. 810 66. Ehrenkaufer, G.M., Weedall, G.D., Williams, D., Lorenzi, H.A., Caler, E., Hall, N., and 811 Singh, U. (2013). The genome and transcriptome of the enteric parasite , a 812 model for encystation. Genome Biol. 14, R77. 813 67. Tachibana, H., Yanagi, T., Pandey, K., Cheng, X.-J., Kobayashi, S., Sherchand, J.B., and 814 Kanbara, H. (2007). An Entamoeba sp. strain isolated from rhesus monkey is virulent but 815 genetically different from Entamoeba histolytica. Mol. Biochem. Parasitol. 153, 107–114. 816 68. Schaap, P., Barrantes, I., Minx, P., Sasaki, N., Anderson, R.W., Bénard, M., Biggar, K.K., 817 Buchler, N.E., Bundschuh, R., Chen, X., et al. (2015). The Genome 818 Reveals Extensive Use of Prokaryotic Two-Component and Metazoan-Type Tyrosine Kinase 819 Signaling. Genome Biol. Evol. 8, 109–125. 820 69. Heidel, A.J., Lawal, H.M., Felder, M., Schilde, C., Helps, N.R., Tunggal, B., Rivero, F., 821 John, U., Schleicher, M., and Eichinger, L. (2011). Phylogeny-wide analysis of social amoeba 822 genomes highlights ancient origins for complex intercellular communication. Genome Res. 823 70. Cavalier-Smith, T., Fiore-Donno, A.M., Chao, E., Kudryavtsev, A., Berney, C., Snell, E.A., 824 and Lewis, R. (2015). Multigene phylogeny resolves deep branching of Amoebozoa. Mol 825 Phylogenet Evol 83, 293–304. 826 71. Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large 827 sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659. 828 72. Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., and Amit, I. (2011). 829 Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat 830 Biotechnol 29. 831 73. Haas, B.J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P.D., Bowden, J., Couger, 832 M.B., Eccles, D., Li, B., Lieber, M., et al. (2013). De novo transcript sequence reconstruction from 833 RNA-seq using the Trinity platform for reference generation and analysis. Nat Protoc. 8, 1494– 834 1512. 835 74. Croning, M.D.R., Apweiler, R., and Möller, S. (2001). Evaluation of methods for the 836 prediction of membrane spanning regions. Bioinformatics 17, 646–653. 837 75. Langmead, B., and Salzberg, S.L. (2012). Fast gapped-read alignment with Bowtie 2. Nat 838 Methods 9. 839 76. Buchfink, B., Xie, C., and Huson, D.H. (2014). Fast and sensitive protein alignment using 840 DIAMOND. Nat. Methods 12, 59. 841 77. Fischer, S., Brunk, B.P., Chen, F., Gao, X., Harb, O.S., Iodice, J.B., Shanmugam, D., Roos, 842 D.S., and Stoeckert, C.J. (2002). Using OrthoMCL to Assign Proteins to OrthoMCL-DB Groups 843 or to Cluster Proteomes Into New Ortholog Groups. In Current Protocols in Bioinformatics (John 844 Wiley & Sons, Inc.). 845 78. Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for 846 Illumina sequence data. Bioinformatics 30.

35 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

847 79. Katoh, K., and Standley, D.M. (2013). MAFFT Multiple Sequence Alignment Software 848 Version 7: Improvements in Performance and Usability. Mol Biol Evol 30. 849 80. Criscuolo, A., and Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with 850 Entropy): a new software for selection of phylogenetic informative regions from multiple sequence 851 alignments. BMC Evol Biol 10. 852 81. Nguyen, L.T., Schmidt, H.A., Haeseler, A., and Minh, B.Q. (2014). IQ-TREE: A Fast and 853 Effective Stochastic Algorithm for Estimating Maximum-Likelihood Phylogenies. Mol Biol Evol 854 32. 855 82. Li, B., and Dewey, C.N. (2011). RSEM: accurate transcript quantification from RNA-Seq 856 data with or without a reference genome. BMC Bioinformatics 12. 857 83. Almagro Armenteros, J.J., Tsirigos, K.D., Sønderby, C.K., Petersen, T.N., Winther, O., 858 Brunak, S., von Heijne, G., and Nielsen, H. (2019). SignalP 5.0 improves signal peptide 859 predictions using deep neural networks. Nat. Biotechnol. 37, 420–423. 860 84. MacManes, M.D. (2017). The Oyster River Protocol: A Multi Assembler and Kmer 861 Approach For de novo Transcriptome Assembly. bioRxiv. Available at: 862 http://biorxiv.org/content/early/2017/08/16/177253.abstract. 863 85. Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, 864 W.W., and Noble, W.S. (2009). MEME Suite: tools for motif discovery and searching. Nucleic 865 Acids Res. 37, W202–W208. 866 86. Nielsen, H., Almagro Armenteros, J.J., Sønderby, C.K., Sønderby, S.K., and Winther, O. 867 (2017). DeepLoc: prediction of protein subcellular localization using deep learning. 868 Bioinformatics 33, 3387–3395. 869 87. Huerta-Cepas, J., Serra, F., and Bork, P. (2016). ETE 3: Reconstruction, Analysis, and 870 Visualization of Phylogenomic Data. Mol. Biol. Evol. 33, 1635–1638. 871 88. El-Gebali, S., Mistry, J., Bateman, A., Eddy, S.R., Luciani, A., Potter, S.C., Qureshi, M., 872 Richardson, L.J., Salazar, G.A., Smart, A., et al. (2018). The Pfam protein families database in 873 2019. Nucleic Acids Res. 47, D427–D432. 874 89. Van Dongen, S.M. (2000). Graph clustering by flow simulation. 875 90. Blandenier, Q., Seppey, C.V.W., Singer, D., Vlimant, M., Simon, A., Duckert, C., and Lara, 876 E. (2017). Mycamoeba gemmipara nov. gen., nov. sp., the First Cultured Member of the 877 Environmental Dermamoebidae Clade LKM74 and its Unusual Life Cycle. J. Eukaryot. Microbiol. 878 64, 257–265. 879 91. Picelli, S., Faridani, O.R., Bjorklund, A.K., Winberg, G., Sagasser, S., and Sandberg, R. 880 (2014). Full-length RNA-seq from single cells using Smart-seq2. Nat Protoc 9. 881 92. Onsbring, H., Tice, A.K., Barton, B.T., Brown, M.W., and Ettema, T.J.G. (2019). An 882 efficient single-cell transcriptomics workflow to assess protist diversity and lifestyle. bioRxiv, 883 782235. 884 93. Boratyn, G.M., Schäffer, A.A., Agarwala, R., Altschul, S.F., Lipman, D.J., and Madden, 885 T.L. (2012). Domain enhanced lookup time accelerated BLAST. Biol. Direct 7, 12–12. 886 94. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L.L. (2001). Predicting 887 transmembrane protein topology with a hidden markov model: application to complete 888 genomes11Edited by F. Cohen. J. Mol. Biol. 305, 567–580. 889 95. Fischer, S., Brunk, B.P., Chen, F., Gao, X., Harb, O.S., Iodice, J.B., Shanmugam, D., Roos, 890 D.S., and Stoeckert, C.J., Jr (2011). Using OrthoMCL to assign proteins to OrthoMCL-DB groups 891 or to cluster proteomes into new ortholog groups. Curr. Protoc. Bioinforma. Chapter 6, Unit- 892 6.12.19.

36 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

893 96. Xie, Y., Wu, G., Tang, J., Luo, R., Patterson, J., and Liu, S. (2014). SOAPdenovo-Trans: 894 de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 30. 895 896 897 898 899 METHODS

900

901 Method Details

902 We have strategically sampled 118 Amorphea species and their transcriptomes and genomes to

903 cover all known Amorphea subgroups. Additionally, we also searched for IMAC proteins in any

904 of the eukaryotic data in NCBI.

905

906 Culturing and Isolation of Mycamoeba gemmipara

907 Mycamoeba gemmipara was obtained from Q. Blandenier and cultured as in Blandenier et al.

908 2017[90].

909

910 Ultra-low input/ Single-Cell cDNA library construction based on Smart-Seq2

911 The cultured amoeba from Mycamoeba gemmipara were subjected to Smart-Seq2 [91] for cDNA

912 library construction as in Onsbring et al. 2019[92]. About 50 cells scraped off of a agar plate using

913 a 30-gauge platinum wire placed directly into a 200 µL thin-walled PCR tube containing the cell

914 lysis mixture [91]. Each reaction was then subjected to six freeze-thaw cycles in -80 °C

915 isopropanol and ~25 °C DI H2O respectively, before following the remainder of the protocol. A

916 modified version of Smart-Seq2 [91] was used for RNA isolation and dscDNA preparation. cDNA

917 libraries were prepared using a Nextera XT DNA Library Prep Kit (Illumina, CA) following the

37 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

918 manufacturer's protocol with dual index primers. The Mycamoeba gemmipara library was

919 sequenced using an Illumina HiSeq 4000 at Genome Quebec.

920

921 Transcriptomic Sequencing Assembly

922 Raw sequence data from the Illumina sequencing platform or from Bioproject accession number

923 PRJNA380424 [14] were subjected to TRIMMOMATIC V 0.35 [78], for cleaning and trimming of

924 poorly called reads or specific to the adaptor will be used in sequencing with the paired-end data.

925 Quality filtered and cleaned high quality reads were assembled through TRINITY [73] for de novo

926 assembly. Nucleotide sequences were translated with TRANSDECODER V 5.5.0

927 (https://github.com/TransDecoder/TransDecoder/) which also predicts open reading frames, and

928 PFAM domains [88] as well. Contigs of transcriptomes were blasted against the non-redundant

929 database of NCBI using the BLASTX program of BLAST+ suite [93]. Each predicted ORFs were

930 assigned an ortholog number through the OrthoMCL v5 [77] database. For taxa with clearly

931 truncated predicted proteins, the computationally laborious transcriptome assembly pipeline,

932 Oyster River Protocol[84], was performed in an attempt to retrieve predicted a full-length integrin

933 proteins.

934

935 Integrin Protein analysis

936 We selected a IMAC proteins from NCBI: ITGA5:P08648, ITGB1:NP_002202,

937 Talin:AAF27330, Parvin:AAH16713, PINCH:NP_060450.2, vinculin:AAH39174,

938 FAK:AAA35819 Paxillin:AAC50104 ILK:NP_001014794, a-actinin:AAC17470,

939 Filamin:AAF72339 and Tensin:AAG33700. INTERPROSCAN 5.27-66.0[23] was used to determine

940 these IMAC protein domain architecture along with SIGNALIP v 5.0 [83] and TMHMM v 2.0 [94].

38 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

941 MEME-SUITE 5.0.4 [85]was used to examine integrin motifs. DEEPLOC v 1.0[86] was used to

942 determine the subcellular localization of integrin proteins. We set the minimum criteria of

943 canonical IMAC proteins based on their architecture and motifs. We used OrthoMCL to assign

944 IMAC proteins their own ortholog numbers.

945

946 Since OrthoMCL database is heavily metazoan biased, we created an ortholog database with

947 eukaryotes listed in resources table. A modified version of OrthoMCL-DB [95] was used to create

948 a novel ortholog database using the above listed transcriptomes and genomes as well as the whole

949 OrthoMCL DB v5. To do this, an all-against-all BLAST using DIAMOND-BLASTP[76] was

950 conducted using each protein from the above data as queries. The DIAMOND-BLASTP collected up

951 to 1,000 hits with e-value of 1e-5 as a cutoff for putative homology. Blast results were clustered

952 using the OrthoMCL pipeline methodology [95].

953

954 BLASTP was used to search for novel amoebozoan IMAC proteins through the new ortholog

955 database and using the previous amoebozoan ortholog protein as a query. Using a modified custom

956 pipeline was used to collect a novel IMAC orthologs. These methods generated various orthologs

957 with a similar protein architecture. We created a FASTA file of all novel eukaryote integrin

958 proteins based on their protein architecture. Clustering of these integrin proteins was performed

959 by CD- HIT [71], which used the condition of 0.95 global sequence identity. From CD-HIT output,

960 we used DIAMOND [76] to perform All-vs-ALL blast with database size of 50 million sequences.

961 Consequently, a custom script of Markov cluster (MCL) algorithm[89] was used to cluster the

962 output of DIAMOND. A custom PYTHON script was used to eliminate any MCL clustered integrin

39 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

963 proteins that were less than 500 AA. We used a custom PYTHON pipeline to collects ortholog

964 sequences that clustered to model metazoan IMAC orthologs listed above.

965

966 Sometimes, we fail to capture integrin proteins even after all these strenuous processes. Therefore,

967 PFAM ID of Integrins was used to grab any putative integrins from TRANSDECODER output. For

968 ITGA, we used FG-GAP (PF01839), and for ITGB we used PSI_integrin (PF17205), integrin_beta

969 (PF00362), integrin_b_cyt (PF08725) and integrin B tail (PF07965). We disregarded any protein

970 sequences with PFam ID lower than e-value of 1e-10. The identity of the putative integrins were

971 confirmed by INTERPROSCAN, SIGNALIP, TMHMM, DEEPLOC, and MEME-SUITE.

972

973 For each putative integrin protein, we examined proteins motifs with MEME-SUITE, and always

974 blasted our proteins of interest using BLASTP against NCBI’s GenBank NR database to check for

975 possible contaminations. For integrin proteins with a novel domains, the read depth of these

976 transcripts were examined with RSEM. The final amoebozoan integrin architectures were compared

977 with metazoan integrin proteins. A custom PYTHON script was used to create a presence/absence

978 binary (1,0) matrix of IMAC proteins across Amoebozoa. A custom R script were used to create a

979 heatmap.

980

981 In our current study these data are transcriptomic in nature and by this virtue some of the predicted

982 transcript may be truncated or missing. However, identifying full-length transcripts from short

983 reads from Illumina technology can be difficult because assembled transcript can be fragmented

984 or incomplete due to alternative splice sites and untranslated regions [96]. Therefore, it is possible

985 that some transcripts are not fully assembled, and therefore their corresponding predicted protein

40 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

986 is truncated, thus the transmembrane region and signal peptide region may be missing. The same

987 can be said for the absence of these proteins within our data, because transcriptomes contain only

988 genes being expressed at the time mRNA was harvested some genes within an organism’s genome

989 may be missed. None-the-less, our methodology permitted the unparalleled discovery of these

990 proteins within a whole supergroup, highlighting the utility of the method.

991

992 Manually assembled contigs in SEQUENCHER

993 Where necessary due to fragmentation, contigs of our automated de novo transcriptomic

994 assemblies as above from each gene of interest were blasted (BLASTN) back to the raw nucleotide

995 data and to the assembly for each transcriptome. Hits were collected and assembled using

996 SEQUENCHER v 5.4.6 (GeneCodes, Madison, WI, USA). The taxa which we had to take this

997 approach were Amphizonella sp. 2, aerophyla, Goncevia foncevia and Hyalosphenia

998 papilio, Pellita catalonica for ITGA, and Nebela sp. for both integrin proteins.

999

1000 Presence of integrin in the Mastigamoeba genome

1001 To confirm the presence of these genes in the genome, we blasted the integrin transcripts of M.

1002 balamuthi to the whole shotgun genome data (CBKX00000000;

1003 https://www.ncbi.nlm.nih.gov/nuccore/CBKX000000000.1) on NCBI [33]. The common splicing

1004 sites were for searched in integrins intron and exon boundaries by assembling the integrin genomic

1005 contig and integrin transcript in SEQUENCHER v 5.4.6. The genome contigs of which contained

1006 integrin gene were searched for additional ORFs to examine the phylogenetic affinity of other

1007 ORFs on the genomic contigs. We inferred maximum likelihood phylogenetic tree of tenascin

41 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1008 (adjacent to the ITGB gene on CBKX010020318.1) and PA14 (adjacent to the ITGB gene on

1009 CBKX010020319.1), we used same conditions as the integrin phylogenetic trees (see below).

1010 1011 Phylogenetic trees

1012 To look for evolutionary relationships of ITGA and ITGB within Amorphea, protein sequences

1013 were aligned by MAFFT-LINSI with the parameters “--maxiterate 1000” and “--local pair”.

1014 Ambiguous sites were trimmed from the alignments using BMGE [80] by gap penalty of 0.8 settings.

1015 Maximum likelihood (ML) trees were inferred from these trimmed alignments in IQTREE v 1.5.5

1016 [81]under the LG model with the C60 series model of site heterogeneity. Each tree is ML

1017 bootstrapped (MLBS) by 1,000 pseudoreplicates. From the resultant tree we used a custom

1018 PYTHON script implementing ETE-TOOLKIT (www.etetoolkit.org) to map protein domain

1019 architecture (IPRSCAN domain IDs) onto the tree.

1020

1021 Data and software availability

1022 The accession number for the Mycamoeba gemmipara and Mastigamoeba balamuthi

1023 transcriptome raw reads is NCBI-SRA: XXXXX and XXXXX also shown in Key resources table.

1024 All protein and nucleotide sequences of each gene, all protein alignments (trimmed and untrimmed)

1025 for phylogenetic trees, and phylogenetic trees are available on the DRYAD repository

1026 (https://datadryad.org/ | Accession: XXXXX).

1027

1028

1029 Supplemental Text

1030 SUPPLEMENTAL METHODS, NOTES, AND RESULTS 1031 Key resources Table

42 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Resources Sources Identifier 1032 Data Examined Acanthamoeba castellanii [56] https://doi.org/10.1186/gb-2013-14-2-r11 Acanthamoeba pyroformis [57] https://doi.org/10.1186/s13062-016-0171- 0 Acanthamoeba astronyxis PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB7687/ Acanthamoeba comandoni PRJNA354221 https://www.ncbi.nlm.nih.gov/bioproject/ PRJNA354221 Acanthamoeba culbertsoni PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB7687 Acanthamoeba divionensis PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB7687 Acanthamoeba healyi PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB7687 Acanthamoeba lenticulate PRJNA374454 https://www.ncbi.nlm.nih.gov/bioproject/ PRJNA374454 Acanthamoeba lugdunensis PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB7687 Acanthamoeba mauritaniensis PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB7687 Acanthamoeba pearcei PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/

PRJEB7687 Acanthamoeba polyphaga PRJNA307312 https://www.ncbi.nlm.nih.gov/bioproject/

PRJNA307312 Acanthamoeba quina PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/

PRJEB7687 Acanthamoeba rhysodes PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/

PRJEB7687 Acanthamoeba royreba PRJEB7687 https://www.ncbi.nlm.nih.gov/bioproject/

PRJEB7687 Acytostelium ellipticum PRJEB14640 https://www.ncbi.nlm.nih.gov/bioproject/

PRJEB14640 Acytostelium leptosomum PRJEB14640 https://www.ncbi.nlm.nih.gov/bioproject/

PRJEB14640 Acytostelium subglobosum [58] https://doi.org/10.1186/s12864-015-1278-

x

Amoeba proteus [14] https://doi.org/10.1093/molbev/msx162 Amphizonella sp. 2 [59] https://doi.org/10.1016/j.cub.2019.01.078 Amphizonella sp. 7 [59] https://doi.org/10.1016/j.cub.2019.01.078 Amphizonella sp. 8 [59] https://doi.org/10.1016/j.cub.2019.01.078 Amphizonella sp. 9 [59] https://doi.org/10.1016/j.cub.2019.01.078

Amphizonella sp.10 [59] https://doi.org/10.1093/molbev/msx162

Arcella gibbosa [14] https://doi.org/10.1093/molbev/msx162 vulgaris [59] https://doi.org/10.1016/j.cub.2019.01.078

43 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Armaparvus languidus MMETSP0420 https://www.imicrobe.us/project/view/304 Cavenderia deminutiva PRJEB14640 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB14640

Cavostelium apophysatum [14] https://doi.org/10.1093/molbev/msx162 Centropyxis aerophyla [59] https://doi.org/10.1016/j.cub.2019.01.078 Centropyxis sp. [59] https://doi.org/10.1016/j.cub.2019.01.078

Ceratiomyxa fruticulosa [14] https://doi.org/10.1093/molbev/msx162

Ceratiomyxella tahitiensis [14] https://doi.org/10.1093/molbev/msx162

Clastostelium recurvatum [14] https://doi.org/10.1093/molbev/msx162 sp. [60] https://doi.org/10.1016/j.ympev.2016.03.0

29

Cochliopodium minus [14] https://doi.org/10.1093/molbev/msx162

Copromyxa protea [14] https://doi.org/10.1093/molbev/msx162 Coremiostelium polycephalum PRJEB14640 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB14640

Cryptodifflugia operculata [14] https://doi.org/10.1093/molbev/msx162

Cunea sp. [JDS-Ruffled] [14] https://doi.org/10.1093/molbev/msx162 Cunea sp. [MMETSP0417] MMETSP0417 https://data.iplantcollaborative.org/dav/ipl ant/commons/community_released/imicro be/projects/104/samples/1830/ Cyclopyxis sp. [59] https://doi.org/10.1016/j.cub.2019.01.078

Dermamoeba algensis [14] https://doi.org/10.1093/molbev/msx162 Dictyostelium citrinum [61] https://doi.org/10.1093/molbev/msn088 Dictyostelium discoideum [62] https://doi.org/10.1038/nature03481 Dictyostelium firmibasis [63] https://doi.org/10.1006/plas.1998.1385 Dictyostelium intermedium PRJNA45879 https://www.ncbi.nlm.nih.gov/bioproject/ PRJNA45879 Dictyostelium purpureum [64] https://doi.org/10.1006/10.1186/gb-2011- 12-2-r20 bryophila [59] https://doi.org/10.1016/j.cub.2019.01.078

Difflugia sp. [14] https://doi.org/10.1093/molbev/msx162 Difflugia sp. D2 [59] https://doi.org/10.1016/j.cub.2019.01.078

Diplochlamys sp. [14] https://doi.org/10.1093/molbev/msx162 Dracoamoeba jomungandrii [57] https://doi.org/10.1186/s13062-016-0171- 0

Echinamoeba exudans [14] https://doi.org/10.1093/molbev/msx162

Echinosteliopsis oligospora [14] https://doi.org/10.1093/molbev/msx162

Echinostelium bisporum [14] https://doi.org/10.1093/molbev/msx162

Echinostelium minutum [14] https://doi.org/10.1093/molbev/msx162 Endostelium zonatum [PRA-191] [57] https://doi.org/10.1186/s13062-016-0171-

0

Endostelium zonatum LINKS [14] https://doi.org/10.1093/molbev/msx162 Entamoeba dispar PRJNA12916 https://www.ncbi.nlm.nih.gov/bioproject/ 12916

44 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Entamoeba histolytica [65] https://doi.org/10.1038/nature03291 Entamoeba invadens [66] https://doi.org/10.1186/gb-2013-14-7-r77 PRJNA12921 https://www.ncbi.nlm.nih.gov/bioproject/ 12921 Entamoeba nuttalli [67] https://doi.org/10.1016/j.molbiopara.2007.

02.006

Filamoeba nolandi MMETSP0413 https://www.imicrobe.us/assembly/view/2 93

Flabellula citata [14] https://doi.org/10.1093/molbev/msx162

Flamella aegyptia [14] https://doi.org/10.1093/molbev/msx162

Gocevia fonbrunei [14] https://doi.org/10.1093/molbev/msx162 Gocevia fonbrunei_Tekle_2016 [60] https://doi.org/10.1016/j.ympev.2016.03.0

29

Grellamoeba robusta [14] https://doi.org/10.1093/molbev/msx162 sphagni [59] https://doi.org/10.1016/j.cub.2019.01.078 Heleopera sylvatica [59] https://doi.org/10.1016/j.cub.2019.01.078 Heterostelium multicystogenum PRJNA495730 https://www.ncbi.nlm.nih.gov/bioproject/ PRJNA495730 Heterostelium pallidum [61] http://dictybase.org/ Hyalosphenia elegance [59] https://doi.org/10.1016/j.cub.2019.01.078 Hyalosphenia papilio [59] https://doi.org/10.1016/j.cub.2019.01.078 Idionectes vortex PRJNA531640 https://www.ncbi.nlm.nih.gov/bioproject/? term=Idionectes%20vortex Lesquereusia sp. [59] https://doi.org/10.1016/j.cub.2019.01.078

Lingulamoeba sp. [14] https://doi.org/10.1093/molbev/msx162 Luapelamoeba arachisporum [57] https://doi.org/10.1186/s13062-016-0171- 0 Luapelamoeba hula [57] https://doi.org/10.1186/s13062-016-0171- 0

Mastigamoeba abducta [14] https://doi.org/10.1093/molbev/msx162

Mastigamoeba balamuthi (genome) [33] http://doi.org/10.1073/pnas.1219590110 Mastigamoeba balamuthi SRA TBD To be provided (transcriptome)

Mastigella eilhardi [14] https://doi.org/10.1093/molbev/msx162

Mayorella cantabrigiensis [14] https://doi.org/10.1093/molbev/msx162

Micriamoeba sp. [14] https://doi.org/10.1093/molbev/msx162 Microchlamys sp. [59] https://doi.org/10.1016/j.cub.2019.01.078 Mycamoeba gemmipara SRA TBD https://doi.org/10.1111/jeu.1235 Nebela sp. [59] https://doi.org/10.1016/j.cub.2019.01.078

Nematostelium gracile [14] https://doi.org/10.1093/molbev/msx162 Netzelia oviformis [59] https://doi.org/10.1016/j.cub.2019.01.078

Nolandella sp. [14] https://doi.org/10.1093/molbev/msx162

Ovalopodium desertum [14] https://doi.org/10.1093/molbev/msx162

Paradermamoeba levis [14] https://doi.org/10.1093/molbev/msx162

45 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Paramoeba aestuarina MMETSP0161_2 https://www.imicrobe.us/assembly/view/2 05 atlantica MMETSP0151_2 https://www.imicrobe.us/assembly/view/2 10

Parvamoeba rugata [14] https://doi.org/10.1093/molbev/msx162 Parvamoeba monoura [32] https://doi.org/10.1016/j.ympev.2016.03.0

29

Pellita catalonica [14] https://doi.org/10.1093/molbev/msx162

Pelomyxa sp. [14] https://doi.org/10.1093/molbev/msx162

Phalansterium solitarium [14] https://doi.org/10.1093/molbev/msx162 Physarum polycephalum [68] https://dx.doi.org/10.1093/gbe/evv237

Pygsuia biforma [11] http://doi.org/10.1098/rspb.2013.1755 Planocarina carinata [59] https://doi.org/10.1016/j.cub.2019.01.078 pallidum [69] http://dictybase.org/

Protacanthamoeba bohemica [14] https://doi.org/10.1093/molbev/msx162

Protosporangium articulatum [14] https://doi.org/10.1093/molbev/msx162

Protostelium nocturnum [14] https://doi.org/10.1093/molbev/msx162 Pyxidicula operculatum [59] https://doi.org/10.1016/j.cub.2019.01.078

Rhizamoeba saonxica [14] https://doi.org/10.1093/molbev/msx162

Rhizomastix elongate [14] https://doi.org/10.1093/molbev/msx162

Rhizomastix libera [14] https://doi.org/10.1093/molbev/msx162

Ripella sp. [14] https://doi.org/10.1093/molbev/msx162 Sapocribrum chincoteaguense MMETSP0437 https://www.imicrobe.us/assembly/view/3 14

Sappinia pedata [14] https://doi.org/10.1093/molbev/msx162

Schizoplasmodiopsis [14] https://doi.org/10.1093/molbev/msx162 pseudoendospora

Schizoplasmodiopsis vulgaris [14] https://doi.org/10.1093/molbev/msx162 Speleostelium caveatum PRJNA495862 https://www.ncbi.nlm.nih.gov/bioproject/ PRJNA495862

Soliformovum irregularis [14] https://doi.org/10.1093/molbev/msx162

Squamamoeba japonica [14] https://doi.org/10.1093/molbev/msx162

Stenamoeba limacine [14] https://doi.org/10.1093/molbev/msx162

Stenamoeba stenopodia [14] https://doi.org/10.1093/molbev/msx162 Stygamoeba regulata MMETSP0447 https://www.imicrobe.us/assembly/view/3 16 Synstelium polycarpum PRJEB14640 https://www.ncbi.nlm.nih.gov/bioproject/ PRJEB14640 quadrilineata [60] https://doi.org/10.1016/j.ympev.2016.03.0

29

Thecamoeba sp. [14] https://doi.org/10.1093/molbev/msx162

Thecamoebidae isolate [RHP1-1] [14] https://doi.org/10.1093/molbev/msx162 Trichosphaerium sp. (ATCC 40318) MMETSP0405 https://www.imicrobe.us/assembly/view/3 00

46 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Tychosporium acutostipes [14] https://doi.org/10.1093/molbev/msx162

Vannella fimicola [14] https://doi.org/10.1093/molbev/msx162 robusta MMETSP0166 https://www.imicrobe.us/?#/samples/2469 Vannella schaefferi MMETSP0417 https://www.imicrobe.us/assembly/view/2 95 Vannella sp. MMETSP0168 https://www.imicrobe.us/assembly/view/3 4

Vermamoeba sp. [14] https://doi.org/10.1093/molbev/msx162

Vermamoeba vermiformis [14] https://doi.org/10.1093/molbev/msx162 Vermistella antarctica [60] https://doi.org/10.1016/j.ympev.2016.03.0

29 bacillipedes [70] http://doi.org/10.1016/j.ympev.2014.08.01

1 Vexillifera sp. MMETSP0173 https://www.imicrobe.us/#/samples/1757

1033 List of Stramenophile TAXA from MMETSP 1034 Alexandrium minutum MOORE-via-CAMERA-MMETSP0328 1035 Aureoumbra_lagunensis CCMP1509 MMETSP0890 1036 Bolidomonas_sp RCC1657 MMETSP1321 1037 Bolidomonas_sp RCC2347 MMETSP1320 1038 Bolidomonas pacifica MMETSP1319-RCC208 1039 Cafeteria_sp CaronLabIsolate MMETSP1104 1040 Chattonella_subsalsa CCMP2191 MMETSP0947 1041 Chrysocystis_fragilis CCMP3189 MMETSP1165 **** 1042 Dictyocha_speculum CCMP1381 MMETSP1174 1043 Dinobryon_sp UTEXLB2267 MMETSP0019 1044 Fibrocapsa japonica-CCMP1661 MOORE-via-CAMERA-MMETSP1339 1045 Florenciella_sp RCC1587 MMETSP1324 Stramenopiles 1046 Florenciella parvula MMETSP1323-RCC1693 1047 Heterosigma_akashiwo CCMP2393 MMETSP0292 1048 Love2098 non_described LOVEJOY CMP2098 MMETSP0990 1049 Mallomonas_sp CCMP3275 MMETSP1167 1050 Ochromonas_sp BG1 MMETSP1105 1051 Paraphysomonas_bandaiensis CaronLabIsolate MMETSP1103 1052 Paraphysomonas_vestita GFlagA MMETSP1107 1053 Pelagomonas_calceolata CCMP1756 MMETSP0886 1054 Pelagococcus_subviridis CCMP1429 MMETSP0882 1055 Phaeomonas_parva CCMP2877 MMETSP1163 1056 Pinguiococcus_pyrenoidosus CCMP2078 MMETSP1160 1057 Pseudopedinella_elastica CCMP716 MMETSP1068 1058 Rhizochromulina_marina cf CCMP1243 MMETSP1173 1059 Spumella_elongata CCAP955 1 MMETSP1098 1060 Thraustochytrium_sp LLF1b MMETSP0198 1061 Vaucheria_litorea CCMP2940 MMETSP0945

47 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1062 1063 Reagents 2-propanol(certified ACS) Sigma-aldrich I9516-25ML Ethanol (pure) 200 proof

Critical Commercial Assay Kits Nextera XT DNA Library Prep Kit Illumina FC-131-1096 Nextera XT index kit v2 set A Illumina FC-131-2001 Nextera XT index kit v2 set B Illumina FC-131-2002 Nextera XT index kit v2 set C Illumina FC-131-2003 Nextera XT index kit v2 set C Illumina FC-131-2003

1064 Software and Algorithms CD-HIT [71] https://github.com/weizhongli/cdhit TRINITY V2.4.0 [72] https://github.com/trinityrnaseq/trinityrnase q/ TRANSDECODER V 5.5.0 [73] https://transdecoder.github.io/ TMHMM V 2.0 [74] http://www.cbs.dtu.dk/services/TMHMM/ BOWTIE2 V 2.3.4.3 [75] https://sourceforge.net/projects/bowtie- bio/files/bowtie2/2.3.3.1 DIAMOND V 0.9.25 [76] https://github.com/bbuchfink/diamond ORTHOMCL V 5.0 [77] http://orthomcl.org/orthomcl/ TRIMMOMATIC V 0.35 [78] https://github.com/timflutre/trimmomatic MAFFT-LINSI V 7 [79] https://mafft.cbrc.jp/alignment/server/ BMGE V 1.12 [80] ftp://ftp.pasteur.fr/pub/GenSoft/projects/BM GE SEQUENCER v 5.4.6. http://www.genecodes.com/ BLAST 2.2.30+ https://blast.ncbi.nlm.nih.gov/ IQTREE v 1.5.5 [81] http://www.iqtree.org RSEM [82] https://github.com/deweylab/RSEM SIGNALIP 5.0 [83] http://www.cbs.dtu.dk/services/SignalP/ OYSTER RIVER PROTOCOL V 2.1.1 [84] https://oyster-river- protocol.readthedocs.io/en/latest/ MEME-SUITE V 5.0.4 [85] http://meme-suite.org/index.html DEEPLOC -1.0 [86] http://www.cbs.dtu.dk/services/DeepLoc/ INTERPROSCAN 5.27-66.0 [23] https://www.ebi.ac.uk/interpro/search/seque nce-search ETE3 [87] http://etetoolkit.org/documentation/ete- view/ PFAM [88] https://pfam.xfam.org/ MCL [89] https://micans.org/mcl/

48 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1065

1066

1067 1068 Contact for Reagent and resources sharing

1069 Further information and request for resources should be directed to, and will be fulfilled by, the

1070 corresponding author Matthew W. Brown ([email protected]).

1071

1072 Sib (similar to integrin beta) proteins found in dictyostelids | Sib Proteins comparison with

1073 ITB-like

1074 SibA, a cell adhesion molecule that binds to phagocytic particles similar to ITB, was discovered

1075 in Dictyostelium discoideum [45]. Sib proteins (A to E) are capable of binding to talin and SibC

1076 regulates cell adhesion [46]. However their protein structure is different from the canonical ITB

1077 in a lack of three cation binding motifs as well as a cysteine rich region at the C-terminus [26].

1078 Although the varioseans, described above, are phylogenetically close to dictyostelids their type

1079 two ITB-like proteins with three cation motifs are distinct from Sib proteins, though with a great

1080 variation in amino acid sequences. Interestingly, we did not observe Sib proteins outside of the

1081 Dictyostelia clade, suggesting that they evolved recently in the dictyostelids. The protein

1082 architecture of SibA has a similar composition to metazoan ITB solely with a GXXXG motif in

1083 the transmembrane region, but it is observed in type II ITB (see supplemental table 2). In

1084 addition, LamG3, vWD domains in type II ITB have never been observed in Sib.

1085

1086 1087 1088 1089 1090 1091 1092

49 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1093 1094 1095 1096

IntegrinIntegrin Actinin axillinalin AK ilamin α− β− P PINCHF α− F Vinculin T ILK TensinTaxon Amphizonella sp. 7 Amphizonella sp. 9 Amphizonella sp. 8 Amphizonella sp.10 Amphizonella sp. 2 Diplochlamys sp. Trichosphaerium sp. Micriamoeba sp. Echinamoeba exudans Vermamoeba vermiformis Vermamoeba sp. Flabellula citata Rhizamoeba_saxonica Cryptodifflugia operculata Pyxidicula operculatum Microchlamys sp. Heleopera sylvatica Heleopera sphagni Nebela sp. Hyalosphenia papilio Centropyxis aerophyla Lesquereusia sp. Difflugia sp. D2 Difflugia bryophila Arcella vulgaris Nolandella sp. Copromyxa protea Amoebozoa Armaparvus languidus Sapocribrum chincoteaguense Squamamoeba japonica Rhizomastix elongata Evosea Rhizomastix libera sp. eilhardi Mastigamoeba balamuthi Mastigamoeba abducta Entamoeba histolytica Entamoeba dispar Entamoeba invadens Entamoeba moshkovskii Entamoeba nuttalli ILK Dictyostelium discoideum Physarum polycephalum Echinosteliopsis oligospora Echinostelium minutum ILK Echinostelium bisporum Protosproangium articulatum Clastostelium recurvatum fruticulosa Flamella aegyptia solitarium Schizoplasmodiopsis vulgaris Tychosporium acutostipes Schizoplasmodiopsis pseudoendospora Cavostelium apophysatum Protostelium nocturnum Filamoeba nolandi Soliformovum irregularis Grellamoeba robusta Nematostelium gracile Acramoeba dendroidea Ceratiomyxella tahitiensis Stenamoeba stenopodia Stenamoeba limacina

Thecamoeba quadrilineata Discosea Thecamoeba sp. pedata isolate [RHP1-1] cantabrigiensis Paradermamoeba levis Dermamoeba algensis Mycamoeba Gemmipara Vermistella antarctica Stygamoeba regulata Vexillifera minutissima Vexillifera sp. [MMETSP0173] Paramoeba atlantica Paramoeba aestuarina Cunea sp. Cunea sp. [JDS-Ruffled] Vannella sp. Vannella robusta Ripella sp. sp. ILK Clydonella sp. Ovalopodium desertum Parvamoeba rugata Gocevia fonbrunei minus Pellita catalonica Endostelium zonatum LINKS Endostelium zonatum [PRA-191] Dracoamoeba jomungandrii Protacanthamoeba bohemica Luapelamoeba arachisporum Luapelamoeba hula Acanthamoeba castellanii Acanthamoeba pyroformis

1097 1098 Supplemental Figure 1: Repertoire of IMAC including filamin and tensin in Amoebozoa: 1099 names of IMAC proteins are listed at the top and names of amoebozoan taxa are listed on side of 1100 the map. Arrows indicates amoebozoan species that have nearly complete set of IMAC proteins.

50 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1101 Dark blue indicates the presence of a canonical IMAC protein, white indicates the absence of an 1102 IMAC protein and light blue indicates a truncated form of IMAC protein. 1103 1104

vin r rvin a a IntegrinIntegrinP P Actinin A axillinalin AK α− β− α− β− VinculinP T PINCHF ILK α−Taxon B β Arcella vulgaris Arcella Gibbosa Netzelia oviformis Cyclopyxis sp α Difflugia sp. D2 Difflugia bryophila

Lesquereusia sp. Tubulinea Difflugia sp. USP Centropyxis aerophyla Centropyxis sp. Planocarina carinata Nebela sp. Hyalosphenia papilio Hyalosphenia elegance Heleopera sylvatica Heleopera sphagni Pyxidicula operculatum Microchlamys sp. Cryptodifflugia operculata Nolandella sp. Extracellular Copromyxa protea Amoeba proteus Ptoleamoeba bullinesis Cytoplasmic Rhizamoeba_saxonica c-Src FAKVin Flabellula citata Amphizonella sp. 2 L Amphizonella sp. 7 Pin Amphizonella sp. 9 or Tal Amphizonella sp. 8

Amphizonella sp.10 Pax αA Diplochlamys sp. ILK e Trichosphaerium sp. Echinamoeba exudans Par Vermamoeba vermiformis Vermamoeba sp. Micriamoeba sp. Armaparvus languidus Idionectes vortex Actin Sapocribrum chincoteaguense Squamamoeba japonica Entamoeba nuttalli pax Entamoeba moshkovskii Fungi Entamoeba invadens Evosea vin Animals Entamoeba dispar Pygsuia pin Entamoeba histolytica tal Choanofagellates Rhizomastix libera ILK par Capsaspora Rhizomastix elongata Thecamonas integrins Pelomyxa sp. pin Mastigella eilhardi Mastigamoeba balamuthi ILK FAK Mastigamoeba abducta par Dictyostelium discoideum Dictyostelium citrinum Amoebozoa Tieghemostelium lacteum Dictyostelium firmibasis Dictyostelium intermedium Dictyostelium purpureum Speleostelium caveatum Cavenderia deminutiva integrins Acytostelium ellipticum Acytostelium leptosomum Acytostelium subglobosum ILK Synstelium polycarpum Coremiostelium polycephalum Heterostelium pallidum Heterostelium multicystogenum Physarum polycephalum Echinostelium bisporum Echinosteliopsis oligospora Echinostelium minutum Clastostelium recurvatum Protosproangium articulatum Opisthokonta Ceratiomyxa fruticulosa Phalansterium solitarium Flamella aegyptia Schizoplasmodiopsis vulgaris Tychosporium acutostipes Schizoplasmodiopsis pseudoendospora Cavostelium apophysatum Protostelium nocturnum Filamoeba nolandi Soliformovum irregularis Grellamoeba robusta Nematostelium gracile Obazoa Ceratiomyxella tahitiensis Acramoeba dendroidea Vexillifera minutissima Innovation/Gain Vexillifera sp. [MMETSP0173] Paramoeba atlantica Paramoeba aestuarina Loss Cunea sp. [MMETSP0417] Cunea sp. [JDS-Ruffled] Vannella sp. Vannella robusta Other Eukaryotes Amorphea Vannella firmcola Vannella schaefferi Ripella sp. Lingulamoeba sp. Clydonella sp. Stygamoeba regulata Vermistella antarctica Stenamoeba stenopodia

Stenamoeba limacina Discosea Thecamoeba quadrilineata Thecamoeba sp. Thecamoebidae isolate [RHP1-1] Mycamoeba Gemmipara Mayorella cantabrigiensis Paradermamoeba levis Dermamoeba algensis Ovalopodium desertum Parvamoeba rugata Parvamoeba monoura Cochliopodium minus Pellita catalonica Gocevia fonbrunei Gocevia fonbrunei_Tekle_2016 Endostelium zonatum LINKS Endostelium zonatum [PRA-191] Dracoamoeba jomungandrii Protacanthamoeba bohemica Luapelamoeba arachisporum Luapelamoeba hula Acanthamoeba castellanii Acanthamoeba pyroformis Acanthamoeba healyi Canonical Protein Present Acanthamoeba astronyxis Acanthamoeba commandoni Acanthamoeba culbertsoni Truncated Protein Present Acanthamoeba divionensis Acanthamoeba healyi Protein Absent in Data Acanthamoeba lenticulate Acanthamoeba lugdunensis Acanthamoeba mauritaniensis Nearly Complete IMAC Acanthamoeba pearcei Acanthamoeba polyphaga Acanthamoeba quina Acanthamoeba rhysodes 1105 Acanthamoeba royreba 1106 Supplemental Figure 2: Repertoire of IMAC in only amoebozoan species that have either 1107 integrin alpha or beta. Names of IMAC proteins are listed at the top and names of amoebozoan 1108 taxa are listed on side of the map. Arrows indicates amoebozoan species that have nearly 1109 complete set of IMAC proteins. Dark blue indicates the presence of a canonical IMAC protein, 1110 white indicates the absence of an IMAC protein and light blue indicates a truncated form of 1111 IMAC protein 1112 1113

51 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1114 1115 Supplemental Figure 3: Maximum likelihood tree of ITA homolog and other cell adhesion 1116 molecules that contain beta-propellers. Bootstrap values ≥50 are shown on the branch points. 1117 Phylogenetic tree was built with IQtree v1.5.5 under under the LG+C60+F+G model of protein 1118 evolution ML bootstrap (MLBS) (1000 ultrafast BS reps) values respectively. The protein 1119 sequences were aligned by MAFFT with the parameter linsi, maxiterate 1000 and local pair. 1120 BMGE was used to mask the alignment with the parameter of gap penalty of 0.8. ITA homolog 1121 phylogenetic tree are rooted with Linkin protein. Amoebozoan ITA genes are mostly 1122 concentrated on a single clade (colored) except for the Flabellula citata paralog. 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138

52 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Trichoplax sp. H2_3000888_ITGB6

Trichoplax sp. H2_3000889_ITGB1 100 Metazoan_ITGB

Ministeria_vibrans_c46

9 2 Pigoraptor_vietnamica_104197_Paralog2 100 Pigoraptor_chileana_93177_Paralog1

Syssomonas_multiformis_24197

Ministeria_vibrans_c334

7 5 Capsaspora_owczarzaki_CAOG01283T0 9 4 Pigoraptor_vietnamica_12581_Paralog1

100 Opistho-2_34916_Pigoraptor 8 6 100 Pigoraptor_chileana_34909_Paralog2

Ministeria_vibrans_c191

Capsaspora_owczarzaki_CAOG04086T0

Capsaspora_owczarzaki_CAOG05058T0 9 6 Sphaeroforma arctica_XP_014159356.1_ITGB 100 7 8 Creolimax_fragrantissima_CFRG6586T1

Capsaspora_owczarzaki_CAOG02005T0

Trichodesmium erythraeum_ABG51146.1_ITGB-like 9 1 Rhodobacteraceae bacterium_KPP84658.1_ITGB_like

Schizoplasmodiopsis_vulgaris_TR1491 9 3 Cavostelium_apophysatum_c13787 9 0 9 3

100 Filamoeba_nolandi_DN12164

9 8 Soliformovum_irregularis_c14427

9 2 9 0 Clastostelium_recurvatum_c10578

Didymoeca costata_m.142452_ITGB-like 9 0 9 2 Pygsuia_biforma_ITGB 9 8 Lenisia_limosa_LenisMan46

Thecamonas_trahens_AMSG_06622T0

Flabellula_citata_135911 8 7 9 6 Mastigamoeba_balamuthi_MMETSP0035-20110809 7 8 Mastigella_eilhardi_A1302_1610_2327 9 9 Idionectes_vortex_TRINITY_DN4916_c4_g12_paralog1

Rigifila_ramosa_a2731185122

Rhizamoeba_saxonica_TR7441 9 1 9 7 100 Rhizamoeba_saxonica_TR11158

7 5 Rhizamoeba_saxonica_TR7856

Idionectes_vortex_TRINITY_DN4117_C0_G1_I1_paralog2

Echinamoeba_exudans_TR10338

Pyxidicula_operculatum_TR11783 9 1 Cryptodifflugia_operculata_c2 100 9 7 Heleopera_sphagni_DN3395

9 0 Difflugia_sp._D2_TR12857 100 Difflugia_sp._USP_TR583 9 9

9 9 Hyalosphenia_papilio_DN7101

Nebela_sp._0_TR6609

100 Sib_Dictyostelium discoideum

1139 0.5 1140 Supplemental Figure 4: Maximum likelihood tree of ITB homolog and Sib protein. Bootstrap 1141 values ≥50 are shown on the branch points. Phylogenetic tree was built with IQtree v1.5.0 under 1142 the LG+C60+F+G model of protein evolution ML bootstrap (MLBS) (1000 ultrafast BS reps) 1143 values respectively. The protein sequences were aligned by MAFFT with the parameter linsi, 1144 maxiterate 1000 and local pair. BMGE was used to mask the alignment with the parameter of 1145 gap penalty of 0.6. ITB homolog phylogenetic tree are rooted with Sib protein. Amoebozoan ITB 1146 genes are mostly concentrated on a single clade (colored). 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157

53 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

ITGA Obazoan Motif

A. 4

3

bits 2 R F E Y A N 1 Q P V T FY N I L GS Q I S Y S SM G K T N L T E I M K V N V GT G G T E L S V T Y D F P THR S YL L RF S R K Q L G M NDNC VLVR D G I L M D AD V K C E L A V K S I A R A AA ADGRAS I PA S D F I MSC A 2 3 4 5 6 7 8 9 0 1 D AVI F 10 11 G12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 G N GAMEME (no SSC) 16.04.2019 22:29 B. 4 D DG D P 3

bits 2 S V F S 1 LWSYQA PL ER D GCY 2 3 4 5 6 7 8 9 0 1 Q C A MLGHD H 10 11 12 V13 E14 G Y MEME (no SSC) 26.06.2019P 15:18

C. 4 C G C 3

bits 2 GM Y L K P PT 1 K AL T F I K A T QFYEVN K S H A N E SA S M R M SE A M LA A D T CVLWRM M RENLP HQ L AFGVAGPHA L KT KTEDHDG 2 3 4 5 6 7 8 9 0 1 A AH K

GF 10 11 12 13 14 15 16 17 18 19 20 RK MEME (no SSC) 16.04.2019 22:34 D. P 4 F 3

bits 2 R V T I GY T GVT V LT AY EY 1 NV S SA T TY PS V R N Q N N S V V YT KSTVK T KDLQSL I SM TS TL FY I R I H P I L I S S A LFG L G C KL F R FF 2 3 4 5 6 7 8 9 0 I 1 LF A R K Q GA H FQ L AM V VL 10 11 P12 13 14 15 F16 17 18 19 I 20 21 22 T23 24 25 26 G27 28 29 30 1158 MEME (no SSC) 26.06.2019 16:24 1159 Supplemental Figure 6: Most significantly enrichedT motifs in Obazoan ITA, obtainedP byV 1160 MEME-suite. (A) Consensus cation motifG and FG-GAP obtained by RNA-seq for an Obazoan 1161 ITA, (B) EGF motif found exclusivelyN in MinisteriaA vibrans (C) GFFKR motif found in Obazoa, 1162 (D) immunoglobulin like motif found in Obazoa. 1163

54 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

ITGB Obazoan Motif

A. 4 B.4

3 3 bits 2 bits 2 T QQ G L A Y 1 TVPG L M L A K R L L A S 1 A P Y A T RC L DK R A N Y LQ R K L F R F S S A R I L S P P R S Q G Q I T L Y PS T R KL N G NV S A A S PY S V D E V L V M E V F Y N K E P Q Y I V K E V NS P Y L T S T T F G MV M T VY E R Q L S R A G LV Q K M I V I LVS E E KF V M N SV Q F A A D P S R F R D T D H Q V N I D E V M S AE K P F V I D R I I I I G I I I LT YN MR VE I G Q G G I G GG Q FLQ 2 3 4 5 6 7 8 9 1 T K F N D GFDY M 0 Y V D L K N A I 2 3 4 5 6 7 8 9 1 10 F11 12 13 14 15 16 17 18 19 20 F21 22 23 24 25 26 27 28 29 0 LM K L H S 10 11 12 13 14 15 16 17 18 19 20 21 22 MEME (no SSC) 27.06.2019 17:57 G V L S SM F T A H E MEME (no SSC) 16.04.2019 22:28 C. D. 4 D D D 4 R D 3 3

bits 2 bits 2 T W S M A TKLV L 1 S Y P L RT I I M LA V R T A L V KN I I N 1 AY I AY F QG Q F N V G R GT V E D S V R A VL S S T N T L H L A YR QV I N D L DD T V L T Q I N S L QSS S Q S PE W H F F VR SR S N E V A DT M F NI I S D A G A H A A G A A S PAM K M M L D K N I M I K LK RG S SVYL S E M S A EA G LD 2 3 4 5 6 7 8 9 1 D H N H C F 0 A STFQWRQ N AI KA L VN E H H V 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 I 2 3 4 5 6 7 8 9 S 0 1 R

P 10 11 12 13 14 15 16 17 DG MEME (no SSC) 16.04.2019 22:31 G E N D E Q G F KP MEME (no SSC) 16.04.2019 22:32 E. F. 4 C W 4 P 3 3 bits bits 2 2 K S D RND PLAT I GT DT S V V V S 1 Y T I P GV T T R 1 V K S D T Q D N T K KS S A S FR T S R N V V V Y G I QM V S E E S TI P HS R N S T T M G YQ R S D L T L GL G E D Q E N N Q T ET K Q T F N S D LT S N V Y L E DN M ENH GPK FL SQ F D QAG E ND E D L E D Q K Q Y A A A A N DAD A I F G A I D DAE E AAF Q G N NF E A A L T F 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 9 1 0 1 G I N G A S P I 0 F 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 A E 10 11 NT12 13 14 15 L16 17 18 19 20 21 22 23 24 25 G26 27 28 29 GW G MEME (no CSSC) 16.04.2019 22:33 GH C G G MEME (no SSC) 16.04.2019 22:29 4 C C C C C C C C H. 4 G C C C G. W 3 3

bits 2 bits 2 S Q T 1 T K E Y I V R K E SA D VT 1 E Q Q L QM Q G N V N MAM E Q NT QPN D T K R F N L S I F Y I Q AD K Q S G Y Y V I VVR A R A F R K G L R S DG V N G I N L A N T A L Q K Q Q Y S RR N A A T I G K ALAE ED D I V L GF WLLE KD N QQ MR H AF Q F W VGHKKS GEK ES I G 2 3 4 5 6 7 8 9 1 H Q 0 K AHHQ D W G D ARCA ANNA AHHE L 10 11 12 13 14 15 16 17 18 19 20 21 R 2 3 4 5 6 7 8 9 T 0 1 R

S 10 11 12 13 14 15 16 17 18 19 20 21 Y MEME (no SSC) 16.04.2019 22:28 E K MEME (no SSC) 16.04.2019 20:55 1164 L E Y D F 1165 SupplementalW N PFigure 7: Most significantlyNP enriched motifsK in Obazoan ITB,W obtained by MEME. 1166 (A-D) Cation binding motifs (A) MIDAS binding motif DXSXS (B) Possible fifth position of 1167 MIDAS and ADMIDAS, (C) ITB SyMBS (green) and MIDAS (Yellow), (D) Possible E of 1168 SyMBS (first amino acid of SyMBS) (E-F) Cysteine rich motif, (E) PSI domain, (F) Cysteine- 1169 rich motif stalk, (G-H) C-terminal motifs (G) NPXY motif, (H) Possible KLXXXD motif 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190

55 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1191 1192 1193 1194 Sib Motif

A.

4

3

bits 2 V YT A SY QLTSVFWRVKQV N V 1 T T V S TYSQLVT Y L L VTQANS SSDLI ALVI RKATS I NGLVNN I TDT SARKLF I ATS I A 2 3 4 5 6 7 8 9 0 1 KA G FP I LASFALQ I LFMCSA I K Q I FG VM N 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 F31 A32 33 34 35 36 A37 38 39 40 41 42 43 44 45 46 47 N48 F49 50 I LQ LS F YE YP V G T MEMEL (noS SSC)P 27.06.2019Y 20:22 B. 4 D FSDKP G 3

bits 2 V PD E N SK DP S N L N 1 I T T Q V Y Y GKNK PT I SYK FYE KD HY EQK TAVT KV VI P E VI L T S 2 3 4 5 6 7 8 9 0 1 EGG SAKD L LT F D I VL D E G SQ M I D FT 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 D27 L28 29 30 K31 32 33 34 35 36 37 38 39 40 N41 42 43 44 45 46 47 Y48 49 Y50 C WD D N PP N S D MEME (no SSC)N 17.09.2019W 18:02 C.

4 C C G YG C C G N G G GKC C G 3 bits 2 V Q I VQ 1 M T V L VI RW L VYR TAYNG RN I PI Q KD 2 3 4 5 6 7 8 9 0 F1 H I I P NRD Q L A GR TN 10 11 12 13 14 L15 16 17 18 R19 20 21 22 23 24 25 26 27 N28 29 30 H31 32 33 SG V K N V S PMEMEN (no SSC) 17.09.2019 17:57 D. 4 N SPV M P I NN F A D 3 bits 2 Q 1 SST M KARSSA LQSS E 2 3 4 5 6 7 8 9 0 1 DGA A 10 Q11 12 13 G14 N15 G16 G17 I 18 19 20 21 22 E23 24 25 N26 27 GDG E AE MEME (noS SSC) 17.09.2019 18:05 1195 N L E AA D 1196 SupplementalV FigureNP 8:Y Most Ssignificantly enrichedNPLY motifs in Sib protein, obtained by MEME. 1197 (A)-(D) Cation binding motifs (A) Possible binding motif of MIDAS and ADMIDAS, (B) 1198 Possible fifth position of MIDAS and ADMIDAS (C) Possible binding motif of SyMBS (D) 1199 NPXY motif 1200 1201 1202 1203 1204 1205 1206 1207

56 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1208 Supplemental Table 1: The table shows the list of Amorphea ITGA proteins characteristic 1209 including: length, beta propeller motifs, subcellular location, any extra domains attached to 1210 ITGA, c-terminal motif signal peptide, transmembrane regions and type of ITGA. The order of 1211 taxons are listed in accordance to the Figure 2. The subcellular localization is displayed with 1212 likelihood value in parentheses. The location of extra domains attached to ITGA are shown as 1213 the C-terminal end (CTE) or at the start of N-terminal (NTB). The presence of ITGA interacting 1214 motif with ITGB is shown as GFFKR and GXXXG. The location of beta propeller motifs 1215 performed by MEME-suite is displayed. The presence of FG-GAP motifs are shown in asterixis. 1216 The location of beta motifs that corresponds to IPRSCAN output are highlighted by parentheses. 1217 The presence of consensus cation motifs are displayed in number 2 superscripts. Species Size Subcellular Extra Domain GFFKR Location of Signal Type I of localization C-terminal end = motif Beta propellor peptide(SP) and II AA Predictability based CTE GXXX *= predict to or on likelihood N-terminal G motif possess FG- transmembr Beginning =NTB GAP ane (TM) 2 = Cation motif (D/E-h-D/N-x- D/N-G-h-x- D/E) Filamoeba nolandi 631 Cell membrane PRICHEXTENS GFFKR 16, 47, [109], TM II (CAMPEP_01685 protein Likelihood N(CTE) GIAVG 171 71262) (0.971) protein kinase (CTE) Ministeria vibrans 917 Extracellular EGF 2 repeat 1952, 2552, II Paralog 2 (0.9862) (NT) [319]2, [436]2*, (2753598) Soluble (1) [581]2, 772, Likelihood (0.997) 8122 Rhizamoeba 449 Cell membrane None [55], 1222, 184, SP, TM I saxonica protein Likelihood 2542* (TR11094_c0_g1_ (0.437) i1) Flabellula citata 673 Cell membrane None 57*, 121, 184*, SP, TM I paralog3 protein Likelihood 251*, 313 (c2461_g1_i2) (0.301) Cryptodifflugia 120 Cell membrane (NTB) 8372, 8892*, SP II operculata Paralog 0 Protein Likelihood Thrombospondin [1033]2, [1107]2 2 (c8521_g1_i1) (0.257) type -1 repeat, EGF 4 repeat Intramolecular chaperone Nolandella sp. 975 Cell membrane (NTB) 613, [670]2*, SP II AFSM Protein Likelihood Thrombospondin [734]2*, 2 (TR12097c0_g1_i1) (0.672) type -1 repeat, [805] *, 934 EGF 4 repeat

57 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Arcella intermedia 107 Nucleus membrane (NTB) [899]2*, [965]2* SP II (c23786_g1_i1) 4 Protein Likelihood Thrombospondin (0.58) type -1 repeat, EGF 4 repeat Intramolecular chaperone Flabellula citata 626 Cell membrane EGF (CTE) 123, [259], SP II paralog1 Protein Likelihood [323]2* (c9004_g2_i1) (0.994) Flabellula citata 713 Cell membrane EGF (CTE) 88*, 150*, 255, TM II paralog2 Protein Likelihood 310 (c8227_g1_i1) (0.669) Ministeria vibrans 860 Peroxisome None 582, [125]2*, I Paralog 1 (0.3325), Soluble 185, 243, 310, (2757223) (0.9929) 3872, 448, Likelihood (0.386) [578]2*, [647], 748 Flabellula citata 604 Extracellular None 113, 172, I paralog5 (0.5213) [302]* (c8994_g2_i16) Soluble (0.9842) Likelihood (0.622) Flabellula citata 646 Cell membrane None 53, [116]*, SP, TM I paralog4 Protein Likelihood [178]* (c8386_g1_i3) (0.833) Cryptodifflugia 581 Cell membrane None 582, [120]*2, SP, TM I operculata Paralog Protein Likelihood [188]*, 2612, 2 1 (c43306_g1_i1) (0.445) 325 , 383*, [449]2 Amphizonella sp 9 565 Extracellular None [47], 121, SP I paralog2 (0.9902) [198]1, 262, 2 2 (TR2587c1_g1_i1) Soluble (0.9998) [344] , [414] , Likelihood (0.997) 4822 Mastigamoeba 813 Cytoplasm (0.3215) Cadherin/Ig-like 66, [200]2, 268, I balamuthi Soluble (0.8872) fold (CTE) [345], 416, 468, paralog1 Likelihood (0.493) [540] (MMETSP0035- 201108093729) Amphizonella sp 8 569 (0.4562) None [71],138, 207, SP I Paralog 2 Soluble (0.8119) 274 [353], 419 (TR9459c0_g1_i1) Likelihood (0.362) Heleopera 743 Cell membrane ADP-ribose GIVAG [56], 139, 2152, TM II sylvatica Protein Likelihood (NTB) [280]2 (Gene.370_NODE_157) (0.493) Nebela sp. 839 Peroxisome Cadherin (CTE) [36], [109], TM I (Gene.16036_NODE_111) (0.3381), Soluble [179], [244], (0.6007)

58 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Likelihood (0.125) [320], [388], [455] Difflugia sp. D2 266 Organelle (0.3485) None [55], [122], I (DN18414_c0_g1_ soluble (0.8094) [190] i5) Likelihood (0.287) Protosproangium 671 Cell membrane None GAVV [48], [124], TM I articulatum Protein Likelihood G [194] Paralog3 (0.937) (TR6387|m.15047) Ceratiomyxa 565 Vacuole (0.4506) None 742, [127], 198, SP, TM I fruticulosa membrane (0.7419) 259, 321, 393, (c6850_g1_i1) Likelihood (0.215) 471 Diplochlamys sp. 666 Cell membrane None GAIAG [120]2, 249, SP, TM Paralog2 (m.4379) Protein Likelihood [311]2, (0.989) Paramoeba 623 Cell membrane none 69, [125], 198, SP, TM I atlantica Protein Likelihood [337]2, 4252 (2184673) (0.498) Mastigamoeba 104 Cell membrane Cadherin/Ig-like [133], 202, 265, TM I balamuthi 2 Protein Likelihood fold (CTE) 343,522 613, paralog2 (0.674) [685], 754 (MMETSP0035- 20110809854) Amphizonella sp 9 685 Cell membrane PRICHEXTENS 68, [133]2, SP, TM II Paralog3 Protein Likelihood N(CTE) [195], 255, 2 2 (Gene.962_NODE_597) (0.87) [287] , [380] *, 455, 5202, 586 Amphizonella sp 9 677 extracellular None [62], [124], SP I paralog1 (0.8519) soluble [199]2, [268]*, 2 (TR17939c0_g1_i1) (0.8141) Likelihood [337], [409] , (0.936) [488]2 Diplochlamys sp. 540 Vacuole (0.559) None [68], 143, 397, SP I Paralog1 Soluble (0.5526) [468] (2112184) Likelihood (0.639) Vermistella 570 Extracellular None 83, 224, 2992, SP I antarctica (0.7931) 368, 4422, 516 (156452) Soluble (0.9906) Likelihood (0.905) Protosproangium 631 Vacuole (0.4093) None [108], [180], SP I articulatum Soluble (0.8173) 248, 329, Paralog4 Likelihood (0.459) [390]*, [487], (TR6387_c1_g4_i [567]* 7) Amphizonella sp 8 656 Extracellular PRICHEXTENS [53], [125], SP II Paralog 1 (0.8352) N(CTE) [200]2, 2662, Soluble (0.992)

59 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

(TR6668_c0_g1_i Likelihood (0.964) 3322, [403]2, 4) 473 Amphizonella sp. 563 Cytoplasm (0.625) PRICHEXTENS 16, 87, [161], II 10 Soluble (0.8464) N(CTE) 237, 2992, 370, Paralog1 Likelihood (0.758) [438] (TR4717c0_g1_i1) Amphizonella sp 2 544 Endoplasmic none 111, [189]2, SP I (TR2129_c1_g3_i reticulum (0.2477) [257], [334]2, 9) Soluble (0.5541) [403]2, 474, Likelihood (0.224) [539]2 Endostelium 683 Cell membrane PRICHEXTENS 2322, [295], SP II zonatum LINKS Protein Likelihood N(CTE) 432, [497]2, (789702) (0.451) [572]2, [647] Endostelium 681 Cell membrane PRICHEXTENS 2312, [293], SP II zonatum Protein Likelihood N(CTE) 430, [495]2, (m.5907) (0.479) [570]2, [645] Protosproangium 735 Extracellular none [70], [137], SP I articulatum (0.4086) [200], [268], Paralog5 Soluble (0.4867) [344], 407, 480 (TR2944c0_g1_i1) Likelihood (0.572) Protosproangium 851 Cell membrane protein kinase GIVFG 572, [131]2, SP, 2TM II articulatum Protein Likelihood [201], [275]2, Paralog1 (0.448) [344], 418, 485 (TR2074c0_g1_i2) Protosproangium 685 Cell membrane None GVGLG [105], 176, TM I articulatum Protein Likelihood [240], 303, Paralog2 (0.997) [378], 450, 512 (TR2817c1_g1_i1_6465) Stenamoeba 662 Cell membrane None [168]2, 3102, SP, TM I stenopodia Protein Likelihood 4442, 5142 (2782329) (0.93) Gocevia fonbrunei 869 Cell membrane Cadherin/Immun GAIIG [66]2, [207], TM II Paralog1 Protein Likelihood oglobulin-like 281, [348], 417, 2 (DN14950_c7_g1_i2) (0.999) fold (CTE) [488] ,

Gocevia fonbrunei 665 Extracellular None 52, [121], 194, SP I Paralog2 (0.5821) 257, [318]2, (DN13664_c1_g1_ Soluble (0.9361) [386]2*, 463 i1) Likelihood (0.656)

Amphizonella sp. 523 Extracellular None [56], [129]2, SP I 10 (0.9162) [201], [273]2, Paralog2 Soluble (0.9984) 3402, 3972, (TR11506c6_g2_i Likelihood (0.929) [468] 3)

60 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Ectocarpus 110 Cell membrane PRICHEXTENS [56], [127], 195, SP, TM II siliculosus 5 Protein Likelihood N(CTE) 260, 326, [396], (0.758) 464* Ministeria vibrans 634 Cell membrane None GFFKR [18]2*, 78 TM I Paralog 3 Protein Likelihood (m.23418) (0.214) Pygsuia biforma 331 Extracellular DUF11 4 repeats GFFKR 267, [685], TM I (ITGA) 2 (0.9162) (CTE) GLAIG [807] Soluble (0.9984) Likelihood (0.93) Lenisia limosa 334 Cytoplasm (0.5704) Ig fold like 11 GFFKR 3102, [701], I (LenisMan47) 6 Soluble (0.9615) repeats (CTE) [826]2 Likelihood (0.632) Thecamonas 211 Cell membrane Ig fold like 2 GFFKR 349*, [894]2*, SP, TM I trahens 4 Protein Likelihood repeats (CTE) [956]2* (0.975) Laminin_g3 (CTE) 1218 1219 Supplemental Table 2: The table shows ITGB motifs, subcellular localizations, signal peptide, 1220 transmembrane and extra domains attached to ITGB protein. The order of taxons are listed in 1221 accordance to the Figure 3. For ITGB motifs of MIDAS, AMIDAS and SyMBS, any non- 1222 canonical amino acids were highlighted in red colour, cysteine rich motifs are either shown as 1223 EGF or PSI domains. The presence of C-terminal motifs are shown in KLXXXD or NPXY/F and 1224 GXXXG. The parentheses indicate number of motifs present in ITGB.

61 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Species MIDA AMIDAS SyMBS CRM Size Signal B-tail Ty Subcellular S Cysteine rich aa Peptide motifs pe localization motifs and (Deeploc)

Transme

mbrane

Clastostelium 25D 29S 81E None 1405aa TM GAVVG II Extracellular recurvatum 27T 32S 120R Likelihood (0.3436) 29S 33D 121P soluble likelihood 123E 202A 122D (0.731) 139R 160D 123E Filamoeba 893D 897S 906D EGF, PSI 2714aa SP, TM GAFIG II Cell membrane nolandi 895Y 899K 949D (NTB) Protein likelihood 897S 900A 950E (0.5463) 956E 1066A 951P 991D 991D 952N Soliformovum 872D 874N 924D PSI 2218aa SP NPXY II Extracellular irregularis 874S 879Y 1012D Present likelihood (0.8972) 876N 880N 1013P soluble likelihood 958S 1071A 1014N (0.9172) 975N 975N 1015E Schizoplasmo 839D, 845S 889D PSI 3641aa SP, TM NPXY II Extracellular diopsis 841T 846Y 923N present likelihood (0.5465) vulgaris 845S 857E 925Q GAAG soluble likelihood 928E 1045A 927L (0.6244) 944D 944D 928E Cavostelium 507D 511S 546D PSI 3296aa TM NPXY II Cell membrane apophysatum 509T 514S 656N present protein likelihood 511S 515D 658D GAASG (0.7274) 592N 708A 659E 642D 642D 660D Mycamoeba 277D 281S 328E Unique 927aa TM NPXY I Cell membrane gemmipara 279T 285N 403N cysteine present protein likelihood 281S 286D 411D rich GVIAG (0.998) 375E 530A 414P region 416E 390D 416E pattern downstre am from 500AA- 860AA

62 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Mastigamoeb 159D, 163S, 198E, PSI, 1315 aa SP, TM NPXA I Cell membrane a balamuthi 161T, 170P, 238S, EGF-like present Protein likelihood 163S 171N 239D, present 5 GAASG (0.561) 241P, 242E 371S, 242E times 271D 271D Mastigella 127D, 131S 198E, PSI, 1292 aa SP, TM NPXA I Cell membrane eilhardi 129S, 134D 207N, EGF-like present Protein likelihood 131S 135D 209D, present 5 GAASG (0.7939) 212E 339A 211P, times 230D 230D 212E Rigifila 124D 128S 164D PSI, 932 aa SP, TM NPXY I Extracellular limosa 126S 131D 201N EGF-like present soluble likelihood 128S 133D 203D present GTATG (0.681) 206E 391A 205P 13 times 237D 237D 206E Echinamoeba 123D, 127S, 174E PSI, 593 aa SP, TM NPXY I Cell membrane exudans 125S, 140G 204N, EGF-like present (2) protein likelihood 127S, 141E 206E, present 3 KMAAA (0.5596) 209E, 338A 208P, times D 239D 239 D 209E GAVIG

Difflugia sp. 115D, 119S 177E, PSI, EGF 605 aa TM NPXY I Cell membrane D2 117S, 136D 208N, present 3 present (2) protein likelihood 119S, 137E 210E, times (0.8598) 213E, 295I 212P, 249D 249D 213E Nebula sp. 131D, 135S 192E, PSI, EGF 630 aa SP, TM NPXY I Cell membrane 133S, 138P 223N, present 3 present (2) protein likelihood 135S, 139Y 225E, times (0.7413) 225E, 361A, 227P, 264D 264D 228E Heleopera 127D 131S 174E PSI, EGF 622 aa SP, TM NPXY I Cell membrane sphagni 129S 134P 216N present 3 present (2) protein likelihood 131S 135H 218E times KADX(4) (0.9332) 221E 352A 220P E 266D 266D 221E

Difflugia sp. 146D, 150S 197E PSI, EGF 630 aa SP, TM NPXY I Cell membrane USP 148S, 153P 239N present 3 present (2) protein 150S, 154Y 241E times likelihood (0.9847) 245E 370A 243P 280D 280D 245E Hyalosphenia 35D 39S 85D EGF 531 aa TM NPXY I Cell membrane papilio 37S 42P 127N present 3 present (2) protein likelihood 39S 43Y 129E times KADX(4) (0.8186) 132E 270D 131P D

63 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

172D 172D 132E Rhizamoeba 143D 147S 182E PSI, EGF 951 aa SP, TM NPXY I Cell membrane saxonica 145S 150Q 223N, present 7 present (2) protein likelihood (paralog 1) 147S 151E 225D, times (0.7754) TR11158c0_g1_i1 228E 383A 227P, 259D 259D 228E Cryptodifflugi 144D 148S 194E PSI, EGF 633aa SP, TM NPXY I Cell membrane a operculum 146S 151P 236N present 3 present (2) protein likelihood 148S 152L 238E times (0.9426) 241E 370G 240P 278D 278D 241E Rhizamoeba 142D, 145S 181E, PSI, EGF 932aa TM NPXY I Cell membrane Saonxica 144S 148D 222N, present 3 present (2) protein likelihood (Paralog2) 146S 149D 224D, times (0.8204) TR7856c0_g1_i1 227E, 329A 226C, 258D 258D 227E Flabellula 163D, 167S 185D, PSI, EGF 1141 aa SP GAAVG I Cell membrane citata 165S 172M 229G present 5 protein likelihood 167S 173E 231D, times (0.7827) 234E, 364A 233P 282D 282D 234E Rhizamoeba 832D 836S 871E, EGF-like 1456 aa SP, TM NPXY I Extracellular saxonica 834S 839P 913N present 2 present (2) (05263) soluble (Paralog 3) 836S 840F 915D times likelihood (0.731) TR7441_c0_g1_i1 918E 1063A 917C 949D 949D 918E Pyxidicula 133D, 137S, 192E, PSI, EGF 621 aa SP, TM NPXY I Extracellular operculata 135S, 140P 224N, present 3 present (0.9867) soluble KADX(4)D 137S, 141Y, 226E, times likelihood (0.9995) GECGG 229E, 360A, 228P, 264D 264D 229E Pygsuia 110D 114S 161D PSI, EGF 1793 aa TM NPXY I Cell membrane biforma 112S 117D 200Q present present Protein likelihood 114S 118D 201D 34 times KAGX(4) (0.9792) 204E 331A 203P E present 231D 231D 204E Lenisia limosa 130D, 134S 182N PSI, EGF SP, TM NPXY I Cell membrane 132S 137N 215N present present Protein likelihood 134S 138D 216E 12 times, RFNAAM (0.9934) 218S 357N 217P EGF like KE 246D 246D 218S 4 times present

Ministeria 85D, 89S 124E EGF-like SP, TM NPXY I Cell membrane vibrans 87S 92D 174N present 5 present protein likelihood (paralog 1) 89S 93D 176D times KAWQQ (0.9827) c191|m.702 179E 310S 178P WEE

64 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

210D 210D 179E Ministeria 73D, 77S 122D EGF-like TM NPXF I Cell membrane vibrans 75S 80N 189N present present (2) Protein likelihood (paralog 2) 77S 81D 191D 14 times KLAQVA (0.9941) c46_198 194E 339A 193P AD 225D 225D 194E Ministeria 594D, 598S 633E EGF-like SP, TM NPXY I Cell membrane vibrans 596S 601N 677N present at present (2) Protein likelihood (paralog 3) 598S 602D 679H N- KLITGA (0.9985) c334|m.1235 682E 824A 681S terminal AD 714D 714D 682E and C- terminal

Didymoeca 444D 446S 524S None 1673aa SP, TM None I Cell membrane costata 446S, 449D 527D protein likelihood 448S 451D 529P (0.7458) 530E, 652A 530E 562D 562D Sphaeroforma 119D 123S 195E EGF-like 1141aa TM KGIQMV I Cell membrane arctica 121S 125D 227N present MD protein likelihood 123S 127D 229D 7 times (0.638) 231E 296A 230P 260D 260D 231E 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 Supplemental Table 3: This table shows Rsem value of protists that have either a complete set 1238 of IMAC, integrins with extra domains or unexpected discovery of an integrin protein. The 1239 values are shown in transcripts per million (TPM). Species ITA ITB Tali PINC ILK Paxillin Actinin Actin eF1α Vinculin n H Filamoeba 10.23 91.07(?) 45.54 30.62 11.71 0.00 138.81 2417.4 5932.12 36.07 nolandi 5 Difflugia sp. D2 0.00 0.00 0.00 6.01 N/A 0.00 63.16 12.24 0.00 0.00 Flabellula 16.31(Paralog1) 421.32 18.56 34.78 5.54 25.09 109.90 2708.6 6180.65 7.28 9.81(Paralog2) 5 citata 14.56(paralog3) 4.90 (Paralog4)

65 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

0.67(paralog 5) Nebela sp. 3.73 11.34 3.95 85.57 N/A 25.24 444.60 653.00 2388.17 6.42 Rhizamoeba 174.00 52.55(Paralog3) 25.63 29.96 3.23 13.13 106.07 12079. 5322.57 20.45 82.61(Paralog2) 17 saxonica 121.63(paralog1 ) Cryptodifflugia 4.93 (paralog1) 7.95 50.58 2.40 2.47 15.50 5.55 272.10 1857.80 9.65 1.80 (paralog2) (c7512 operculata _g1_i1) Nolandella sp. 2.77 N/A 9.84 41.65 N/A 11.66 3.44 2929.7 3621.35 12.38 AFSM 1 Arcella 2.98 N/A 4.91 1.19 N/A 11.19 43.20 110.29 2001.59 4.71 (c1709 intermedia 0_g2_i 1) Rigifila ramosa N/A 810.35 348.93 59.65 N/A 141.87 127.94 18519. 2284.60 7.37 18 Pygsuia biforma 2.14 1.45 41.95 111.50 N/A 13.47 103.13 17397. 7071.38 N/A 16 Ministeria 1.71 (Paralog 1) 0.00(Paralog 1) 3.88 13.43 3.38 13.43 8.72 277.70 310.59 1.15 2.90 (Paralog 2) 5.17(Paralog 2) vibrans 20.02 (Paralog 3) 10.56 (Paralog3) 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 Supplemental Table 4: A). Summary of amoebozoan ITA type I table listing: number of 1259 paralogs, Beta propeller, FG-GAP and canonical motifs. The table also shows presence of signal 1260 peptide and transmembrane region, c-terminal interacting motif, GXXXG motif and Ig-fold 1261 like/Cadherin domain. B). Summary of amoebozoan ITA type II table listing: number of 1262 paralogs, Beta propeller, FG-GAP and canonical motifs. The table also shows presence of signal 1263 peptide and transmembrane region, c-terminal interacting motif, GXXXG motif and extra 1264 domain attached to ITA. 1265 1266 1267

66 bioRxiv preprint doi: https://doi.org/10.1101/2020.04.29.069435; this version posted April 30, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1268 1269

1270

67