Social complexity, life-history and lineage influence the molecular basis of caste in a major transition in evolution

Christopher Wyatt ([email protected]), University College London
Michael Bentley, University College London
Daisy Taylor, University College London
Emeline Favreau, University College London, https://orcid.org/0000-0002-0713-500X
Ryan Brock, University of East Anglia, https://orcid.org/0000-0003-2977-1370
Benjamin Taylor, University College London
Emily Bell, School of Biological Sciences, University of Bristol, https://orcid.org/0000-0002-7309-5456
Ellouise Leadbeater, Royal Holloway University of London, https://orcid.org/0000-0002-4029-7254
Seirian Sumner, UCL, https://orcid.org/0000-0003-0213-2018

Article

Keywords: Superorganismality, Castes, Wasps, Transcriptomics

Posted Date: August 27th, 2021

DOI: https://doi.org/10.21203/rs.3.rs-835604/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Social complexity, life-history and lineage influence the molecular basis of caste in a major transition in evolution

Christopher D. R. Wyatt1*, Michael Bentley1,†, Daisy Taylor1,2,†, Emeline Favreau1, Ryan E. Brock2,3, Benjamin A. Taylor1, Emily Bell2, Ellouise Leadbeater4 & Seirian Sumner1*

1 Centre for Biodiversity and Environment Research, University College London, London, UK.
2 School of Biological Sciences, University of Bristol, United Kingdom, BS8 1TQ.
3 School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, Norfolk, NR4 7TJ, UK.
4 Department of Biological Sciences, Royal Holloway University of London, Egham, UK.
* Corresponding authors
† Equal contribution

Christopher Douglas Robert Wyatt
Centre for Biodiversity and Environment Research,
University College London, London, UK.
E-mail: [email protected]; Phone +447307390637

Seirian Sumner
Centre for Biodiversity and Environment Research,
University College London, London, UK.
E-mail: [email protected]

Key words: Superorganismality, Castes, Wasps, Transcriptomics

Abstract

Major evolutionary transitions describe how biological complexity arises, e.g. in the evolution of complex multicellular bodies and superorganismal societies. Such transitions involve the evolution of division of labour, e.g. among queen and worker castes in insect societies. A key mechanistic hypothesis for the evolution of division of labour is that a shared set of genes co-opted from a common solitary ancestral ground plan (a so-called genetic toolkit for sociality) regulates insect castes across different levels of social complexity. The vespid wasps represent an excellent system in which to test this. Here, using conventional and machine learning analyses of brain transcriptome data from nine species of vespid wasps, we find evidence of a shared genetic toolkit across species representing different levels of social complexity, with a large suite of genes classifying castes correctly across species. However, we also found evidence of additional fine-scale differences in predictive gene sets, functional enrichment and rates of gene evolution that were related to level of social complexity, but also to life-history traits (e.g. mode of colony founding). Thus, there appear to be shifts in the gene networks regulating social behaviour and rates of gene evolution that are influenced by innovations in both social complexity and life-history. These results suggest that the concept of a shared genetic toolkit for sociality may be too simplistic to fully describe the process of the major transition to sociality, even within a single lineage. Diversity in lineage, social complexity and life-history traits must be taken into account in the quest to uncover the molecular bases of the major transition to sociality.


Main

The major evolutionary transitions span all levels of biological organisation, facilitating the evolution of life’s complexity on earth via cooperation between single entities (e.g. genes in a genome, cells in a multicellular body, insects in a colony), generating fitness benefits beyond those attainable by a comparable number of isolated individuals1. The evolution of sociality is one of the major transitions and is of general relevance across many levels of biological organisation, from genes assembled into genomes and single cells into multicellular entities, to insects cooperating in superorganismal societies. The best-studied examples of sociality are in the hymenopteran insects (bees, wasps and ants), a group of over 17,000 species exhibiting different forms of sociality across the transition from simple sociality (with small societies where all group members are able to reproduce and switch roles in response to opportunity) through to complex societies (consisting of thousands of individuals, each committed during development to a specific cooperative role and working for a shared reproductive outcome within the higher-level ‘individual’ of the colony, known as the ‘superorganism’2). Recent analyses of the molecular mechanisms of insect sociality have revealed how conserved suites of genes, networks and functions are shared among independent evolutionary events of insect superorganismality3–8. An outstanding question is whether there are fine-scale differences in the genetic mechanisms operating at different levels of complexity across the spectrum of the major transition, from the simplest to the most complex societies.

A key step in the evolution of sociality is the emergence of a reproductive division of labour, where some individuals commit to reproductive or non-reproductive roles, known as queens and workers respectively in the case of insect societies. An overarching mechanistic hypothesis for social evolution is that the repertoire of behaviours typically exhibited in the life cycle of the solitary ancestor of social species was uncoupled to produce a division of labour among group members, with individuals specialising in either the reproductive (‘queen’) or provisioning (‘worker’) phases of the solitary ancestor9. Such phenotypic decoupling implies that there will be a conserved mechanistic toolkit that regulates queen and worker phenotypes in species representing different levels of social complexity across the spectrum of the major transition (reviewed in10). An alternative to the shared toolkit hypothesis is that the molecular processes regulating social behaviours in non-superorganismal societies (where caste remains flexible, and selection acts primarily on individuals) differ fundamentally from those processes that regulate social behaviours in superorganismal societies 11,12. Phenotypic innovations across the animal kingdom have been linked to genomic evolution: taxonomically-restricted genes13–16, rapid evolution of proteins17,18 and regulatory elements17,19 have been found in most lineages of social insects20. Indeed, some recent studies have suggested that the processes regulating different levels of social complexity may be different8,17,19,21. The innovations in social complexity and the shift in the unit of selection (from individual- to group-level22) that occur in a major transition may therefore be accompanied by shifts in genomic processes and evolution, raising the question of whether the concept of a universal conserved genomic toolkit that regulates social behaviours across the spectrum of the major transition is too simplistic23. The roles of conserved and novel processes are not necessarily mutually exclusive; novel processes may coincide with phenotypic innovations, whilst conserved mechanisms may regulate core processes at all stages of social evolution.

Until recently, the data available to test these hypotheses were largely limited to species that represent either the most complex (superorganismal) forms of sociality (e.g. most ants, honeybees24), or the simplest forms of social complexity as non-superorganisms (e.g. Polistes wasps7,25–27 and incipiently social bees28–31), the latter of which likely represent the first stages in the major transition. Recent studies have identified core gene sets underlying caste-differentiated brain gene expression across a range of ants5 and bees8; however, ants lack ancestrally non-superorganismal representatives and so may not be informative for studying simpler societies and, putatively, the process of social evolution32. Analyses of molecular evolution across two independently evolved lineages of sociality in bees showed increased levels of regulatory complexity with social complexity. These insights highlight the need for a fine-scale dissection of the molecular processes regulating social behaviour for species representing different forms of social complexity8, but also the need to consider the influence of life-history traits on gene evolution, as a selective process that is independent of complexity per se.

The vespid wasps33 are a promising group for unpicking the molecular basis of social complexity: they exhibit the full diversity of social complexity among some 1,100 species, as well as diversity in life-history traits such as mode of colony founding. Aculeate wasps are of paramount importance in studying the evolution of sociality in the Hymenoptera as they represent the evolutionary root to both bees and ants34. However, to date sociogenomic studies of caste evolution in wasps have been limited to Polistes 7,25,35. Here we generated brain transcriptomic data of caste-specific phenotypes for nine species of social wasps, representing diversity in social complexity and mode of colony founding, a life-history trait that is independent of social complexity (Fig. 1). We employed a machine-learning algorithm, support vector machines (SVMs), alongside more conventional gene expression analyses to conduct a fine-scale analysis of the molecular processes associated with caste, and to determine whether there is a conserved genetic toolkit for social behaviour in species that display different forms of sociality and serve as putative proxies for stages of the major transition from non-superorganismal (simple) to superorganismal (complex) species in vespid wasps (Aim 1). We then further interrogated these data to identify how transcriptomic signatures of social behaviour are influenced by social complexity (i.e. simpler versus more complex societies), mode of colony foundation (i.e. independent-founding versus swarm-founding) as a key life-history trait in vespids, and lineage (vespines versus polistines). Our results provide the first evidence of a conserved genetic toolkit across forms of social complexity in the spectrum of the major transition to sociality in wasps; they also reveal fine-scale differences in the molecular patterns and processes between simple and complex societies, and with life-history traits. Finally, we discuss how patterns of gene evolution contrast with those reported in bees and ants. These results suggest the concept of a shared genetic toolkit across the spectrum of sociality and across lineages may be too simplistic, and that level of social complexity and life-history traits influence the molecular processes underpinning social behaviour.


Results

We chose one species from each of nine different genera of social wasps from among the two main subfamilies, Polistinae and Vespinae; collectively these species represent the full spectrum of social diversity (detail on species choice given in Fig. 1 and Methods). For each species, we sequenced RNA extracted from whole brains of adults to construct de novo brain transcriptomes for the two main social phenotypes: adult reproductives (defined as mated females with developed ovaries, henceforth referred to as ‘queens’ for simplicity; see Supplementary Table S1) and adult non-reproductives (defined as unmated females with no ovarian development, henceforth referred to as ‘workers’; see Supplementary Methods). Using these data, we reconstructed a phylogenetic tree of the Hymenoptera from single-copy orthologous genes (Orthofinder36), which recovered the expected phylogenetic relationships (Supp. Fig. 1). This dataset allows us to test the extent to which the same molecular processes underpin the evolution of social phenotypes of different forms of sociality, as extant proxies in the major transition to superorganismality in wasps.


Aim 1: Is there a shared genetic toolkit for caste among species across the major transition from non-superorganismality to superorganismality in vespid wasps?

We found several lines of evidence for a shared genetic toolkit for caste across the species using two different analytical approaches.

Caste explains gene expression variation, after species-normalisation

The main factor explaining gene expression variation (using single-copy orthogroups) was species identity, and to some extent subfamily, as all the Polistinae clustered tightly (Fig. 2a). However, since we are interested in determining whether there is a shared toolkit of caste-biased gene expression across species, we needed to control for the effect of species in our data. To do this, we performed a between-species normalisation on the transcripts per million (TPM) score, scaling the variation of gene expression to a range of -1 to 1 (see Supplementary Methods). After species-normalisation, the samples separated mostly into queen and worker phenotypes in the top two principal components (Fig. 2b). This suggests that subsets of genes (a potential toolkit) are shared across these species and are representative of caste differences. However, there were outliers: Brachygastra did not cluster with any of the other samples (despite the correct PC1 queen direction); Agelaia showed little caste-specific separation, and in the opposite direction to the other species (Fig. 2b). These two outlier species may not share the same caste-specific patterns as the other species, but we cannot rule out data and/or sampling anomalies.
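The exact normalisation procedure is described in the Supplementary Methods, which are not reproduced here. As a rough illustration of the idea only, the sketch below assumes a per-gene min-max rescaling to the range -1 to 1 within each species; the function name, data layout and TPM values are hypothetical.

```python
def scale_within_species(tpm_by_sample):
    """Rescale each gene to [-1, 1] across the samples of ONE species.

    tpm_by_sample maps a sample name to a list of TPM values, with genes
    in the same order for every sample. Min-max scaling is an assumption,
    not necessarily the procedure used in the study.
    """
    samples = list(tpm_by_sample)
    n_genes = len(tpm_by_sample[samples[0]])
    scaled = {s: [0.0] * n_genes for s in samples}
    for g in range(n_genes):
        values = [tpm_by_sample[s][g] for s in samples]
        lo, hi = min(values), max(values)
        if hi == lo:
            continue  # no variation within this species: leave the gene at 0
        for s in samples:
            scaled[s][g] = 2 * (tpm_by_sample[s][g] - lo) / (hi - lo) - 1
    return scaled


# Hypothetical TPM values for three orthologous genes in two species.
raw = {
    "Polistes": {"queen": [10.0, 200.0, 5.0], "worker": [30.0, 100.0, 5.0]},
    "Vespa": {"queen": [1.0, 2000.0, 7.0], "worker": [4.0, 900.0, 9.0]},
}
scaled = {species: scale_within_species(data) for species, data in raw.items()}
```

Because each species is rescaled separately, large between-species differences in absolute expression (e.g. the second gene above) no longer dominate, and only the within-species direction of caste bias remains. With a single queen and worker sample per species the values collapse to -1, 0 or 1; replicate samples would give intermediate values.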

Next, we used inferred lists of orthologs that allowed some flexibility in the number of isoforms for each orthogroup (maximum of three isoforms, taking the most highly expressed) and in missing data (maximum of two species), allowing larger matrices of genes to be tested for differences between castes (Supplementary Table S2). Using this approach we detected 57 orthologous genes that were caste-biased (queen- or worker-biased) in four or more species (Fig. 3a; Supplementary Table S3) and 353 orthologous genes (Supplementary Table S3) with caste-biased expression in at least two of nine species in the same direction (either queen- or worker-biased only). There was overrepresentation of pheromone, chitin/odorant binding and muscle-related GO terms using these two cut-offs (Fig. 3b/c; Supplementary Table S3). No ortholog showed consistent caste-biased differential expression across all nine wasp species (Fig. 3a; Supplementary Table S3; using unadjusted p values < 0.05). Despite this, notable signatures of caste regulation were apparent across the species, with the exception of Brachygastra. For example, orthogroup OG0001418 was upregulated in queens from eight of the nine species and is predicted to belong to the vitellogenin gene family (using the Metapolybia protein sequence to represent the orthogroup), a transcriptional target and mediator of the insect physiology-regulator juvenile hormone (JH). JH is known as a pleiotropic master regulator of social traits, such as aggression, foraging behaviour and reproduction; in both Vespinae and Polistinae JH has been shown experimentally to regulate reproduction and is thought to be an honest fertility signal37,38. A second gene of interest that was consistently queen-biased was oocyte zinc finger 22. Zinc finger proteins have diverse functions but broadly they regulate gene expression by binding to DNA or RNA and controlling transcription; they are caste-biased in a range of social insects and have also been identified as targets for positive selection in social evolution 21, and so could be key transcription factors driving wide expression patterns among castes across our species.
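The "same direction in at least two species" criterion can be read in more than one way. The sketch below illustrates one reading: a gene counts only if every species in which it is significantly caste-biased agrees on the direction, and at least two species are biased. The function names, the sign convention (positive log fold change meaning queen-biased) and the toy values are assumptions made for illustration.

```python
def bias_direction(log_fc, p_value, alpha=0.05):
    """Return 'queen', 'worker' or None for one gene in one species.

    Assumes positive log fold change means higher expression in queens.
    """
    if p_value >= alpha or log_fc == 0:
        return None
    return "queen" if log_fc > 0 else "worker"


def consistent_caste_bias(per_species, min_species=2):
    """Return the shared bias direction, or None if the criterion fails.

    per_species maps a species name to a (log_fc, p_value) pair.
    """
    called = [
        direction
        for direction in (bias_direction(fc, p) for fc, p in per_species.values())
        if direction is not None
    ]
    if len(called) >= min_species and len(set(called)) == 1:
        return called[0]
    return None


# Hypothetical per-species results for three orthogroups.
queen_like = {"Polistes": (2.1, 0.001), "Vespa": (1.4, 0.02), "Agelaia": (-0.2, 0.6)}
mixed = {"Polistes": (2.1, 0.001), "Vespa": (-1.4, 0.02)}
single = {"Polistes": (2.1, 0.001), "Vespa": (0.3, 0.4)}

queen_like_call = consistent_caste_bias(queen_like)
mixed_call = consistent_caste_bias(mixed)
single_call = consistent_caste_bias(single)
```

Raising `min_species` to 4 would mimic the stricter of the two cut-offs described above, although that cut-off, as worded, does not require a shared direction.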


A toolkit of many genes with small effects predicts caste across different forms of sociality in wasps

Conventional differential expression analyses (e.g. edgeR) require a balance of P value cut-offs and fold change requirements to reduce false-positive and false-negative errors39. Therefore, consistent patterns of many genes with smaller effect sizes may be missed when applying strict statistical measures. Support vector machine (SVM) learning approaches use a supervised learning model capable of detecting subtle but pervasive signals in differential expression between conditions (e.g. for classification of single cells40,41, cancer cells42 and in social insect castes43). We used this approach to test whether gene expression can successfully classify caste identity for unknown samples; accurate classification of samples as queens or workers based on their global transcription patterns would be evidence for a genetic toolkit underpinning social phenotypes.

Starting with a “leave-one-out” SVM approach, we classified queen and worker samples, using a predictive gene set generated from a model trained on caste-specific gene expression from eight of the nine species, with the ninth species being the test sample. We applied three different iterations of the leave-one-out approach using different Orthofinder conversion tables (Supp. Table S2) and a radial SVM model (see Methods; Supplementary Fig. 2). Our first approach used the “true-single-copy” orthologues that could be detected with sufficient expression (Fig. 4a-left; Supp. Table S4) across all nine species (n=1221 genes); after progressive feature selection (based on linear regression, from left to right; see Methods) the top 259 genes (p < 0.05 from linear regression) could predict the queen caste correctly in seven of the nine species (Fig. 4a-left; with >=0.6 likelihood in the queen sample within the top 20% of genes); the two outlier species (Agelaia and Brachygastra) showed generally lower predictions of queen likelihood (<0.5) and had previously been highlighted as outliers (Figs. 2 & 3). We explored two other orthology lists to test whether caste predictions would remain accurate if the single-copy ortholog rule was relaxed and more orthogroups were included: in the second model we allowed up to three isoforms per gene (Fig. 4a-middle; Supp. Table S4; n=1831 genes), and in the third we allowed up to two out of nine species to have a missing gene representative of the orthogroup, plus up to three isoforms per gene (Fig. 4a-right; Supp. Table S4; n=5536 genes). Even with these relaxed conditions, caste was classified accurately for all species except for the aforementioned outliers. This suggests that large numbers of genes with caste-biased expression exist across social wasps.
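The structure of the leave-one-out design (train on eight species, test on the ninth) can be sketched as follows. This is only an illustration of the hold-out scheme: to stay dependency-free it swaps in a nearest-centroid classifier rather than the radial SVM used in the study, omits feature selection, and uses hypothetical species-normalised expression values for two genes.

```python
def leave_one_species_out(samples):
    """Yield (held_out_species, training_samples, test_samples) for each species."""
    for held_out in sorted({s["species"] for s in samples}):
        train = [s for s in samples if s["species"] != held_out]
        test = [s for s in samples if s["species"] == held_out]
        yield held_out, train, test


def centroid(rows):
    """Mean expression profile of a list of equal-length expression vectors."""
    return [sum(col) / len(col) for col in zip(*rows)]


def predict_caste(train, expression):
    """Stand-in classifier (NOT an SVM): assign the caste with the nearest centroid."""
    centroids = {
        caste: centroid([s["expr"] for s in train if s["caste"] == caste])
        for caste in ("queen", "worker")
    }

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda c: squared_distance(centroids[c], expression))


# Hypothetical normalised expression of two genes in three species.
samples = [
    {"species": "Polistes", "caste": "queen", "expr": [1.0, -1.0]},
    {"species": "Polistes", "caste": "worker", "expr": [-1.0, 1.0]},
    {"species": "Vespa", "caste": "queen", "expr": [0.8, -0.6]},
    {"species": "Vespa", "caste": "worker", "expr": [-0.9, 0.7]},
    {"species": "Polybia", "caste": "queen", "expr": [0.6, -1.0]},
    {"species": "Polybia", "caste": "worker", "expr": [-1.0, 0.4]},
]

accuracy = {}
for held_out, train, test in leave_one_species_out(samples):
    correct = sum(predict_caste(train, s["expr"]) == s["caste"] for s in test)
    accuracy[held_out] = correct / len(test)
```

The key property is that no sample from the held-out species contributes to training, so a correct call reflects a caste signal shared across species rather than species-specific expression.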

We found some level of functional enrichment among the predictive genes that remained in the models after feature selection. There were consistent patterns of enrichment for GO terms in the two more relaxed models (Fig. 4b-middle & 4b-right), with genes enriched mostly for ionic transport and visual/sensory perception (Supplementary Table S4). Expression of the top 50 genes (from linear regression; chosen from Fig. 4a-left) allows us to identify the putatively most important genes (Fig. 4c); some have previously been identified as associated with caste differentiation in social insects (e.g. vitellogenin, zinc finger (as mentioned above) and ubiquitin); others have not been associated with caste previously but may have relevant functions in regulating caste. For example, striatin-4 (OG0005432) is involved in calmodulin binding and thus may play a role in activating (or inactivating) calcium-binding; adenosylhomocysteinase (OG0002120), one of the most conserved proteins in living organisms, facilitates local transmethylation and thus may be important in active transcription and epigenetic processes44. Interestingly, the vast majority of the putatively important genes are downregulated in queens and upregulated in workers, suggesting that the transcriptomic demands of worker behaviour may be more diverse than those involved in being a queen, whilst queens express a more restricted molecular toolkit, as expected if they are specialised in reproduction.

We compared the genes and GO terms obtained from the SVM and edgeR methods (Fig. 3c). There was little overlap in enrichment of GO terms between the two datasets. Moreover, there was less overlap than expected by chance between the 1420 SVM predictor genes and the 353 genes identified as differentially expressed in at least two species in the same direction (edgeR; P value < 0.05): only 35 genes overlapped (hypergeometric test: p = 1.89e-14, 2.59-fold under-enriched; Supplementary Table S4). Of the genes that are present in both sets, some have previously been identified as having relevance to social evolution and caste differentiation; these include Vitellogenin (as above) 45 (two protein isoforms: OG0001119 and OG0001418), zinc finger (as above) (OG0005731) and gonadotrophin-releasing hormone, which is related to the regulation of reproduction. ATP-synthase (OG0006255 and OG0006612) has caste-specific expression in the weaver ant46, and esterase E4-like (OG0000783) is upregulated in young honeybee queens compared to nurses at the proteomic level47. There are also other genes of interest which, to our knowledge, have not previously been associated with caste, including Toll-like receptor 8 (OG0001441) (see Supplementary Table S4). Finally, 40% of the overlapping genes (n=14) are uncharacterised proteins, suggesting a role for potentially novel genes; novel (or taxon-restricted) genes have been previously highlighted for their potential role in social evolution across bees and wasps48.
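An under-enrichment test of this kind can be sketched with the lower tail of the hypergeometric distribution. The background size is not stated in the text above; the sketch assumes it is the 5,536 orthogroups of the most relaxed gene set, an assumption that happens to give a fold change close to the reported 2.59. The function and variable names are illustrative.

```python
from math import comb


def hypergeom_lower_tail(overlap, population, set_a, set_b):
    """P(X <= overlap) when set_b genes are drawn without replacement
    from a population that contains set_a 'marked' genes."""
    total = comb(population, set_b)
    favourable = sum(
        comb(set_a, k) * comb(population - set_a, set_b - k)
        for k in range(overlap + 1)
    )
    return favourable / total


population = 5536  # ASSUMPTION: background = most relaxed orthogroup set
svm_genes, de_genes, observed = 1420, 353, 35

expected = svm_genes * de_genes / population  # overlap expected by chance
fold_under = expected / observed              # fold under-enrichment
p_lower = hypergeom_lower_tail(observed, population, svm_genes, de_genes)
```

Under this assumed background roughly 90 shared genes would be expected by chance, so an observed overlap of 35 is far smaller than expected; the exact p value depends on the background actually used.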


Aim 2: How do level of social complexity, phylogeny and life-history influence the molecular basis of castes?

Level of social complexity may influence the genes regulating caste

To explore whether there are any finer-scale differences among the different levels of social complexity, we trained an SVM model using Mischocyttarus, Polistes, Angiopolybia and Metapolybia as representatives of the simpler societies in the major transition (see Fig. 1), and tested how well this gene set classified castes for the four species representing the more complex societies in the major transition (Polybia, Agelaia, Vespa and Vespula; see Fig. 1). Brachygastra was excluded due to its poor performance overall (see Fig. 4) and to balance sample sizes in training and test sets. If castes in the test species classify well, this would suggest that the processes regulating castes in the simpler societies are equally important in the more complex societies (i.e. there is no caste toolkit specific to simple societies that is absent, putatively lost, in more complex societies). Conversely, if the test species do not classify well, this would suggest that there are distinct processes regulating caste in the simpler societies that are less important in more complex forms of sociality.

When training the model on the four species with simple societies, we found that both vespines (Vespa and Vespula) had good classification confidences of around 0.8 (Fig. 5a), with around 800 genes retained after feature selection. Similarly, Polybia also classified well, though with lower confidence. Castes in Agelaia could not be classified correctly; however, this species was identified as an outlier in our first analyses, and this may reflect some facet of its biology that we do not yet understand, or an anomaly in the data. Considering these four tests together, a set of 277 gene orthologs was found (Supplementary Table S5; p value < 0.05 in all four species tests) that represents a core set of genes that classify caste correctly in simple societies. Based on these species, these results suggest that the genetic toolkit for simple societies is well conserved in the more complex societies that we sampled, although future work should confirm this by expanding the repertoire of species.

The reciprocal test performed less well, suggesting there may be different processes regulating more complex sociality which are absent from the simpler societies. We trained the SVM using the four species with the more complex societies (Polybia, Agelaia, Vespa and Vespula) and tested it on the four species with simpler societies (Mischocyttarus, Polistes, Metapolybia and Angiopolybia). After feature selection, around 350 genes were found in each of the four tests, and 289 were found in all four tests, describing castes in these more complex societies (Supplementary Table S5). In general, these genes were successful in classifying castes for the simpler societies (Fig. 5a-lower), although none of the species matched the high classification confidence of the vespines in the reciprocal test. The lower overall success of the gene set trained on the more complex species in classifying castes in species from the simpler societies suggests there may be different (additional) processes regulating castes at the more complex levels of sociality.

To explore this idea further, we tested whether there was significant overlap among the three putative toolkits, by comparing the identities of genes predictive of ‘simpler societies’ (n=277), ‘more complex societies’ (n=289), and ‘full spectrum’ (Fig. 4a-left; n=259) toolkits. We found significant overlap between the combined genes in the simpler and the more complex ‘toolkits’ with the full spectrum set (Fig. 5b grey vs pink; blue vs pink), but not between the simpler and more complex societies (grey vs blue). This suggests that there may be distinct differences in the genes regulating castes at different levels of social complexity. We found further suggestive evidence for this at the putative functional level: there were no consistent patterns of GO terms across the two toolkits, with only the complex society toolkit having significant levels of enrichment after correction (Fig. 5c; mostly cation/ion transport; Supplementary Table S5).

Further support for there being different molecular processes shaping caste transcription at different levels of social complexity was found in the patterns of gene evolution. Overall, the highest rates of dN/dS were in the species with simpler societies; this was true across all nine species (dN/dS mean of orthogroups in foreground branch with simple species = 0.1639218; with complex species = 0.1015024; Wilcoxon test W = 307318, p-value < 2.2e-16) and also just within the Polistinae (dN/dS mean of orthogroups in foreground branch with simple species = 0.180918; with complex species = 0.1327845; Wilcoxon test W = 119615, p-value = 2.824e-11). Some orthogroups had experienced significant positive selection (i.e. on the given foreground branch, null model rejected, chi-square test p < 0.05, dN/dS >= 1; a total of eleven orthogroups). Six of these orthogroups were under positive selection in the simplest polistines relative to the more complex polistines (Supplementary Table S7-ST15 n = 6; OG0002957 (transmembrane 234 homolog isoform X1), OG0003204 (mitochondrial ribonuclease P 3, upregulated in zebrafish behavioural experiments49), OG0003159 (lysozyme c-1-like, in venom50), OG0003184 (doublesex isoform X1, female-specific gene expression in ant51), OG0005376 (uncharacterized protein LOC107063793), OG0005104 (box C D snoRNA 1)). Four were under positive selection in the more complex polistines, relative to the simpler species (Supplementary Table S7-ST16: OG0003362 (UMP-CMP kinase isoform X1), OG0003491 (uncharacterized protein LOC107067502), OG0004867 (uncharacterized protein LOC106789623), OG0005164 (probable tubulin polyglutamylase TTLL9)). However, when the vespines were also included, only one orthogroup was under positive selection (Supplementary Table S7-ST14; OG0003694 (transmembrane 59-like isoform X1)), suggesting there may be some lineage effects here, although wider taxonomic sampling of vespines would be required to confirm this. Of the 11 significant orthogroups identified across these analyses, only one was differentially expressed with respect to caste in two species (OG0004867, uncharacterized protein LOC106789623), and four other orthogroups (OG0003362: UMP-CMP kinase isoform X1; OG0003491: uncharacterized protein LOC107067502; OG0003159: lysozyme c-1-like; OG0003184: doublesex isoform X1) were differentially expressed in just one species (Supplementary Table S7-ST27).
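For readers unfamiliar with the W statistic quoted above, the sketch below computes it for two small groups of dN/dS values, assuming an unpaired (rank-sum) comparison in which W counts the pairs where a value from the first group exceeds a value from the second, with ties counted as one half (equivalent to the Mann-Whitney U statistic). The dN/dS values are invented for illustration, and no p value is computed.

```python
def wilcoxon_w(x, y):
    """Rank-sum statistic W for two independent samples.

    Counts pairs (x_i, y_j) with x_i > y_j; tied pairs contribute 0.5.
    This pairwise form is O(len(x) * len(y)) and meant for illustration only.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


# Hypothetical per-orthogroup dN/dS values on foreground branches.
simple_dnds = [0.21, 0.15, 0.30, 0.12]
complex_dnds = [0.10, 0.15, 0.08, 0.11]

w = wilcoxon_w(simple_dnds, complex_dnds)
max_w = len(simple_dnds) * len(complex_dnds)
```

Here W is 14.5 out of a possible 16, meaning that in almost every pairwise comparison the orthogroup from the simpler species has the higher dN/dS; a value near half the maximum would indicate no difference between groups.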

Taken together, these results provide evidence that the molecular processes regulating caste differentiation vary depending on the level of social complexity. The finding of more rapid rates of protein evolution in the simpler societies (Supplementary Figure 3c) was unexpected, since studies of gene evolution in social bees have found higher rates of evolution in the more complex societies8. Molecular processes shaping the evolution of sociality may differ with lineage and/or life-history. Patterns of molecular evolution should be further explored with broader analysis of species representing different levels of complexity within lineages, but also with wider taxonomic breadth when comparing across lineages.


No evidence that the genes regulating castes vary among subfamilies

Using the same reciprocal SVM approach, we found little evidence that subfamily influences the molecular processes regulating caste. Using the seven polistines as the training set, queens and workers in the two vespines classified with 100% confidence (Supplementary Table 6 [tab Vespula_Polistinae + Vespa_Polistinae]; the reverse of this test could not be performed due to low sample sizes for a vespine training set). Despite the small number of vespines sampled, these results suggest that the genes regulating castes are shared across these two subfamilies.


Life-history traits may influence the genes regulating caste

397 A key life-history trait that varies in social wasps is their mode of colony founding.

398 Some species form new colonies alone (‘independent founders’) whilst others form

399 new colonies as a swarm containing queens and workers (‘swarm founders’).

400 Founding behaviour is independent of level of complexity: e.g. the species with the

401 simplest (e.g. Polistes and Mischocyttarus) and the most complex (Vespa and

402 Vespula) societies are all independent founders; equally the swarm-founders

403 (Angiopolybia, Metapolybia, Agelaia, and Polybia) exhibit a spectrum of social

404 complexity (see Fig. 1). It is possible therefore, that the molecular basis of caste

405 reflects this key life-history trait, and not just social complexity. We found some

406 evidence to support this. The gene set obtained from an SVM trained to classify

407 castes in the swarm-founders was very successful in classifying castes for the

408 independent founders, with classifications between 0.7 and 1 for all test species.

409 Conversely, the reciprocal analysis performed poorly: the gene set obtained from an

410 SVM trained on the independent-founders was unable to predict castes at >0.7 for

411 three of the four species (Supplementary Figure 4; Supplementary Table 6

412 [Swarm/Independent]). The influence of colony founding behaviour on molecular

413 processes was further supported when comparing the rates of protein evolution in

414 the swarm-founders relative to the independent founding species (Supplementary

415 Table 7 - ST12): the rates of protein evolution are on average higher among the

416 swarm-founders (mean = 0.1461955) than in the species with complex societies, but

417 lower than in the species with simple societies (Wilcoxon tests W = 384912, p-value = 1.847e-07, and

418 W = 126972, p-value = 0.0001657 respectively). These results suggest that there

419 may be additional caste-related molecular processes involved in the evolution of

420 innovations in life-history traits (in this case, mode of colony founding) that are

421 independent of social complexity.
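The Wilcoxon comparisons reported above can be illustrated with a minimal sketch. It computes the rank-sum statistic W in the form that R's wilcox.test reports it (the sum of ranks of the first group minus n1(n1+1)/2, with average ranks for ties). Python is used here only for illustration, the function name is hypothetical, and the input values are made-up placeholders rather than rates from this study.

```python
def wilcoxon_w(x, y):
    """Rank-sum statistic W for group x versus group y (average ranks for ties)."""
    # Pool both groups, remembering which group each value came from (0 = x, 1 = y).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        # Find the block of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the mean of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1 = len(x)
    return rank_sum_x - n1 * (n1 + 1) / 2.0


# Placeholder values only (NOT data from this study).
swarm_founders = [0.15, 0.12, 0.20]
complex_societies = [0.10, 0.11, 0.13]
w = wilcoxon_w(swarm_founders, complex_societies)  # 8.0
```

W counts the pairs in which a value from the first group exceeds a value from the second (ties count as one half), so a W close to n1 × n2 indicates that the first group tends to have the higher rates.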

422

423

424 Discussion

425 Major transitions in evolution provide a conceptual framework for understanding the

426 emergence of biological complexity. Discerning the processes by which such

427 transitions arise provides us with critical insights into the origins and elaboration of

428 the complexity of life. In this study we explored the evidence for two key hypotheses

429 on the molecular bases of social evolution by analysing caste transcription in nine

430 species of wasps. As predicted, we find evidence of a shared genetic toolkit across

431 the spectrum of social complexity in wasps; importantly, using machine learning we

432 reveal that this toolkit likely consists of many hundreds of genes of small effect (Fig.

433 4). However, in sub-setting the data by level of social complexity and by a key life-

434 history trait (colony founding), several important new insights are revealed. Firstly,

435 there appears to be a putative toolkit for castes in the simpler societies that largely

436 persists across the major transition, through to superorganismality. Secondly,

437 different (additional) processes appear to become important at more complex levels

438 of sociality. Further sampling is required to determine the extent to which the role of

439 these additional processes is driven by the evolution of superorganismality, and

440 specific innovations that have been key in the major transition to sociality. Thirdly,

441 the so-called toolkit for sociality is also influenced by the mode of colony founding;

442 the influence of ecology and life-history is largely overlooked in the current literature

443 on the molecular basis of sociality. Finally, we discuss apparent differences among

444 lineages, highlighting the importance of wider taxonomic consideration in the quest

445 to understand the molecular basis of sociality and the nature of the major transition.

446 Taken together, we suggest that the concept of a shared molecular toolkit for

447 sociality may be too simplistic and that future studies should consider the

448 evolutionary impact of other traits on molecular processes, specifically life-history

449 and ecology, alongside castes.

450

451 The first important finding is that we identified a substantial set of genes that

452 consistently classify caste across most of the species, irrespective of the level of

453 social complexity. The taxonomic range of samples used meant we were able to

454 confirm that specific genes are consistently differentially expressed, with respect to

455 caste, across the species. These patterns would be difficult to detect if only looking

456 at a few species within a lineage, or species across several lineages, or species

457 representing only a limited selection of social complexities. In addition to typical

458 caste-biased molecular processes, we also identified that genes related to synaptic

459 vesicles are different between castes; this is interesting as the regulation of synaptic

460 vesicles affects learning and memory in insects52. To our knowledge, this is the first

461 evidence of what may be a conserved genetic toolkit for sociality, from the simplest

462 stages of social living through to true superorganismality, including intermediate

463 stages of complexity in wasps. Greater taxonomic sampling will help recover the full

464 spectrum of genes involved that are regulated across the major transition and may

465 have been important in the evolution of sociality.

466 The underlying assumption, based on the conserved toolkit hypothesis, has

467 been that whatever processes regulate castes in complex societies must also

468 regulate castes in simpler societies6,23,53. Unexpectedly, our analyses suggest that

469 different molecular processes underpinning castes come into play at the more

470 complex levels of sociality compared to species with more simple societies. The

471 predictive gene set identified in the SVM trained on more complex species

472 performed less well in classifying caste than the predictive toolkit derived from the

473 simpler species. There may be fundamental differences discriminating (near)

474 superorganismal societies from non-superorganismal societies54. This highlights the

475 importance of examining different stages in the major transition when attempting to

476 elucidate its patterns and processes8,10,33.

477 Vespinae and Polistinae display a wide diversity in traits other than social

478 complexity. One of these in the species we sampled was the mode of colony

479 foundation. Some species start new colonies alone, or with a small group of

480 reproductively potent females (independent founders) whilst others form a new

481 colony as a group of queen(s) and workers (swarm founders). Among the Polistinae,

482 mode of colony founding has a phylogenetic basis, with the Polistini and

483 Mischocyttarini tribes being independent founders, whilst the Epiponini tribe are all

484 swarm founders (see Fig. 1)55,56. Within the Polistinae, therefore, there is a tendency

485 to consider swarm founding as a trait that is associated with social complexity.

486 However, the Vespinae sampled in this study are also independent founders, and yet

487 display the most complex forms of sociality. As such, our sampling afforded the

488 opportunity to determine whether this important life-history trait influenced brain

489 transcription among social phenotypes, independently of the form of social

490 complexity. Consistent with the idea that independent founding is an ancestral trait,

491 we detected additional brain transcriptomic processes at play among castes of the

492 swarm founders, that did not distinguish castes among the independent

493 founders. Clearly, wider taxonomic sampling is required to explore this further, but

494 these analyses bring into focus how the transcriptomic basis of castes is not a simple

495 reflection of social complexity; ecology and life-history deserve more attention in

496 explaining variation in the molecular bases of social phenotypes57. This is likely to be

497 especially important when seeking to compare patterns of caste transcription across

498 lineages with contrasting life-histories and ecologies.

499 Phenotypic evolution is often accompanied by increased rates of gene

500 evolution. Indeed, high rates of protein evolution have been detected in worker-

501 biased genes of more complex societies like the honeybee, and some ants16–19. This

502 makes sense as the worker caste (with altruism and partial/complete sterility) is the

503 dominant phenotypic innovation in social evolution and has led to the suggestion that

504 rapid gene evolution is expected in the most complex forms of sociality, where

505 phenotypic specialisations are most pronounced17,20. Our dataset afforded a first test

506 of this idea in the wasps. We were surprised to find that the highest rates of gene

507 evolution were detected in the wasp species with the simpler societies rather than

508 the more complex societies; these results are in stark contrast with the findings of a

509 recent study on a range of social complexity in bees8 where (in keeping with

510 previous studies), rapid gene evolution was most apparent among species with the

511 most complex societies. There are several interpretations of these contrasting

512 results. First, the level of social complexity displayed by the simplest species in Shell

513 et al. 2021 represents a much more rudimentary form of sociality compared to the

514 simplest societies in our study, including four solitary species, one facultatively social

515 species (e.g. females of Ceratina australensis can nest either alone or in a small

516 group) and some barely exhibiting any reproductive division of labour (e.g. Ceratina

517 calcarata). Facultative sociality is rare in the vespid wasps, being largely limited to a

518 few species of Stenogastrinae; the simplest Polistinae are represented in our sample

519 and all are obligately social (meaning they always nest in a group). It is possible

520 therefore, that the rapid gene evolution occurs only when a species is committed to

521 group living, with a clear division of reproductive labour. An alternative explanation

522 lies in ecology and life-history: bees and wasps differ fundamentally in how they

523 provision brood, with wasps hunting prey and bees collecting pollen. These findings

524 add to the complexities of explaining the molecular basis of social evolution, and

525 highlight a greater need to emphasise and account for variation in natural history,

526 ecology and life-history as well as social behaviour per se10,24,33,54.

527 Despite being able to extract consistent SVM predictions, our models are only

528 as good as the initial data used to train them. Our study suffers from a few

529 limitations. Firstly, the sample size (number of species) is good for a sociogenomics

530 study, but relatively low for machine learning studies; SVMs are generally used on

531 very large datasets such as clinical trials in the medical sciences58. Although our

532 models did perform well, the analyses would be more robust by using more species

533 in the training datasets. Indeed, we observed reduced performance in our model

534 predictions when fewer species were included in the training set. Secondly, by

535 comparing across multiple species, we can only train our model on genes that have

536 a single representative isoform per species in each separate test. This reduces the

537 numbers of genes we can test in each SVM model, especially where more distantly

538 related species are included. We overcame this limitation by merging gene isoforms

539 within the same orthogroup (potential gene duplications); yet this comes with some

540 additional costs as some genes are discarded in this process. Finally, genomes are

541 not available for most of the species we tested; our measurements are based on de

542 novo sequenced transcriptomes, which potentially contain misassembled transcripts,

543 which could reduce the ability to find single copy orthologs across species. For these

544 reasons, the number of genes detected in our putative toolkit for sociality is likely to

545 be conservative and modest (potentially by several fold). These limitations are likely

546 to apply to many similar studies, due to the difficulty and expense of obtaining high

547 quality genomic data for specific phenotypes for non-model organisms. Our study

548 illustrates the power of SVMs in detecting large suites of genes with small effects,

549 which largely differ from those identified from conventional differential expression

550 analysis43. We advocate the use of the two methods in parallel: our conventional

551 analyses suggested that metabolic genes appear to be responsible for the

552 differences between castes, whereas the SVM genes were mostly enriched in neural

553 vesicle transportation genes, which have not previously been connected with caste

554 evolution. SVMs may therefore reveal new target genes involved in the evolution

555 of sociality. We anticipate that machine learning approaches, as demonstrated here,

556 may become a useful tool in a wide range of ecological and evolutionary studies on

557 the molecular basis of phenotypic diversity.

558

559 In conclusion, our analyses of brain transcriptomes for castes of social wasps

560 suggest that the molecular processes underpinning sociality are conserved

561 throughout the major transition to superorganismality. However, different (putatively

562 additional) processes may come into play in more complex societies. There may be

563 fundamental differences in the molecular machinery that discriminates key points of

564 innovation in the major transition; for example, the evolution of irreversible caste

565 commitment (in superorganisms) may require a fundamental shift in the underlying

566 regulatory molecular machinery. Such shifts may be apparent in the evolution of

567 sociality at other levels of biological organisation, such as the evolution of

568 multicellularity, taking us a step closer to determining whether there is a unified

569 process underpinning the major transitions in evolution.

570

571 Acknowledgements

572 We would like to thank Laura Butters for her help with the morphometric analyses of

573 Brachygastra, Robin Southon, Daniel Fabbro, Liam Crowley and Sam Morris for their

574 assistance in the field, James Carpenter at the American Museum of Natural History

575 and Christopher K. Starr for confirming species identification, and the Bristol

576 Genomics Facility for their assistance with library preparations and sequencing. This

577 work was conducted under collection and export permits for Trinidad (Forestry

578 Division Ministry of Agriculture: #001162) and Panama (Autoridad nacional del

579 ambiente (ANAM) SE/A-55-13). It was funded by the Natural Environment Research

580 Council (NE/M012913/2; NE/K011316/1) awarded to SS, and a NERC studentship

581 and Smithsonian Tropical Research Institute pre-doctoral fellowship awarded to

582 E.F.B.

583 Methods

584 Study Species

585 Nine species of vespid wasps were chosen to represent different levels of social

586 complexity across the major transition (Fig. 1). The simplest societies in our study

587 are represented by Mischocyttarus basimacula basimacula (Cameron) and Polistes

588 canadensis (Linnaeus); wasps in these two genera are all independent nest founders

589 and lack morphological castes (defined as allometric differences in body shape,

590 rather than overall size) or any documented form of life-time caste-role

591 commitment59–62. They live in small family groups of reproductively totipotent

592 females, one of whom usually dominates reproduction (the queen); if the queen dies

593 she is succeeded by a previously-working individual21. As such, these societies

594 represent some of the earliest stages in the major transition, where caste roles are

595 least well defined, and where individual-level plasticity is advantageous for

596 maximising inclusive fitness.

597 The Neotropical swarm-founding wasps (Hymenoptera: Vespidae; Epiponini) include

598 over 20 genera with at least 229 species, exhibiting a range of social complexity

599 measures, from complete absence of morphological caste (pre-imaginal)

600 determination to colony-stage specific morphological differentiation, through to

601 permanent morphological queen-worker differentiation56. As examples of species for

602 which there is little or no evidence of developmental (morphological) caste

603 determination, we chose Angiopolybia pallens (Lepeletier) which is phylogenetically

604 basal in the Epiponines55,63 and Metapolybia cingulata (Fabricius)55,63. We confirmed

605 the lack of clear caste allometric differences in M. cingulata as data were lacking

606 (see Morphometrics methods (below) and Supplementary Table S8).

607 As examples of species showing subtle, colony-stage-specific caste allometry, we

608 chose a species of Polybia. The social organisation of Polybia spp is highly variable,

609 ranging from complete absence of morphological queen-worker differentiation64.

610 Polybia quadricincta (Saussure) is a relatively rare (and little studied) epiponine

611 wasp which can be found across , , Colombia, , Guyana,

612 , Suriname and Trinidad (Richards, 1978). Our morphometric analyses found

613 some evidence of subtle allometric morphological differentiation in this species, but

614 with variation through the colony cycle (Supplementary Table S8); this suggests it is a

615 representative species for the evolution of the first signs of pre-imaginal caste

616 differentiation.

617 Many species of the genera Agelaia and Brachygastra appear to show pre-imaginal

618 caste determination with allometric morphological differences between adult queens

619 and workers55. We chose one species from each of these genera as representatives

620 of the most socially complex Polistine wasps. Although no morphological data were

621 available for Agelaia cajennensis (Fabricius) all species of Agelaia studied show

622 some level of preimaginal caste determination55,65. Brachygastra exhibit a diversity of

623 caste differentiation55,66; our morphological analysis of caste differentiation in B.

624 mellifica (Say) confirms that this species is highly socially complex, with large colony

625 sizes55 and pre-imaginal caste determination resulting in allometric caste differences

626 (Supplementary Table S8).

627 All species of Vespines are independent nest founders and superorganisms, with a

628 single mated queen establishing a new colony alone and with morphological castes

629 that are determined during development. However, some species exhibit derived

630 superorganismal traits, such as multiple mating67, which have likely evolved under

631 different selection pressures to the major transition itself68. The European hornet,

632 Vespa crabro (Linnaeus), exhibits the hallmarks of superorganismality (see Fig. 1)

633 but little evidence of more derived superorganismal traits, such as high levels of

634 multiple mating. Conversely, multiple mating is common in Vespula species,

635 including V. vulgaris (Linnaeus) with larger colony sizes than Vespa67, suggesting a

636 more complex level of social organisation.

637

638 Sample collection

639 Where possible, we sampled from colonies representing different stages in the

640 colony cycle, as caste differentiation can vary as the colony matures in some species

641 (Supplementary Table S1). Metapolybia cingulata (6 colonies), Polistes canadensis (3

642 colonies), Agelaia cajennensis (1 colony) and Mischocyttarus

643 basimacula basimacula (3 colonies) were collected from wild populations in Panama

644 in June 2013. Brachygastra mellifica (4 colonies) were collected from populations in

645 Texas, USA in June 2013. Angiopolybia pallens (2 colonies) and Polybia

646 quadricincta (2 colonies) were collected from Arima Valley, Trinidad in July 2015.

647 Vespa crabro (4 colonies) and Vespula vulgaris (4 colonies) were collected from

648 various locations in South East England, UK in 2017. Queens and workers were

649 collected directly from their nests during the daytime, placed immediately into

650 RNAlater (Ambion, Invitrogen) and stored at -20°C until further use. Samples were

651 ultimately pooled within castes for all analyses, such that each pool consisted of 3-6

652 individual brains from wasps sampled across 2-4 colonies to capture individual-level

653 and colony-level variation in gene expression (see Supplementary Table S1).

654 Samples of M. cingulata, A. cajennensis, M. basimacula and B. mellifica were sent to

655 James Carpenter at the American Museum of Natural History for species verification.

656 A. pallens and P. quadricincta were identified by Christopher K. Starr, at University of

657 West Indies, Trinidad and Tobago.

658 Morphometrics

659 Data on morphological differentiation among colony members (and thus information

660 on whether pre-imaginal (developmental) caste determination was present) was

661 lacking for M. cingulata, P. quadricincta and B. mellifica; therefore, we conducted

662 morphometric analyses on these three species in order to ascertain the level of

663 social complexity. Morphometric analyses were carried out using GXCAM-1.3 and

664 GXCapture V8.0 (GT Vision) to provide images for assessing morphology. We

665 measured 7 morphological characters using ImageJ v1.49 for queens and workers

666 for each species. The body parts measured were: head length (HL), head width

667 (HW), minimum interorbital distance (MID), mesoscutum length (MSL), mesoscutum

668 width (MSW), mesosoma height (MSH) and alitrunk length (AL) (for measurement

669 details, see 69). Abdominal measurements were not recorded as ovary development

670 could alter the size of abdominal measurements, therefore biasing the results. The

671 morphological data were analysed to determine whether the phenotypic

672 classification, as determined from reproductive status, could be explained by

673 morphological differences. ANOVA was used to determine size differences between

674 castes for each morphological characteristic. A linear discriminant analysis was also

675 employed to see if combinations of characters were helpful in discriminating between

676 castes. The significance of Wilks’ lambda values was tested to determine which

677 morphological characters were the most important for caste prediction. All statistical

678 analyses were carried out using SPSS v23.0 or XLSTAT 2018. Data and analyses

679 are given in Supplementary Table S8.

680

681 Dissections & RNA extractions

682 Individual heads were stored in RNAlater for brain dissections; abdomens were

683 removed and dissected to determine reproductive status. Ovary development was

684 scored according to 33,54 and the presence/absence of sperm in the spermathecae

685 was identified to determine insemination. Inseminated females with developed

686 ovaries were scored as ‘queens’; non-inseminated females with undeveloped ovaries

687 were scored as workers. Brains were dissected directly into RNAlater; RNA was

688 extracted from individuals and then pooled after extraction into caste-specific pools;

689 pooling after RNA extraction allowed for elimination of any samples with low quality

690 RNA. Pooling individuals was generally necessary to ensure sufficient RNA for

691 analyses, as well as accounting for individual variation to ensure expression

692 differences are due to caste or species, and not dependent on colony or random

693 differences between individuals. One exception to this was the V. vulgaris samples

694 which were sequenced as individual brains and pooled bioinformatically after

695 sequencing. Individual sample sizes per species are given in Supplementary Table

696 S1.

697

698 Total RNA was extracted using the RNeasy Universal Plus Mini kit (Qiagen,

699 #73404), according to the manufacturer’s instructions, with an extra freeze-thawing

700 step after homogenization to ensure complete lysis of tissue, as well as an additional

701 elution step to increase RNA concentration. RNA yield was determined using a

702 NanoDrop ND-8000 (Thermo Fisher Scientific); all samples showed A260/A280

703 values between 1.9 and 2.1. An Agilent 2100 Bioanalyser was used to determine

704 RNA integrity. Samples of sufficient quality and concentration were pooled and sent

705 for sequencing. Libraries were prepared using Illumina TruSeq RNA-seq sample

706 prep kit at the University of Bristol Genomics Facility. Five samples were pooled per

707 lane to give ~50M reads per sample. Paired-end libraries were sequenced using an

708 Illumina HiSeq 2000. Descriptions of pooling of individuals and pooled sets into

709 single representatives of caste are shown in Supplementary Table S1. Raw reads

710 are available on SRA/GEO (GSE159973).

711

712 Preparation of de novo transcriptomes

713 Transcriptomes of Agelaia, Angiopolybia, Metapolybia, Brachygastra, Polybia,

714 Polistes and Mischocyttarus were assembled using the following steps. First, reads

715 were filtered for rRNA contaminants using tools from the BBTools

716 (version:BBMap_38) software suite (https://jgi.doe.gov/data-and-tools/bbtools/). We

717 then used Trimmomatic v0.3970 to trim reads containing adapters and low-quality

718 regions. Using these filtered RNA sequences, we could assemble a de novo

719 transcriptome for each species (merging queen and worker samples) using Trinity

720 v2.871 and filter protein-coding genes to retain a single transcript (the most highly expressed)

721 for each gene, along with its transcripts per million (TPM) value, which we use for the rest of the

722 analyses.

723 For Vespula and Vespa, reads from both queen and worker samples were

724 assembled into de novo transcriptomes using a Nextflow pipeline

725 (github:biocorecrg/transcriptome_assembly). This involved read adapter trimming

726 with Skewer72, de-novo transcriptome assembly with Trinity v2.8.471 and use of

727 TransDecoder v5.5.071 to identify likely protein-coding transcripts, and retain all

728 translated transcripts. These were further filtered to retain the largest open reading

729 frame-containing transcript, which we listed as the major isoform of each protein.

730 Trinity assembly statistics are shown in Supplementary Table S2.

731

732 Identification of orthologs

733 To identify gene-level orthologs, we used Orthofinder v.2.2.736 with diamond blast73,

734 multiple sequence alignment program MSA74 and tree inference using FastTree

735 v2.1.1075, for our nine species (Supplementary Fig. 1). The largest spliced isoform

736 per gene (from Trinity) was designated the representative sequence for each gene.

737 In addition for specific iterations, we allowed the existence of multiple gene isoforms

738 belonging to the same species in a single orthogroup (potential duplications); using

739 up to 3 gene isoforms, only keeping the gene most highly expressed. This strategy

740 allows us to test a greater number of orthogroups, where a single duplication in one

741 species would otherwise remove an entire orthogroup from the study. The drawback

742 is that some gene isoforms are removed from the study that could have been

743 informative. Similarly, we were able to include orthogroups that had missing gene

744 information (NAs) for up to two species, filling in these missing genes with read

745 counts of 10 for both queen and worker, thereby allowing us to test orthogroups

746 where some species data are missing. Missing gene orthologs in a single species

747 could be due to a true absence (likely gene loss) in a species, or due to assembly

748 weaknesses, where the use of non-genome guided RNA-Seq assemblies from just

749 brain tissue, could lead to missing orthologs. Scripts to perform these steps are

750 included in the Github (https://github.com/Sumner-

751 lab/Multispecies_paper_ML/tree/master/Scripts_to_improve_orthogroups).

752 Orthofinder was also used to create the phylogenetic tree in Supplementary Figure

753 1, using the single copy orthogroups only and with the addition of Apis mellifera

754 (GCF_000002195.4), Nomia melanderi (GCA_003710045.1), Dinoponera

755 quadriceps (GCA_001313825.1) and Solenopsis invicta (GCF_016802725.1).
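The isoform-merging and missing-species rules described in this section can be sketched as follows. This is a minimal illustration and not the authors' scripts (which are in the linked repository): the function name and data layout are hypothetical, and ranking copies by summed queen and worker counts is an assumption, since the text only says the most highly expressed gene is kept.

```python
MAX_COPIES = 3    # up to 3 gene isoforms per species in an orthogroup (from the text)
MAX_MISSING = 2   # an orthogroup may lack up to two species (from the text)
FILL_COUNT = 10   # read count used for a missing queen and worker value (from the text)


def resolve_orthogroup(members, species_list):
    """Reduce one orthogroup to a single entry per species.

    members: {species: [(gene_id, queen_count, worker_count), ...]}
    Returns {species: (gene_id, queen_count, worker_count)}, or None when the
    orthogroup has too many copies in a species or too many missing species.
    """
    resolved = {}
    missing = 0
    for species in species_list:
        copies = members.get(species, [])
        if len(copies) > MAX_COPIES:
            return None
        if not copies:
            missing += 1
            resolved[species] = (None, FILL_COUNT, FILL_COUNT)
        else:
            # Assumption: "most highly expressed" = largest queen + worker total.
            resolved[species] = max(copies, key=lambda c: c[1] + c[2])
    if missing > MAX_MISSING:
        return None
    return resolved


# Hypothetical three-species example with one duplicated and one missing gene.
example = {"A": [("a1", 5, 7), ("a2", 50, 40)], "B": [("b1", 20, 30)]}
kept = resolve_orthogroup(example, ["A", "B", "C"])
```

In this example the orthogroup is retained: species A keeps its more highly expressed copy (a2), and species C, which has no gene, is filled with counts of 10 for both castes.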

756

757 Gene expression and differential expression

758 We calculated abundances of transcripts within queen and worker samples using

759 “align_and_estimate_abundance.pl” within Trinity, using estimation method RSEM v1.3.174,

760 “trinity_mode” and bowtie276 aligner. We then used edgeR v3.26.5 77 (R version 3.6.0) to

761 compare gene expression between queens and workers. Given that we were comparing a

762 single sequencing pool of several individuals per caste, we used a hard-coded dispersion of

763 0.1 and the robust parameter set to true to account for n = 1. Raw P values for each gene

764 were corrected for multiple testing using a false discovery rate (FDR) cut-off value of 0.05.

765 We did not take advantage of genome data (where available), as only two of the species had

766 published genomes at the time of analysis; using transcriptome-only analyses makes the

767 analysis more consistent across species. Trinity assemblies and RSEM counts are available

768 on GEO/SRA (GSE159973). To compare differential expression across the nine species

769 (Figure 3), we used the orthology list based on three splice isoforms represented as the

770 most expressed and using only orthogroups with a single entry in all nine species after the

771 merging of three isoforms.
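As an illustration of the multiple-testing step, the sketch below applies a false discovery rate correction to a list of raw P values and flags genes below the 0.05 cut-off. The text does not name the procedure; Benjamini-Hochberg is assumed here because it is the default adjustment in edgeR's topTags. The function name and the P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the original gene order."""
    m = len(pvalues)
    # Gene indices ordered from the smallest to the largest raw P value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


FDR_CUTOFF = 0.05  # cut-off stated in the text

raw_p = [0.001, 0.02, 0.03, 0.8]  # made-up P values for four genes
fdr = benjamini_hochberg(raw_p)
significant = [q < FDR_CUTOFF for q in fdr]
```

With these placeholder values the first three genes pass the cut-off and the fourth does not.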

772

773 Building a normalised gene expression matrix

774 Focusing on our set of one-to-one orthologs (merging 3 or fewer isoforms per species),

775 we began by computing log transformed TPMs (transcripts per million reads) for

776 each gene in each sample from the raw counts, followed by quantile normalisation.

777 Next, we normalised for species, using an approach that is comparable to calculating

778 a species-specific z-score for each sample. Specifically, we transformed the

779 expression scores calculated above by subtracting the species mean and dividing by

780 the species mean for each sample within a species. This calculation has two

781 important effects. Firstly, subtracting the species mean from each sample within a

782 species centres the mean expression of each species on zero, making the units of

783 expression more comparable across species. Secondly, dividing by the species

784 mean from each sample standardises the expression scores, producing a measure

785 that is independent of the units of measurement, so that the magnitude of difference

786 between queens and workers in each sample is no longer important. The

787 transformed expression score thus allows us to focus on the relative expression in

788 queens versus workers across species. Finally, we removed orthogroups where the

789 counts per million were below 10 (with some deviations in specific iterations; listed

790 with run plots) in both Queen and Worker samples of each species, to remove lowly

791 expressed genes that may contribute noise to subsequent analyses. We then

792 performed principal component analysis (PCA) in R on the raw TPM values and

793 those with the normalisation steps outlined above (see: https://github.com/Sumner-

794 lab/Multispecies_paper_ML/blob/master/Scripts_to_run_svm/template_scripts/getEx

795 pressionAllOrthofinder.noNameReplace.R for full R code of this section).
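The three transformations described in this section (log transformation, quantile normalisation and species scaling) can be sketched as below. This is a simplified illustration, not the R code linked above. Several details are assumptions: log2 with a pseudocount of 1, ties in quantile normalisation broken by gene order, and the species mean taken per gene over that species' queen and worker samples. The scaling divides by the species mean as the text states; a conventional z-score would divide by the standard deviation instead. All function names are hypothetical.

```python
import math


def log_tpm(tpm, pseudocount=1.0):
    # Assumption: log2(TPM + 1); the text only says "log transformed".
    return math.log2(tpm + pseudocount)


def quantile_normalise(columns):
    """columns: {sample: [value per gene]}, all lists of equal length.

    Each sample's k-th smallest value is replaced by the mean of the k-th
    smallest values across all samples.
    """
    samples = list(columns)
    n_genes = len(columns[samples[0]])
    sorted_cols = [sorted(columns[s]) for s in samples]
    rank_means = [sum(col[k] for col in sorted_cols) / len(samples)
                  for k in range(n_genes)]
    out = {}
    for s in samples:
        order = sorted(range(n_genes), key=lambda g: columns[s][g])
        normalised = [0.0] * n_genes
        for k, g in enumerate(order):
            normalised[g] = rank_means[k]
        out[s] = normalised
    return out


def species_scale(queen, worker):
    """Scale one gene within one species: subtract the species mean, then
    divide by the species mean (as described in the text)."""
    mean = (queen + worker) / 2.0
    if mean == 0:
        return 0.0, 0.0  # guard against division by zero for unexpressed genes
    return (queen - mean) / mean, (worker - mean) / mean


# Toy example: three genes, one queen and one worker sample (made-up values).
normalised = quantile_normalise({"queen": [1.0, 3.0, 2.0], "worker": [4.0, 6.0, 8.0]})
scaled = species_scale(6.0, 2.0)  # (0.5, -0.5)
```

After scaling, a gene's queen and worker values within a species are symmetric around zero, which is what makes the relative direction of caste bias comparable across species.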

796

797 Machine learning (support vector machines)

Support vector machine (SVM) classification was used to classify caste across the species. In brief, this analysis involved taking species-scaled, logged and normalised gene expression data from the nine species, filtering out the lowly expressed genes, splitting the data into training and testing datasets (8 training species, 1 testing species), creating an SVM classifier that best separates caste in the training species (finding the best gamma and C tuning variables), and then testing the classifier on the one species left out. The code to run these steps is available on GitHub (https://github.com/Sumner-lab/Multispecies_paper_ML), and is executed in Perl using subsidiary R scripts. The full details of these steps are outlined below.

The first step is to normalise the expression matrix, given that the SVM classifier expects fully normalised and scaled data. This has been outlined in the previous section, and involves orthologous TPMs being logged, quantile-normalised and species-scaled, as well as filtering of lowly expressed genes. Second, data were split into test and train sets, so each species was addressed in turn, with queen and worker samples registered based on sample names (e.g. caste <- grepl("Worker ", colnames(data.train))), and converted into a binary format. Third, an SVM classifier was tuned using the train data (8 of 9 species), using the R package e1071⁷⁸, with optional grid search ranges of gamma = 10^(-8:-3) and cost = 2^(1:10). These settings allow the classifier to explore the best gamma and cost values to use in the model. Fourth, based on the classifier and the best model it selected (always the radial model, with variable C and gamma values), we could predict the likelihood of each sample from the ninth (left-out) species being a queen or a worker. To explore which kernel was most appropriate, we ran the script using radial, polynomial, linear and sigmoidal kernels (Supplementary Figure 2), which showed consistent changes in prediction certainty among the radial, polynomial and sigmoidal kernels but more erratic predictions with the linear kernel. We decided to continue with the radial kernel for the rest of our predictions, as it has decent prediction accuracy once the dataset is filtered, and shows progressive improvement across the filtering spectrum (Supplementary Figure 2; top left), unlike the linear kernel (bottom left).
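The train/test design and tuning grid described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Perl/R pipeline: the function names are hypothetical, the species abbreviations are those used in the Supplementary Figure 2 labels, and the SVM fitting itself (done by the authors with e1071) is deliberately left out.

```python
from itertools import product

SPECIES = ["M_bas", "P_can", "M_cin", "A_pal", "P_qua",
           "A_caj", "B_mel", "V_cra", "V_vul"]

def leave_one_species_out(species):
    """Yield (training species, held-out species) pairs, one per species."""
    for held_out in species:
        train = [s for s in species if s != held_out]
        yield train, held_out

def is_worker(sample_names):
    """Binary caste flags from sample names, mirroring the grepl example."""
    return ["Worker" in name for name in sample_names]

# Tuning grid quoted in the text: gamma = 10^(-8:-3), cost = 2^(1:10).
gammas = [10.0 ** e for e in range(-8, -2)]
costs = [2.0 ** e for e in range(1, 11)]
grid = list(product(gammas, costs))

splits = list(leave_one_species_out(SPECIES))
```

Each of the nine splits trains on eight species and holds out one, and the grid contains 6 x 10 = 60 candidate (gamma, cost) pairs to evaluate per split.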

Given that the majority of genes are unlikely to contribute to caste differentiation, on top of the basic SVM test we performed feature selection, which is commonly used in machine learning to increase the power of the classifier⁷⁹. For this we used linear regression of each gene on caste (lm(caste ~ expr, data)), using the training data only. With regression beta coefficients per orthologous gene, we could then rank genes by their statistical association with caste (Supp. Table 4, using the absolute values of the regression coefficients). This enabled us to measure how the classification certainty changed as we filtered out genes statistically unassociated with reproductive division of labour (Figure 4a).
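The regression-based feature ranking can be sketched as follows. This is a minimal illustration in Python of the idea behind lm(caste ~ expr), not the authors' R script: the function names, orthogroup names and toy values are hypothetical, and caste is coded as 1 for one caste and 0 for the other (the ranking by absolute coefficient does not depend on which caste is coded as 1).

```python
def slope(x, y):
    """Least-squares slope of y regressed on x (simple linear regression)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return 0.0  # a constant gene carries no information about caste
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def rank_genes(expr, caste):
    """Rank genes by |beta| from regressing caste on each gene's expression.

    expr maps gene -> expression values across the training samples;
    caste holds the matching binary labels.
    """
    betas = {gene: slope(values, caste) for gene, values in expr.items()}
    return sorted(betas, key=lambda g: abs(betas[g]), reverse=True)

def keep_top(ranked, percent):
    """Keep the top `percent` of ranked genes (at least one gene)."""
    k = max(1, round(len(ranked) * percent / 100))
    return ranked[:k]

# Hypothetical training data: four samples with binary caste labels.
caste = [1, 0, 1, 0]
expr = {
    "OG_strong": [1.0, -1.0, 1.0, -1.0],
    "OG_weak": [1.0, 1.0, 1.0, -1.0],
    "OG_flat": [0.5, 0.5, 0.5, 0.5],
}
ranked = rank_genes(expr, caste)
top = keep_top(ranked, 34)
```

Repeating `keep_top` over a series of decreasing percentages and retraining the classifier each time gives the progressive filtering that the text relates to classification certainty.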

A classification certainty of 0.5 would indicate that the SVM could not tell the difference between the two castes (maximal uncertainty); a classification certainty of 0 would indicate maximal certainty of being a worker, and 1 maximal certainty of being a queen.

dN/dS scores

Using the single-copy orthologous genes for the nine species, we aligned nucleotide sequences for each of the 1,971 orthogroups (single-copy, allowing for 3 isoforms) that successfully went through PRANK (version v.151120⁸⁰). For each species and for each orthogroup, we then calculated dN/dS ratios (omega) with PAML codeml (version 4.8⁸¹) in branch models (Supplementary Table S7 [ST2-ST16] and Supplementary Figure 3), comparing a null hypothesis model (model = 0, NSsites = 0, tree file without foreground) with an alternative hypothesis model (model = 2, NSsites = 2, tree file with some taxa as foreground), as well as in branch-site models (ST17-ST26), comparing a null hypothesis model (omega fixed at 1, model = 2, NSsites = 2, tree file with one species branch as foreground) with an alternative hypothesis model (omega not fixed, all other parameters set as in the null model). We compared the log likelihoods of the null and alternative models using a chi-square test, and report p-values adjusted with the Benjamini and Hochberg method⁸² in R (version 3.6.3).
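The model comparison and multiple-testing adjustment can be sketched as follows. This is a minimal illustration in Python, not the authors' R code: the function names and log-likelihood values are hypothetical, and the use of one degree of freedom for the chi-square test is an assumption, since the text does not state the degrees of freedom used.

```python
from math import erfc, sqrt

def lrt_pvalue(lnl_null, lnl_alt):
    """Chi-square p-value (assumed 1 degree of freedom) for a likelihood ratio test."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 d.f.: P(X > stat) = erfc(sqrt(stat / 2))
    return erfc(sqrt(stat / 2.0))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical (null, alternative) log likelihoods for three orthogroups.
pairs = [(-1000.0, -998.0), (-2500.0, -2499.9), (-800.0, -792.0)]
pvals = [lrt_pvalue(null, alt) for null, alt in pairs]
adjusted = benjamini_hochberg(pvals)
```

The adjustment follows the same step-up logic as R's p.adjust(method = "BH"): each p-value is scaled by n divided by its rank, and monotonicity is enforced from the largest p-value downwards.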

GO enrichment and BLAST

To perform GO enrichment tests, we used the R package topGO v2.42.0⁸³, using a Bonferroni-corrected P value cut-off of <=0.05. In order to assign gene ontology terms to genes in our new species, we used our Orthofinder homology table with annotations to Drosophila melanogaster (downloaded from Ensembl Biomart, 1.10.2019). Within species, we calculated enrichment of each species’ genes against a background of all the genes expressed above a mean of 1 TPM. When comparing across the orthogroups (OG), we used Metapolybia GO annotations (derived from homology to Drosophila), with a background of those genes that have a mean >1 TPM in all species’ orthologues. GO comparisons were similar when using other species as the database of gene-to-GO-term mappings.
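The logic of an over-representation test against a background gene set can be sketched as follows. This is a minimal illustration in Python and not topGO's implementation: it assumes a plain one-tailed hypergeometric test per GO term followed by Bonferroni correction, and the function names and counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, n_selected, n_annotated, n_background):
    """P(X >= k) when n_selected genes are drawn from n_background genes,
    of which n_annotated carry the GO term."""
    upper = min(n_selected, n_annotated)
    total = comb(n_background, n_selected)
    hits = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_selected - i)
        for i in range(k, upper + 1)
    )
    return hits / total

def bonferroni(pvalues):
    """Bonferroni-adjusted p-values, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

# Hypothetical counts: 5 of 10 selected genes carry a term that annotates
# 10 of the 100 background genes (those expressed above the TPM threshold).
p_term = hypergeom_upper_tail(5, 10, 10, 100)
adjusted = bonferroni([p_term, 0.5, 0.02])
```

The choice of background matters: restricting it to expressed genes, as described in the text, avoids calling a term enriched merely because its genes are expressed in the tissue sampled.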

Blast2GO v1.4.4⁸⁴ with default settings was used to annotate Metapolybia genes.

Abbreviations

DOL = division of labour
ORF = open reading frame
MT = major transition
GO = gene ontology
SRA = sequence read archive
NCBI = National Center for Biotechnology Information
BUSCO = Benchmarking Universal Single-Copy Orthologs
SVM = support vector machine
PCA = principal component analysis
BLAST = Basic Local Alignment Search Tool
nr = non-redundant

Authors’ contributions

SS conceived the study and supervised the project; SS, EL, EB and BT collected the samples; DT, EB, BT and RB performed molecular lab work; DT & RB carried out the morphological work; EF, MB & CW executed the bioinformatics pipelines and performed the statistical analyses; CW & SS drafted the manuscript, with input from all authors.

References

1. Szathmáry, E. & Maynard Smith, J. The major evolutionary transitions. Nature 374, 227–232 (1995).
2. Kennedy, P. et al. Deconstructing Superorganisms and Societies to Address Big Questions in Biology. Trends in Ecology and Evolution 32, 861–872 (2017).
3. Toth, A. L. & Robinson, G. E. Evo-devo and the evolution of social behavior. Trends Genet. 23, 334–341 (2007).
4. Berens, A. J., Hunt, J. H. & Toth, A. L. Comparative Transcriptomics of Convergent Evolution: Different Genes but Conserved Pathways Underlie Caste Phenotypes across Lineages of Eusocial Insects. Mol. Biol. Evol. 32, 690–703 (2014).
5. Qiu, B. et al. Towards reconstructing the ancestral brain gene-network regulating caste differentiation in ants. Nat. Ecol. Evol. 2, 1782 (2018).
6. Warner, M. R., Qiu, L., Holmes, M. J., Mikheyev, A. S. & Linksvayer, T. A. Convergent eusocial evolution is based on a shared reproductive groundplan plus lineage-specific plastic genes. Nat. Commun. 10, 1–11 (2019).
7. Patalano, S. et al. Molecular signatures of plastic phenotypes in two eusocial insect species with simple societies. Proc. Natl. Acad. Sci. U. S. A. 112, 13970–5 (2015).
8. Shell, W. A. et al. Sociality sculpts similar patterns of molecular evolution in two independently evolved lineages of eusocial bees. Commun. Biol. 4, 1–9 (2021).
9. West-Eberhard, M. J. Wasp societies as microcosms for the study of development and evolution. in Natural history and evolution of paper wasps (eds. Turillazzi, S. & West-Eberhard, M. J.) 290–317 (Oxford University Press, 1996).
10. Toth, A. L. & Rehan, S. M. Molecular Evolution of Insect Sociality: An Eco-Evo-Devo Perspective. Annual Review of Entomology 62, 419–442 (Annual Reviews, 2017).
11. Boomsma, J. J. Lifetime monogamy and the evolution of eusociality. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 364, 3191–207 (2009).
12. Boomsma, J. J. & Gawne, R. Superorganismality and caste differentiation as points of no return: how the major evolutionary transitions were lost in translation. Biol. Rev. (2017). doi:10.1111/brv.12330

13. Ferreira, P. G. et al. Transcriptome analyses of primitively eusocial wasps reveal novel insights into the evolution of sociality and the origin of alternative phenotypes. Genome Biol. 14, R20 (2013).
14. Sumner, S. The importance of genomic novelty in social evolution. Mol. Ecol. 23, (2014).
15. Feldmeyer, B., Elsner, D. & Foitzik, S. Gene expression patterns associated with caste and reproductive status in ants: Worker-specific genes are more derived than queen-specific ones. Mol. Ecol. 23, 151–161 (2014).
16. Johnson, B. R. & Tsutsui, N. D. Taxonomically restricted genes are associated with the evolution of sociality in the honey bee. BMC Genomics 12, 164 (2011).
17. Simola, D. F. et al. Social insect genomes exhibit dramatic evolution in gene composition and regulation while preserving regulatory features linked to sociality. Genome Res. 23, 1235–1247 (2013).
18. Harpur, B. A. et al. Population genomics of the honey bee reveals strong signatures of positive selection on worker traits. Proc. Natl. Acad. Sci. U. S. A. 111, 2614–9 (2014).
19. Rubin, B. E. R., Jones, B. M., Hunt, B. G. & Kocher, S. D. Rate variation in the evolution of non-coding DNA associated with social evolution in bees. Philos. Trans. R. Soc. B Biol. Sci. 374, (2019).
20. Kapheim, K. M. Genomic sources of phenotypic novelty in the evolution of eusociality in insects. Curr. Opin. Insect Sci. 1–9 (2015). doi:10.1016/j.cois.2015.10.009
21. Dogantzis, K. A. et al. Insects with similar social complexity show convergent patterns of adaptive molecular evolution. 1–8 (2018). doi:10.1038/s41598-018-28489-5
22. Taylor, B. A., Reuter, M. & Sumner, S. Patterns of reproductive differentiation and reproductive plasticity in the major evolutionary transition to superorganismality. Curr. Opin. Insect Sci. (2019).
23. Toth, A. L. & Rehan, S. Climbing the social ladder: The molecular evolution of sociality. (2015).
24. Branstetter, M. et al. Genomes of the Hymenoptera. Curr. Opin. Insect Sci. 25, 65–75 (2017).

25. Standage, D. S. et al. Genome, transcriptome and methylome sequencing of a primitively eusocial wasp reveal a greatly reduced DNA methylation system in a social insect. Mol. Ecol. 25, 1769–1784 (2016).
26. Toth, A. L. et al. Shared genes related to aggression, rather than chemical communication, are associated with reproductive dominance in paper wasps (Polistes metricus). BMC Genomics 15, 75 (2014).
27. Bluher, S. E., Miller, S. E. & Sheehan, M. J. Fine-scale population structure but limited genetic differentiation in a cooperatively breeding paper wasp. Genome Biol. Evol. (2020).
28. Rehan, S. M. et al. Conserved Genes Underlie Phenotypic Plasticity in an Incipiently Social Bee. Genome Biol. Evol. 10, 2749–2758 (2018).
29. Kocher, S. D. et al. The genetic basis of a social polymorphism in halictid bees. Nat. Commun. 9, (2018).
30. Shell, W. A. & Rehan, S. M. Social modularity: Conserved genes and regulatory elements underlie caste-antecedent behavioural states in an incipiently social bee. Proc. R. Soc. B Biol. Sci. (2019). doi:10.1098/rspb.2019.1815
31. Kapheim, K. M. et al. Developmental plasticity shapes social traits and selection in a facultatively eusocial bee. Proc. Natl. Acad. Sci. 202000344 (2020). doi:10.1073/pnas.2000344117
32. Linksvayer, T. A. & Johnson, B. R. Re-thinking the social ladder approach for elucidating the evolution and molecular basis of insect societies. Curr. Opin. Insect Sci. (2019). doi:10.1016/j.cois.2019.07.003
33. Taylor, D., Bentley, M. A. & Sumner, S. Social wasps as models to study the major evolutionary transition to superorganismality. Curr. Opin. Insect Sci. 28, 26–32 (2018).
34. Branstetter, M. G. et al. Phylogenomic Insights into the Evolution of Stinging Wasps and the Origins of Ants and Bees. Curr. Biol. 27, 1019–1025 (2017).
35. Ferreira, P. G. et al. Transcriptome analyses of primitively eusocial wasps reveal novel insights into the evolution of sociality and the origin of alternative phenotypes. Genome Biol. 14, R20 (2013).
36. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. (2015). doi:10.1186/s13059-015-0721-2

37. Oi, C. A. et al. Effects of juvenile hormone in fertility and fertility-signaling in workers of the common wasp Vespula vulgaris. PLoS One 16, 1–13 (2021).
38. Kelstrup, H., Hartfelder, K., Nascimento, F. S. & Riddiford, L. M. The role of juvenile hormone in dominance behavior, reproduction and cuticular pheromone signaling in the caste-flexible epiponine wasp, Synoeca surinama. Front. Zool. 11, (2014).
39. De Smet, F. et al. Balancing false positives and false negatives for the detection of differential expression in malignancies. Br. J. Cancer (2004). doi:10.1038/sj.bjc.6602140
40. Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. (2016). doi:10.1186/s13059-016-0888-1
41. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. ScPred: Accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. (2019). doi:10.1186/s13059-019-1862-5
42. Furey, T. S. et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics (2000). doi:10.1093/bioinformatics/16.10.906
43. Taylor, B. A., Cini, A., Wyatt, C. D. R., Reuter, M. & Sumner, S. The molecular basis of socially mediated phenotypic plasticity in a eusocial paper wasp. Nat. Commun. 2020.07.15.203943 (2020). doi:10.1101/2020.07.15.203943
44. Vizan, P., Di Croce, L. & Aranda, S. Functional and pathological roles of AHCY. Front. Cell Dev. Biol. 9, 654344 (2021).
45. Hoffmann, K., Gowin, J., Hartfelder, K. & Korb, J. The scent of royalty: A P450 gene signals reproductive status in a social insect. Mol. Biol. Evol. 31, 2689–2696 (2014).
46. Sheeja, C. C., Thushara, V. V. & Divya, L. Caste-Specific Expression of Na+/K+-ATPase in the Asian Weaver Ant, Oecophylla smaragdina (Fabricius, 1775). Neotrop. Entomol. (2018). doi:10.1007/s13744-018-0598-3
47. Iovinella, I. et al. Antennal protein profile in honeybees: Caste and task matter more than age. Front. Physiol. (2018). doi:10.3389/fphys.2018.00748
48. Sumner, S. The importance of genomic novelty in social evolution. Mol. Ecol. 23, (2014).

49. Simões, J. M. The social brain: how social stimuli are translated into neuroendocrine signals. (University of Lisboa, 2014).
50. Mariano, D. O. C. et al. Bottom-up proteomic analysis of polypeptide venom components of the giant ant Dinoponera quadriceps. Toxins (Basel). (2019). doi:10.3390/toxins11080448
51. Sasa, C., Miyazaki, S., Higashi, S. & Miura, T. Gene expressions for the sexually-dimorphic antennae in a ponerine ant. in (IUSSI).
52. Yanay, C., Morpurgo, N. & Linial, M. Evolution of insect proteomes: Insights into synapse organization and synaptic vesicle life cycle. Genome Biol. (2008). doi:10.1186/gb-2008-9-2-r27
53. Rehan, S. M., Berens, A. J. & Toth, A. L. At the brink of eusociality: transcriptomic correlates of worker behaviour in a small carpenter bee. BMC Evol. Biol. 14, 260 (2014).
54. Kronauer, D. J. & Libbrecht, R. Back to the roots: the importance of using simple insect societies to understand the molecular basis of complex social life. Curr. Opin. Insect Sci. 28, 33–39 (2018).
55. Noll, F. B., Wenzel, J. W. & Zucchi, R. Evolution of Caste in Neotropical Swarm-Founding Wasps (Hymenoptera: Vespidae; Epiponini). Am. Museum Novit. 3467, 1–24 (2004).
56. Richards, O. W. The social wasps of the Americas. (1978).
57. Sumner, S., Bell, E. & Taylor, D. A molecular concept of caste in insect societies. Curr. Opin. Insect Sci. 42–50 (2017). doi:10.1016/j.cois.2017.11.010
58. Kohannim, O. et al. Boosting power for clinical trials using classifiers based on multiple biomarkers. Neurobiol. Aging (2010). doi:10.1016/j.neurobiolaging.2010.04.022
59. Montagna, T. S. & Antonialli, W. F. Morphological differences between reproductive and non-reproductive females in the social wasp Mischocyttarus consimilis Zikán (Hymenoptera: Vespidae). Sociobiology 63, 693–698 (2016).
60. Jeanne, R. L. Social Biology of the neotropical wasp, Mischocyttarus drewseni. Bull. Museum Comp. Zool. 144, 63–150 (1972).

61. Murakami, A. S. N., Shima, S. N. & Desuó, I. C. More than one inseminated female in colonies of the independent-founding wasp Mischocyttarus cassununga von Ihering (Hymenoptera, Vespidae). Rev. Bras. Entomol. 53, 653–662 (2009).
62. Reeve, H. K. Polistes. in The Social Biology of Wasps (ed. Ross KG, M. R.) 99–148 (Cornell University Press, 1991).
63. Gelin, L. F. F. et al. Morphological Caste Studies In The Neotropical Swarm-Founding Polistinae Wasp Angiopolybia pallens (Lepeletier) (Hymenoptera: Vespidae). Neotrop. Entomol. 37, 691–701 (2008).
64. West-Eberhard, M. J. Temporary Queens in Metapolybia Wasps: Nonreproductive Helpers Without Altruism? Science 200, 441–443 (1978).
65. Sakagami, S. F., Zucchi, R., Yamane, S., Noll, F. B. & Camargo, J. M. P. Morphological caste differences in Agelaia vicina, the neotropical swarm-founding polistine wasp with the largest colony size among social wasps (Hymenoptera: Vespidae). Sociobiology 28, 207–223 (1996).
66. V., B. M., Noll, F. B. & Zucchi, R. Morphological Caste Differences and Non-Sterility of Workers in Brachygastra augusti (Hymenoptera, Vespidae, Epiponini), a Neotropical Swarm-Founding Wasp. J. New York Entomol. Soc. 111, 242–252 (2003).
67. Loope, K. J., Chien, C. & Juhl, M. Colony size is linked to paternity frequency and paternity skew in yellowjacket wasps and hornets. BMC Evol. Biol. 14, 1–12 (2014).
68. Hastings, M. D., Queller, D. C., Eischen, F. & Strassmann, J. E. Kin selection, relatedness, and worker control of reproduction in a large-colony epiponine wasp, Brachygastra mellifica. Behav. Ecol. 9, 573–581 (1998).
69. Gobbi, N., Noll, F. B. & Penna, M. A. H. ‘Winter’ aggregations, colony cycle, and seasonal phenotypic change in the paper wasp Polistes versicolor in subtropical Brazil. Naturwissenschaften (2006). doi:10.1007/s00114-006-0140-z
70. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics (2014). doi:10.1093/bioinformatics/btu170
71. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nat. Protoc. 8, 1494 (2013).
72. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nature Biotechnology (2017). doi:10.1038/nbt.3820

73. Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods (2021). doi:10.1038/s41592-021-01101-x
74. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59 (2014).
75. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 - Approximately maximum-likelihood trees for large alignments. PLoS One (2010). doi:10.1371/journal.pone.0009490
76. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
77. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods (2012). doi:10.1038/nmeth.1923
78. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D. & Weingessel, A. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version (2011).
79. Karagiannopoulos, M., Anyfantis, D., Kotsiantis, S. B. & Pintelas, P. E. Feature Selection for Regression Problems. 8th Hell. Eur. Res. Comput. Math. its Appl. HERCMA 2007 (2007). doi:10.1109/ICDM.2014.63
80. Löytynoja, A. Phylogeny-aware alignment with PRANK. Methods Mol. Biol. (2014). doi:10.1007/978-1-62703-646-7_10
81. Yang, Z. Paml: A program package for phylogenetic analysis by maximum likelihood. Bioinformatics (1997). doi:10.1093/bioinformatics/13.5.555
82. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995).
83. Alexa, A. & Rahnenführer, J. Gene set enrichment analysis with topGO. Bioconductor Improv. (2007).
84. Götz, S. et al. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res. 36, 3420–35 (2008).
85. Piekarski, P. K., Carpenter, J. M., Lemmon, A. R., Lemmon, E. M. & Sharanowski, B. J. Phylogenomic evidence overturns current conceptions of social evolution in wasps (Vespidae). Mol. Biol. Evol. 35, 2097–2109 (2018).
86. O’Donnell, S. Reproductive caste determination in eusocial wasps (Hymenoptera: Vespidae). Annu. Rev. Entomol. 43, 323–46 (1998).
87. Matsuura, M. & Yamane, S. Biology of the vespine wasps. (Springer Verlag, 1990).
88. Menezes, R. S. T., Lloyd, M. W. & Brady, S. G. Phylogenomics indicates Amazonia as the major source of Neotropical swarm-founding social wasp diversity. Proc. R. Soc. B (2020).

Main Figures

Fig. 1 | Social wasps as a model group. The nine species of social wasps used in this study, and their characteristics of social complexity. The Polistinae and Vespinae are two subfamilies comprising 1100+ and 67 species of social wasp respectively, all of which share the same common non-social ancestor, a eumenid-like solitary wasp⁸⁵. The Polistinae are an especially useful subfamily for studying the process of the major transition as they include species that exhibit simple group living in small groups (<10 individuals) of totipotent relatives, as well as species with varying degrees of more complex forms of sociality, with different colony sizes, levels of caste commitment and reproductive totipotency⁸⁶. The Vespinae include the yellow-jackets and hornets, and are all superorganismal, meaning caste is determined during development in caste-specific brood cells; they also show species-level variation in complexity, in terms of colony size and other superorganismal traits (e.g. multiple mating, worker policing)⁸⁷. Ranked in order of increasing levels of social complexity, from simple to more complex, these species are: Mischocyttarus basimacula basimacula, Polistes canadensis, Metapolybia cingulata, Angiopolybia pallens, Polybia quadricincta, Agelaia cajennensis, Brachygastra mellifica, Vespa crabro and Vespula vulgaris (see Supplementary Methods for further details of species choice). Where evidence of morphological castes was not available from the literature, we conducted morphometric analyses of representative queens and workers from several colonies per species (see Supplementary Methods; Supplementary Table S8). Image credits: M. basimacula (Stephen Cresswell); A. cajennensis (Gionorossi; Creative Commons); V. vulgaris (Donald Hobern; Creative Commons); V. crabro (Patrick Kennedy); P. canadensis, M. cingulata, A. pallens, P. quadricincta (Seirian Sumner); B. mellifica (Amante Darmanin; Creative Commons).

[Figure 2 image: panels a (Pre normalisation) and b (Post normalisation); axes PC1 (28.9%) and PC2 (16.02%) for panel a, PC1 (36.85%) and PC2 (15.64%) for panel b; points labelled by genus; legend: Queen upregulated, Worker upregulated.]

Fig. 2 | Principal component analyses of orthologous gene expression before and after between-species normalisation. a) Principal component analysis performed using log2 transcripts per million (TPM) gene expression values. This analysis used single-copy orthologs (identified using Orthofinder), allowing up to three gene isoforms in a single species to be present, in which case we took the most highly expressed isoform to represent the orthogroup, and filtering out orthogroups with expression below 10 counts per million. b) Principal component analysis of the species-normalised and scaled TPM gene expression values using the same filters as (a). Caste is denoted by purple (queen) or blue (worker). Each species has a different shape/symbol.

Fig. 3 | Overlap of differential caste-biased genes (queen vs worker) and their functions. a) Heatmap showing the differentially expressed genes that are caste-biased in at least four species (identified using edgeR), using the orthologous genes present in the nine species. In cases where up to three isoforms exist for a single species orthogroup, only the isoform with the greatest expression was considered; in addition, up to 2 species could have missing gene data for each orthogroup. Listed for each species (in brackets) is the total number of orthologous differentially expressed genes per species. Metapolybia BLAST hits are listed along with the orthogroup name. b) Gene ontology histogram of overrepresented terms among genes found differentially expressed in at least four out of the nine species (n=57 genes; in either queen or worker [not both]), using a background of all genes tested across the nine species. P values are single-tailed and Bonferroni corrected. c) Gene ontology for orthogroups with at least two species with caste-biased gene expression in the same direction (n=353 genes).

Fig. 4 | A genetic toolkit for social behaviours across eusocial wasps. a) Change in certainty of correct caste classifications through progressive feature selection using Support Vector Machine (SVM) approaches. Models were trained on eight species and tested on the ninth species. Features (i.e. orthologous genes) were sorted by linear regression with regard to caste identity within the 8 training species, beginning at 99%, where almost all genes were used for the predictions of caste, down to 1%, where only the top one percent of genes from the linear regression (sorted by P value) were used to train the model. ‘1’ equates to high queen classification certainty. This test was repeated using various filters, showing: LEFT: use of only 1to1 orthogroups, where no gene isoforms or missing values were accepted into the model; MIDDLE: allowing three gene isoforms (using the most highly expressed as the representative); and RIGHT: allowing three gene isoforms and missing values (NAs; up to 2 species with no ortholog), filling in with non-caste-biased gene values for queen and worker. Horizontal guide markers are placed at 0.4, 0.5 and 0.6, indicating the confidence score for classifying queens, and the vertical dotted line marks the percentage at which the regression P value becomes < 0.05. b) Histograms indicate the Bonferroni-corrected enriched gene ontology (GO) terms for the top orthologous genes from the corresponding model above (linear regression P value < 0.05; shown as dotted line in upper plots), tested against a background of all genes used in the corresponding SVM model (i.e. with a single gene representative for all species in the test). P values are single-tailed. c) Heatmap of the top 50 genes after sorting by regression, in the SVM with 3 merged isoforms and 2 NAs, showing species-normalised gene expression levels in the nine species for queen (Q) and worker (W) samples (where red indicates down-regulation and blue up-regulation; row Z-score), orthogroup name and top Metapolybia BLAST hit.

1224 Fig.5 | Testing for the presence of a distinct ‘simple society toolkit’ and a

1225 ‘complex society toolkit’. a): Using either the four species with the simpler

1226 societies or the four with the more complex societies, we trained SVM models, using

1227 progressive filtering of genes (based on the linear regression); we then tested how

1228 well these models classified caste for the opposite set of species. UPPER panel:

1229 training set was the four simpler species, and the test set was the four more complex

1230 species. LOWER panel: training set was the four more complex species, and the test

1231 set was the four simpler species. Likelihood of being a queen from 0 to 1 is plotted

1232 for each species across the progressively filtered sets. The total number of

57 1233 orthologous genes used in each SVM model are shown (bottom left of each panel);

1234 the total number of these genes that were significant predictors of caste (p value <

1235 0.05) is listed at the dotted line. For each test (pair of Queen/Worker in each

1236 species), the SVM model was run using genes with only 1 homologous gene copy

b) Overlap of significant genes among the three putative toolkits, comparing those genes for ‘simpler species’ (Fig. 5a, UPPER panel), ‘more complex species’ (Fig. 5a, LOWER panel), and ‘full spectrum’ (Fig. 4a, left) toolkits. For each experiment, the number of genes (orthogroups) tested and the total number that were significant after linear regression are given outside the Venn diagram. Inside the Venn diagram are the numbers of these significant genes shared (or not) between the three groups. Significance of overlap was assessed using one-tailed hypergeometric tests. The colours represent the genes that predict caste in: the more complex societies (blue), the simpler societies (grey) and across the full spectrum (pink). There is significant overlap of both the simpler and the more complex gene sets with the full spectrum set, but not between the simpler and complex societies, suggesting that there may be distinct differences in the genes regulating castes at different levels of social complexity. c) Enriched gene ontology terms (TopGO), using a background of all genes tested in each individual experiment and uncorrected one-tailed p values.
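The one-tailed hypergeometric overlap test named in panel b can be illustrated with a minimal sketch. The function name, argument names and toy numbers below are assumptions made for illustration only; they are not the authors' code or data.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, overlap):
    """Return P(X >= overlap) for X ~ Hypergeometric(population, successes, draws).

    population: genes in the shared background (hypothetical name)
    successes:  significant genes in list A
    draws:      significant genes in list B
    overlap:    genes significant in both lists
    """
    max_k = min(successes, draws)
    total = comb(population, draws)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


# Toy numbers (not from the paper): 10 background genes, 4 significant in
# list A, 3 significant in list B, 2 shared. The tail is 40 / 120 = 1/3.
p_value = hypergeom_upper_tail(10, 4, 3, 2)
print(p_value)
```

A small p value here would indicate that two gene lists share more genes than expected by chance, given the common background.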


Supplementary Figures

Supp. Figure 1 | Phylogenetic tree of social wasps used in this study, and related hymenopterans, generated using OrthoFinder (SpeciesTree_rooted.txt). Drosophila was chosen as the root of the species tree (not shown), with two representative ants (Dinoponera quadriceps and Solenopsis invicta) and two bees (Apis mellifera and Nomia melanderi). Colours show groupings of ants, bees and wasps (Vespinae or Polistinae). For the nine wasp species (this study) we sequenced adult caste-specific brain transcriptomes (queen and worker). As expected: Vespinae are clearly separated from the Polistinae; Angiopolybia is basal to the other Epiponini; independent-founding, non-superorganismal wasps (Polistes and Mischocyttarus) are basal to the Polistinae (refs 85,88).


[Supp. Figure 2 image: four panels, one per kernel type (Radial, Polynomial, Linear, Sigmoidal). x-axis: percentage kept after filtering (99 to 2); y-axis: classification confidence values (0.0 to 1.0). One line per queen sample: V_cra, V_vul, A_caj, P_can, A_pal, M_cin, M_bas, P_qua, B_mel.]

Supp. Figure 2 | SVM kernel type iterations. For each of the four kernel types, a filtered SVM plot shows the leave-one-out experiment. Each coloured line represents the prediction for the queen sample at various regression filters, from 99% data inclusion to 1% inclusion of only the top genes in the regression analysis. Species are labelled by genus (capital letter) and species (first three lower-case letters of the species name).
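The progressive filtering described above can be sketched as follows. This is a hedged illustration, not the authors' pipeline: it assumes that genes are ranked by their regression p value and that the most significant fraction is retained at each step. The function name, the gene identifiers and the p values are hypothetical; the filter steps are taken from the x-axis ticks of the plots.

```python
from math import ceil

# Percentages of genes kept at each step, as read from the figure axis.
FILTER_STEPS = [99, 90, 81, 72, 63, 54, 45, 36, 27, 18, 9, 2]


def keep_top_percent(p_values, percent):
    """Return the gene IDs of the `percent` most significant genes.

    p_values: dict mapping gene ID -> regression p value (assumed ranking key)
    percent:  percentage of genes to retain; at least one gene is always kept
    """
    # Sort by p value, breaking ties by gene ID so the result is deterministic.
    ranked = sorted(p_values, key=lambda gene: (p_values[gene], gene))
    n_keep = max(1, ceil(len(ranked) * percent / 100))
    return ranked[:n_keep]


# Toy p values (not from the paper).
toy = {"g1": 0.001, "g2": 0.20, "g3": 0.03, "g4": 0.75, "g5": 0.04}

for step in FILTER_STEPS:
    kept = keep_top_percent(toy, step)
    print(step, kept)
```

In a full analysis, each retained gene set would be used to retrain the classifier before predicting the held-out species.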


Supp. Figure 3 | dN/dS. a) All orthogroups that had at least one gene significantly under selection are listed (branch-site models). b) Overlap of positively selected genes and various toolkit gene lists. Top: the differential gene list overlap. Middle: the SVM gene list <0.05 for singleton orthogroups. Bottom: the three gene isoform merge and 2 NA gene list. The number of overlapping genes is given in the middle, and each sphere shows the total number of genes compared for each list (given that the two backgrounds do not overlap). A mutual background is listed, showing the total number of genes that could have been significant in both compared lists.


[Supp. Figure 4 image. Panel titles: a, Training with swarm-founding species only; b, Training with independent-founding species only; c, Training with Polistinae species only. Each plot shows one queen sample (Angiopolybia, Metapolybia, Agelaia, Polybia, Mischocyttarus, Polistes, Vespa or Vespula). x-axis: percentage kept after filtering (99 to 2); y-axis: classification confidence values (0.0 to 1.0).]

Supp. Figure 4 | SVM predictions using founding behaviour and phylogeny to subset the training data. SVM predictions are given for each single species when trained on the species listed to the left, showing the prediction after progressive feature selection from 95 to 1% of genes remaining after selection. Numbers of genes in each test are indicated in the lower left part of each plot. a) Training with independent-founders only, tested on the four swarm-founders (see Figure 1 for species). b) Training with swarm-founders and testing on the four independent-founders. c) Training with Polistine species only, and testing on the Vespines (Vespa and Vespula). The reverse was not possible, as the SVM requires a minimum number of samples to work.

Supplementary Files

This is a list of supplementary files associated with this preprint.

Supp.Table.S1RNAseqsamples.xlsx
Supp.Table.S2.OrthologyplusMerge.xlsx
Supp.Table.S3.OrthogroupDEGs.v2.xlsx
Supp.Table.S4.NormalisedSVMandresult.xlsx
Supp.Table.S5.SVMSuperNotSuper.xlsx
Supp.Table.S6.SVMNestTypePolistinae.xlsx
Supp.Table.S7.DnDs.xlsx
Supp.Table.S8Morphometricdataresults.xlsx