Social complexity, life-history and lineage influence the molecular basis of caste in a major transition in evolution
Christopher Wyatt ( [email protected] ) University College London Michael Bentley University College London Daisy Taylor University College London Emeline Favreau University College London https://orcid.org/0000-0002-0713-500X Ryan Brock University of East Anglia https://orcid.org/0000-0003-2977-1370 Benjamin Taylor University College London Emily Bell School of Biological Sciences, University of Bristol https://orcid.org/0000-0002-7309-5456 Ellouise Leadbeater Royal Holloway University of London https://orcid.org/0000-0002-4029-7254 Seirian Sumner UCL https://orcid.org/0000-0003-0213-2018
Article
Keywords: Superorganismality, Castes, Wasps, Transcriptomics
Posted Date: August 27th, 2021
DOI: https://doi.org/10.21203/rs.3.rs-835604/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Christopher D. R. Wyatt1*, Michael Bentley1,†, Daisy Taylor1,2,†, Emeline Favreau1, Ryan E. Brock2,3, Benjamin A. Taylor1, Emily Bell2, Ellouise Leadbeater4 & Seirian Sumner1*
1 Centre for Biodiversity and Environment Research, University College London, London, UK.
2 School of Biological Sciences, University of Bristol, United Kingdom, BS8 1TQ.
3 School of Biological Sciences, University of East Anglia, Norwich Research Park, Norwich, Norfolk, NR4 7TJ, UK.
4 Department of Biological Sciences, Royal Holloway University of London, Egham, UK.
* Corresponding authors
† Equal contribution
Christopher Douglas Robert Wyatt, Centre for Biodiversity and Environment Research, University College London, London, UK. E-mail: [email protected]; Phone +447307390637
Seirian Sumner, Centre for Biodiversity and Environment Research, University College London, London, UK. E-mail: [email protected]
29 Abstract
30 Major evolutionary transitions describe how biological complexity arises; e.g. in the
31 evolution of complex multicellular bodies, and superorganismal insect societies.
32 Such transitions involve the evolution of division of labour, e.g. among queen and
33 worker castes in insect societies. A key mechanistic hypothesis for the evolution of
34 division of labour is that a shared set of genes co-opted from a common solitary
35 ancestral ground plan - a so-called genetic toolkit for sociality - regulates insect
36 castes across different levels of social complexity. The vespid wasps represent an
37 excellent system in which to test this. Here, using conventional and machine learning
38 analyses of brain transcriptome data from nine species of vespid wasps, we find
39 evidence of a shared genetic toolkit across species representing different levels of
40 social complexity, with a large suite of genes classifying castes correctly across
41 species. However, we also found evidence of additional fine-scale differences in
42 predictive gene sets, functional enrichment and rates of gene evolution that were
43 related to level of social complexity, but also life-history traits (e.g. mode of colony
44 founding). Thus, there appear to be shifts in the gene networks regulating social
45 behaviour and rates of gene evolution that are influenced by innovations in both
46 social complexity and life-history. These results suggest that the concept of a shared
47 genetic toolkit for sociality may be too simplistic to fully describe the process of the
48 major transition to sociality, even within a single lineage. Diversity in lineage, social
49 complexity and life-history traits must be taken into account in the quest to uncover
50 the molecular bases of the major transition to sociality.
51
52 Main
53 The major evolutionary transitions span all levels of biological organisation,
54 facilitating the evolution of life’s complexity on earth via cooperation between single
55 entities (e.g. genes in a genome, cells in a multicellular body, insects in a colony),
56 generating fitness benefits beyond those attainable by a comparable number of
57 isolated individuals1. The evolution of sociality is one of the major transitions and is
58 of general relevance across many levels of biological organisation from genes
59 assembled into genomes, single-cells into multi-cellular entities, and insects
60 cooperating in superorganismal societies. The best-studied examples of sociality are
61 in the hymenopteran insects (bees, wasps and ants) - a group of over 17,000
62 species, exhibiting different forms of sociality across the transition from simple
63 sociality (with small societies where all group members are able to reproduce and
64 switch roles in response to opportunity), through to complex societies (consisting of
65 thousands of individuals, each committed during development to a specific
66 cooperative role and working for a shared reproductive outcome within the higher-
67 level ‘individual’ of the colony, known as the ‘superorganism’2). Recent analyses of
68 the molecular mechanisms of insect sociality have revealed how conserved suites of
69 genes, networks and functions are shared among independent evolutionary events
70 of insect superorganismality3–8. An outstanding question is whether there are fine-
71 scale differences in the genetic mechanisms operating at different levels of
72 complexity across the spectrum of the major transition – from the simplest to most
73 complex societies.
74 A key step in the evolution of sociality is the emergence of a reproductive
75 division of labour, where some individuals commit to reproductive or
76 non-reproductive roles, known as queens and workers respectively in the case of insect
77 societies. An overarching mechanistic hypothesis for social evolution is that the
78 repertoire of behaviours typically exhibited in the life cycle of the solitary ancestor of
79 social species were uncoupled to produce a division of labour among group
80 members with individuals specialising in either the reproductive (‘queen’) or
81 provisioning (‘worker’) phases of the solitary ancestor9. Such phenotypic decoupling
82 implies that there will be a conserved mechanistic toolkit that regulates queen and
83 worker phenotypes in species representing different levels of social complexity
84 across the spectrum of the major transition (reviewed in10). An alternative to the
85 shared toolkit hypothesis is that the molecular processes regulating social
86 behaviours in non-superorganismal societies (where caste remains flexible, and
87 selection acts primarily on individuals) differ fundamentally from those processes that
88 regulate social behaviours in superorganismal societies 11,12. Phenotypic innovations
89 across the animal kingdom have been linked to genomic evolution: taxonomically-
90 restricted genes13–16, rapid evolution of proteins17,18 and regulatory elements17,19
91 have been found in most lineages of social insects20. Indeed, some recent studies have
92 suggested that the processes regulating different levels of social complexity may be
93 different8,17,19,21. The innovations in social complexity and the shift in the unit of
94 selection (from individual- to group-level22) that occur in a major transition may
95 therefore be accompanied by shifts in genomic processes and evolution, raising
96 the question of whether the concept of a universal conserved genomic toolkit that
97 regulates social behaviours across the spectrum of the major transition may be too
98 simplistic23. The roles of conserved and novel processes are not necessarily
99 mutually exclusive; novel processes may coincide with phenotypic innovations, whilst
100 conserved mechanisms may regulate core processes at all stages of social
101 evolution.
102 Until recently, the data available to test these hypotheses were largely limited
103 to species that represent either the most complex – superorganismal - forms of
104 sociality (e.g. most ants, honeybees24), or the simplest forms of social complexity as
105 non-superorganisms (e.g. Polistes wasps7,25–27 and incipiently social bees28–31), the
106 latter of which likely represent the first stages in the major transition. Recent studies
107 have identified core gene sets underlying caste-differentiated brain gene expression
108 across a range of ants5 and bees8; ants lack ancestrally non-superorganismal
109 representatives and so may not be informative for studying simpler societies and,
110 putatively, the process of social evolution32. Analyses of molecular evolution across
111 two independently evolved lineages of sociality in bees showed increased levels of
112 regulatory complexity with social complexity. These insights highlight the need for a
113 fine-scale dissection of the molecular processes regulating social behaviour for
114 species representing different forms of social complexity8, but also the need to
115 consider the influence of life-history traits on gene evolution, as a selective process
116 that is independent of complexity per se.
117 A promising group for unpicking the molecular basis of social complexity are
118 the vespid wasps33, which exhibit the full diversity of social complexity among some
119 1,100 species, as well as diversity in life-history traits such as mode of colony
120 founding. Aculeate wasps are of paramount importance in studying the evolution of
121 sociality in the Hymenoptera as they represent the evolutionary root to both bees
122 and ants34. However, to date sociogenomic studies of caste evolution in wasps have
123 been limited to Polistes 7,25,35. Here we generated brain transcriptomic data of
124 caste-specific phenotypes for nine species of social wasps, representing diversity in social
125 complexity and mode of colony founding, a life-history trait that is independent of
126 social complexity (Fig. 1). We employed a machine-learning algorithm - support
127 vector machines (SVMs) – alongside more conventional gene expression analyses
128 to conduct fine-scale analysis of the molecular processes associated with caste, to
129 determine whether there is a conserved genetic toolkit for social behaviour in
130 species displaying different forms of sociality and that are putative proxy
131 representatives of the process of the major transition from non-superorganismal
132 (simple) to superorganismal (complex) species in Vespid wasps (Aim 1). We then
133 further interrogate these data to identify how transcriptomic signatures of social
134 behaviour are influenced by social complexity (i.e. simpler versus the more complex
135 societies), mode of colony foundation (i.e. independent-founding versus
136 swarm-founding) as a key life-history trait in vespids, and lineage (vespines versus polistines).
137 Our results provide the first evidence of a conserved genetic toolkit across forms of
138 social complexity in the spectrum of the major transition to sociality in wasps; they
139 also reveal how there are fine-scale differences in the molecular patterns and
140 processes between simple and complex societies, and with life-history traits. Finally,
141 we discuss how patterns of gene evolution contrast with those reported in
142 bees and ants. These results suggest the concept of a shared genetic toolkit across
143 the spectrum of sociality and across lineages may be too simplistic, and that level of
144 social complexity and life-history traits influence the molecular processes
145 underpinning social behaviour.
146
147 Results
148 We chose one species from each of nine different genera of social wasps from
149 among the two main subfamilies - Polistinae and Vespinae; collectively these
150 species represent the full spectrum of social diversity (detail on species choice given
151 in Fig. 1 and Methods). For each species, we sequenced RNA extracted from whole
152 brains of adults to construct de novo brain transcriptomes for the two main social
153 phenotypes – adult reproductives (defined as mated females with developed ovaries,
154 henceforth referred to as ‘queens’ for simplicity; see Supplementary Table S1) and
155 adult non-reproductives (defined as unmated females with no ovarian development,
156 henceforth referred to as ‘workers’; see Supplementary Methods). Using these data
157 we could reconstruct a phylogenetic tree of the Hymenoptera using single-copy
158 orthologous genes (Orthofinder36), resulting in expected patterns of phylogenetic
159 relationships (Supp. Fig. 1). This dataset allows us to test the extent to which the
160 same molecular processes underpin the evolution of social phenotypes of different
161 forms of sociality, as extant proxies in the major transition to superorganismality in
162 wasps.
163
164 Aim 1: Is there a shared genetic toolkit for caste among species across the
165 major transition from non-superorganismality to superorganismality in Vespid
166 wasps?
167 We found several lines of evidence for a shared genetic toolkit for caste across the
168 wasp species using two different analytical approaches.
169 Caste explains gene expression variation, after species-normalisation
170 The main factor explaining gene expression variation (using single copy
171 orthogroups) was species identity, and to some extent subfamily, as all the Polistinae
172 clustered tightly (Fig. 2a). However, since we are interested in determining whether
173 there is a shared toolkit of caste-biased gene expression across species, we needed
174 to control for the effect of species in our data. To do this, we performed a between-
175 species normalisation on the transcript per million (TPM) score, scaling the variation
176 of gene expression to a range of -1 to 1 (see Supplementary Methods). After
177 species-normalisation, the samples separate mostly into queen and worker
178 phenotypes in the top two principal components (Fig. 2b). This suggests that subsets
179 of genes (a potential toolkit) are shared across these species and are representative
180 of caste differences. However, there were outliers: Brachygastra did not cluster with
181 any of the other samples (despite correct PC1 queen direction); Agelaia showed little
182 caste-specific separation and in the opposite direction to the other species (Fig. 2b).
183 These two outlier species may not share the same caste-specific patterns as the
184 other species, but we cannot rule out data and/or sampling anomalies.
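The between-species normalisation described above can be sketched in a few lines. The exact procedure is given in the Supplementary Methods, which are not reproduced here, so the following is only one plausible reading: each gene's TPM values are rescaled to the range -1 to 1 separately within each species (min-max scaling). The function name, the data layout and the toy values are hypothetical.

```python
def normalise_within_species(tpm_by_species):
    """Rescale each gene's TPM values to [-1, 1] separately within each species.

    tpm_by_species maps species -> gene -> list of TPM values (one per sample).
    ASSUMPTION: per-species, per-gene min-max scaling; the paper's exact
    procedure is described in its Supplementary Methods.
    """
    scaled = {}
    for species, genes in tpm_by_species.items():
        scaled[species] = {}
        for gene, values in genes.items():
            lo, hi = min(values), max(values)
            if hi == lo:
                # No variation within this species: place every sample at the midpoint.
                scaled[species][gene] = [0.0 for _ in values]
            else:
                scaled[species][gene] = [2 * (v - lo) / (hi - lo) - 1 for v in values]
    return scaled


# Hypothetical toy data: the same relative pattern at very different absolute TPM.
toy = {
    "Polistes": {"OG1": [10.0, 30.0, 20.0]},
    "Vespa": {"OG1": [1000.0, 3000.0, 2000.0]},
}
scaled = normalise_within_species(toy)
```

After scaling, the two toy species carry identical values for OG1, which is the point of the step: species-level differences in absolute expression no longer dominate, so any remaining structure is more likely to reflect caste.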
185 Next, we used inferred lists of orthologs allowing some flexibility in numbers of
186 isoforms (maximum of three isoforms, taking the most highly expressed) for each
187 orthogroup and missing data (maximum of two species), allowing larger matrices of
188 genes to be tested for differences between castes (Supplementary Table S2). Using
189 this approach we detected 57 orthologous genes that were caste-biased (queen- or
190 worker-biased) in four or more species (Fig. 3a; Supplementary Table S3) and 353
191 orthologous genes (Supplementary Table S3) with caste-biased expression in at
192 least two of nine species in the same direction (either queen- or worker-biased only).
193 There was overrepresentation of pheromone, chitin/odorant binding and muscle
194 related GO terms using these two cut-offs (Fig. 3b/c; Supplementary Table S3). No
195 ortholog showed consistent caste-biased differential expression across all nine wasp
196 species (Fig. 3a; Supplementary Table S3; using unadjusted p values < 0.05).
197 Despite this, notable signatures of caste regulation were apparent across the
198 species, with the exception of Brachygastra. For example, orthogroup OG0001418
199 was upregulated in queens from eight of the nine species and is predicted to belong
200 to the vitellogenin gene family (using the Metapolybia protein sequence to represent
201 the orthogroup), a transcriptional target and mediator of the insect physiology-
202 regulator juvenile hormone (JH). JH is known as a pleiotropic master regulator of
203 social traits, such as aggression, foraging behaviour and reproduction; in both
204 Vespinae and Polistinae JH has been shown experimentally to regulate reproduction
205 and is thought to be an honest fertility signal37,38. A second gene of interest that was
206 consistently queen-biased was oocyte zinc finger 22. Zinc finger proteins have
207 diverse functions but broadly they regulate gene expression by binding to DNA or
208 RNA and controlling transcription; they are caste-biased in a range of social insects
209 and have also been identified as targets for positive selection in social evolution 21
210 and so could be a key transcription factor driving wide expression patterns among
211 castes across our species.
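The ortholog counts reported earlier in this section (genes caste-biased in the same direction in at least two, or at least four, species) amount to a simple tally over per-species differential expression calls. A minimal sketch follows, assuming that "same direction" means every species with a significant call agrees on the direction; the function name, input layout and orthogroup labels are hypothetical.

```python
from collections import Counter


def consistent_bias(calls, min_species):
    """Return orthogroups biased in a single direction in >= min_species species.

    calls maps orthogroup -> {species: "queen" or "worker"}, listing only the
    species in which the orthogroup was called caste-biased.
    ASSUMPTION: an orthogroup with calls in both directions is excluded.
    """
    kept = {}
    for orthogroup, by_species in calls.items():
        counts = Counter(by_species.values())
        if len(counts) == 1:
            (direction, n_species), = counts.items()
            if n_species >= min_species:
                kept[orthogroup] = direction
    return kept


# Hypothetical calls for three orthogroups.
calls = {
    "OG_a": {"sp1": "queen", "sp2": "queen"},
    "OG_b": {"sp1": "queen", "sp2": "worker"},
    "OG_c": {"sp1": "worker"},
}
kept = consistent_bias(calls, min_species=2)
```

Only OG_a survives here: OG_b is biased in opposite directions and OG_c is significant in a single species.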
212
213 A toolkit of many genes with small effects predicts caste across different
214 forms of sociality in wasps
215 Conventional differential expression analyses (e.g. edgeR) require a balance of P
216 value cut-offs and fold change requirements to reduce false-positive and
217 false-negative errors39. Therefore, consistent patterns of many genes with smaller effect
218 sizes may be missed when applying strict statistical measures. Support vector
219 machine (SVM) learning approaches use a supervised learning model capable of
220 detecting subtle but pervasive signals in differential expression between conditions
221 (e.g. for classification of single cells40,41, cancer cells42 and in social insect castes43).
222 We used this approach to test whether gene expression can successfully classify
223 caste identity for unknown samples; accurate classification of samples as queens or
224 workers based on their global transcription patterns would be evidence for a genetic
225 toolkit underpinning social phenotypes.
226 Starting with a “leave-one-out” SVM approach, we classified queen and
227 worker samples, using a predictive gene set generated from a model trained on
228 caste-specific gene expression from eight of the nine species, with the ninth species
229 being the test sample. We applied three different iterations of the leave-one-out
230 approach using different Orthofinder conversion tables (Supp. Table S2) and a radial
231 SVM model (see Methods; Supplementary Fig. 2). Our first approach used the “true-
232 single-copy” orthologues that could be detected with sufficient expression (Fig. 4a-
233 left; Supp. Table S4) across all nine species (n=1221 genes); after progressive
234 feature selection (based on linear regression, from left to right; see Methods) the top
235 259 genes (<0.05 from linear regression) could predict the queen caste correctly in
236 seven of the nine species (Fig. 4a- left; with >=0.6 likelihood in the queen sample
237 within the top 20% of genes); the two outlier species (Agelaia and Brachygastra)
238 showed generally lower predictions of queen likelihood (<0.5) and had previously
239 been highlighted as outliers (Figs. 2 & 3). We explored two other orthology lists in
240 order to explore whether, if the single-copy ortholog rule was relaxed and more
241 orthogroups were included, caste predictions would remain accurate: in one model
242 we allowed up to 3 isoforms per gene (Fig. 4a-middle: Supp. Table S4; n=1831
243 genes), and in the third we allowed up to two out of nine species to have a missing
244 gene representative of the orthogroup, plus up to three isoforms per gene (Fig. 4a-
245 right: Supp. Table S4; n=5536 genes). Even with these relaxed conditions, caste was
246 classified accurately for all species except for the aforementioned outliers. This
247 suggests that large numbers of genes with caste-biased expression exist across
248 social wasps.
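The leave-one-species-out design described above can be illustrated with a short sketch. To keep it runnable with the standard library alone, a nearest-centroid classifier stands in for the radial SVM used in the paper, and the progressive feature selection step is omitted; the species labels and expression values are hypothetical and assumed to be already species-normalised.

```python
def centroid(rows):
    """Mean expression vector of a list of equal-length samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leave_one_species_out(samples):
    """samples: list of (species, caste, expression_vector) tuples.

    Each species is held out in turn; the classifier is fitted on the
    remaining species and used to predict caste for the held-out samples.
    Returns held-out species -> fraction of its samples classified correctly.
    NOTE: nearest-centroid is a stand-in for the paper's radial SVM.
    """
    results = {}
    for held_out in sorted({s for s, _, _ in samples}):
        train = [(c, x) for s, c, x in samples if s != held_out]
        test = [(c, x) for s, c, x in samples if s == held_out]
        centroids = {
            caste: centroid([x for c, x in train if c == caste])
            for caste in ("queen", "worker")
        }
        correct = sum(
            min(centroids, key=lambda k: sq_dist(x, centroids[k])) == c
            for c, x in test
        )
        results[held_out] = correct / len(test)
    return results


# Hypothetical species-normalised expression for two genes.
samples = [
    ("sp_A", "queen", [0.9, -0.8]), ("sp_A", "worker", [-0.7, 0.6]),
    ("sp_B", "queen", [0.8, -0.9]), ("sp_B", "worker", [-0.9, 0.8]),
    ("sp_C", "queen", [0.7, -0.6]), ("sp_C", "worker", [-0.8, 0.9]),
]
accuracy = leave_one_species_out(samples)
```

The key property is that no sample from the test species contributes to the fitted model, so a correct call reflects expression patterns shared across species rather than patterns memorised within one.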
249 We found some level of functional enrichment among the predictive genes
250 that remained in the models after feature selection. There were consistent patterns
251 of enrichment for GO terms in the two more relaxed models (Fig 4b- middle & 4b-
252 right), with genes enriched for mostly ionic transport and visual/sensory perception
253 (Supplementary Table S4). Expression of the top 50 genes (from linear regression;
254 chosen from Fig 4a- left) allows us to identify the putatively most important genes (Fig.
255 4c); some have previously been identified as associated with caste differentiation in
256 social insects (e.g. vitellogenin; zinc finger (as mentioned above) and ubiquitin);
257 others have not been associated with caste previously but may have relevant
258 functions in regulating caste. For example, striatin-4 (OG0005432) is involved in
259 calmodulin binding and thus may play a role in activating (or inactivating)
260 calcium-binding; adenosylhomocysteinase (OG0002120), one of the most conserved
261 proteins in living organisms, facilitates local transmethylation and thus may be
262 important in active transcription and epigenetic processes44. Interestingly, the vast
263 majority of the putatively important genes are downregulated in queens, and
264 upregulated in workers, suggesting that perhaps the transcriptomic demands of
265 worker behaviour may be more diverse than those involved in being a queen, whilst
266 queens express a more restricted molecular toolkit as expected if they are
267 specialised in reproduction.
268 We compared the genes and GO terms obtained from the SVM and edgeR
269 methods (Fig. 3c). There was little overlap in enrichment of GO terms between the
270 two datasets. Moreover, there was less overlap than expected by chance between
271 the 1420 SVM predictor genes and the 353 genes identified as differentially
272 expressed (edgeR; <0.05 P value): in comparisons that included at least two species in
273 the same direction, only 35 genes overlapped (Hypergeometric: 1.89e-14, under-
274 enriched 2.59 fold; Supplementary Table S4). Of the genes that are present in both
275 sets, some have previously been identified as having relevance to social evolution
276 and caste differentiation; these include Vitellogenin (as above) 45 (two protein
277 isoforms: OG0001119 and OG0001418), zinc finger (as above) (OG0005731) and
278 gonadotrophin-releasing hormone, which is related to the regulation of reproduction.
279 ATP-synthase has caste specific expression in the weaver ant46 (OG0006255 and
280 OG0006612) and esterase E4-like (OG0000783) is upregulated in young honeybee
281 queens compared to nurses at the proteomic level47. There are also other genes of
282 interest, which to our knowledge have not previously been associated with caste, including
283 Toll-like receptor 8 (OG0001441) (see Supplementary Table S4). Finally, 40% of the
284 overlapping genes (n=14) are uncharacterised proteins, suggesting a role for
285 potentially novel genes; novel (or taxon-restricted) genes have been previously
286 highlighted for their potential role in social evolution across bees and wasps48.
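The under-enrichment of the overlap between the SVM and edgeR gene sets can be checked with a lower-tail hypergeometric test. The sketch below assumes a background of 5536 genes (the size of the most relaxed orthogroup set); the background actually used is not stated in the text, so this value is an assumption. With it, the expected overlap is about 90.5 genes, consistent with the reported 2.59-fold under-enrichment; the resulting p-value need not match the reported one exactly. Function and variable names are hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_lower_tail(overlap, universe, set_a, set_b):
    """P(X <= overlap) when set_b genes are drawn without replacement from a
    universe of genes that contains set_a marked genes."""
    total = comb(universe, set_b)
    tail = sum(
        comb(set_a, k) * comb(universe - set_a, set_b - k)
        for k in range(overlap + 1)
    )
    # Exact rational arithmetic avoids overflow before converting to float.
    return float(Fraction(tail, total))


universe = 5536  # ASSUMPTION: background taken as the relaxed orthogroup set
svm_genes, de_genes, observed = 1420, 353, 35

expected = svm_genes * de_genes / universe  # overlap expected by chance
fold_under = expected / observed            # fold under-enrichment
p_lower = hypergeom_lower_tail(observed, universe, svm_genes, de_genes)
```

A lower-tail probability is the appropriate one here because the question is whether the two methods share fewer genes than chance would predict, not more.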
287
288
289 Aim 2: How does level of social complexity, phylogeny and life-history
290 influence the molecular basis of castes?
291 Level of social complexity may influence the genes regulating caste.
292 To explore whether there are any finer-scale differences among the different levels
293 of social complexity, we trained an SVM model using Mischocyttarus, Polistes,
294 Angiopolybia and Metapolybia as representatives of the simpler societies in the
295 major transition (see Fig. 1), and tested how well this gene set classified castes for
296 the four species representing the more complex societies in the major transition
297 (Polybia, Agelaia, Vespa and Vespula; see Fig. 1). Brachygastra was excluded due
298 to its poor performance overall (see Fig. 4) and to balance sample sizes in training
299 and test sets. If castes in the test species classify well, this would suggest that the
300 processes regulating castes in the simpler societies are equally important in the
301 more complex societies (i.e. there is no specific caste toolkit for social behaviour in
302 simple societies, which is absent (putatively lost) in more complex societies).
303 Conversely, if the test species do not classify well, this would suggest that there are
304 distinct processes regulating caste in the simpler societies that are less important in
305 more complex forms of sociality.
306 When training the model on the four species with simple societies, we found
307 that both vespines (Vespa and Vespula) had good classification confidences at
308 around 0.8 (Fig. 5a), comprising around 800 genes after feature selection. Similarly,
309 Polybia also classified well, though with lower confidence. Agelaia could not classify
310 castes correctly; however, this species was identified as an outlier in our first
311 analyses and this may reflect some facet of their biology that we do not yet
312 understand, or an anomaly with the data. Considering these four tests together, a set
313 of 277 gene orthologs were found (Supplementary Table S5; p value <0.05 in all four
314 species tests), that represent a core set of genes that classify caste correctly in
315 simple societies. Based on these species, these results suggest that the genetic
316 toolkit for simple societies is well conserved in the more complex societies that we
317 sampled, although future work should confirm this by expanding the repertoire of
318 species.
319 The reciprocal test performed less well, suggesting there may be different
320 processes regulating more complex sociality, which are absent from the simpler
321 societies. We trained the SVM using the four species with the more complex
322 societies (Polybia, Agelaia, Vespa and Vespula) and tested it on the four species
323 with simpler societies (Mischocyttarus, Polistes, Metapolybia and Angiopolybia).
324 After feature selection, around 350 genes were found in each of the four tests, and
325 289 were found in all four tests, describing castes in these more complex societies
326 (Supplementary Table S5). In general, these genes were successful in classifying
327 castes for the simpler societies (Fig. 5a-lower), although none of the species
328 matched the high classification confidence of the vespines in the reciprocal test. The
329 lower overall success of the gene set trained on the more complex species in
330 classifying castes in species from the simpler societies suggests there may be
331 different (additional) processes regulating castes at the more complex levels of
332 sociality.
333 To explore this idea further, we tested whether there was significant overlap
334 among the three putative toolkits, by comparing the identities of genes predictive of
335 ‘simpler societies’ (n=277), ‘more complex societies’ (n=289), and ‘full spectrum’ (Fig
336 4a- Left; n=259) toolkits. We found significant overlap between the combined genes
337 in the simpler and the more complex ‘toolkits’ with the full spectrum set (Fig 5b grey
338 vs pink; blue vs pink), but not between the simpler and more complex societies (grey
339 vs blue). This suggests that there may be distinct differences in the genes regulating
340 castes at different levels of social complexity. We found further suggestive evidence
341 for this at the putative functional level: there were no consistent patterns of GO terms
342 across the two toolkits, with only the complex society toolkit having significant levels
343 of enrichment after correction (Fig. 5c- mostly cation/ion transport; Supplementary
344 Table S5).
345 Further support for there being different molecular processes shaping caste
346 transcription at different levels of social complexity was found in the patterns of gene
347 evolution. Overall, the highest rates of dN/dS were in the species with simpler
348 societies; this was true across all nine species (dN/dS mean of orthogroups in
349 foreground branch with simple species = 0.1639218; with complex species =
350 0.1015024, Wilcoxon test W = 307318, p-value < 2.2e-16) and also just within the
351 Polistinae (dN/dS mean of orthogroups in foreground branch with simple species =
352 0.180918; with complex species = 0.1327845; Wilcoxon test W = 119615, p-value =
353 2.824e-11). Some orthogroups had experienced significant positive selection (i.e. on
354 the given foreground branch, null model rejected, chi-square test p < 0.05, dN/dS
355 >= 1; a total of eleven orthogroups). Six of these orthogroups were under positive
356 selection in the simplest polistines relative to the more complex polistines
357 (Supplementary Table S7-ST15 n = 6; OG0002957 (transmembrane 234 homolog
358 isoform X1), OG0003204 (mitochondrial ribonuclease P 3, upregulated in zebrafish
359 behavioural experiments49), OG0003159 (lysozyme c-1-like, in ant venom50),
360 OG0003184 (doublesex isoform X1, female-specific gene expression in ant51),
361 OG0005376 (uncharacterized protein LOC107063793), OG0005104 (box C D
362 snoRNA 1)). Four were among the more complex polistines, relative to the simpler
363 species (Supplementary Table S7-ST16: OG0003362 (UMP-CMP kinase isoform
364 X1), OG0003491 (uncharacterized protein LOC107067502), OG0004867
365 (uncharacterized protein LOC106789623), OG0005164 (probable tubulin
366 polyglutamylase TTLL9)). However, when the vespines were also included, only one
367 orthogroup was under positive selection (Supplementary Table S7-ST14;
368 OG0003694 (transmembrane 59-like isoform X1)), suggesting there may be some
369 lineage effects here, although wider taxonomic sampling of vespines would be
370 required to confirm this. Of the 11 significant orthogroups identified across these
371 analyses, only one was differentially expressed with respect to caste in two species
372 (OG0004867, uncharacterized protein LOC106789623), and four other orthogroups
373 (OG0003362: UMP-CMP kinase isoform X1; OG0003491: uncharacterized protein
374 LOC107067502; OG0003159: lysozyme c-1-like; OG0003184: doublesex isoform
375 X1) were differentially expressed in just one species (Supplementary Table
376 S7-ST27).
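The comparison of dN/dS between species with simple and complex societies reported above rests on a two-sample Wilcoxon rank-sum test. As a sketch of what the W statistic measures, assuming it is reported in its Mann-Whitney form, W is the number of (simple, complex) pairs in which the simple-society value is the larger, with ties counted as one half. The function name and the dN/dS values below are hypothetical, not taken from the paper.

```python
def rank_sum_w(x, y):
    """Mann-Whitney form of the Wilcoxon rank-sum statistic for sample x.

    Counts the (x_i, y_j) pairs with x_i > y_j; tied pairs count as 0.5.
    This is equivalent to the rank sum of x minus n_x * (n_x + 1) / 2.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


# Hypothetical per-orthogroup dN/dS values for the two groups of species.
simple_dnds = [0.30, 0.50, 0.90]
complex_dnds = [0.10, 0.20, 0.40]

w = rank_sum_w(simple_dnds, complex_dnds)
# Probability that a randomly chosen simple-society value exceeds a complex one.
prob_simple_higher = w / (len(simple_dnds) * len(complex_dnds))
```

Dividing W by the number of pairs gives an effect size that is easier to interpret than W itself, since W grows with the number of orthogroups compared.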
377 Taken together, these results provide evidence that the molecular processes
378 regulating caste differentiation vary depending on the level of social complexity. The
379 finding of more rapid rates of protein evolution in the simpler societies
380 (Supplementary Figure 3c) was unexpected since studies of gene evolution in social
381 bees have found higher rates of evolution in the more complex societies8.
382 Molecular processes shaping the evolution of sociality may differ with lineage and/or
383 life-history. Patterns of molecular evolution should be further explored with broader
384 analysis of species representing different levels of complexity within lineages, but
385 also with wider taxonomic breadth when comparing across lineages.
386
387 No evidence that the genes regulating castes vary among subfamilies
388 Using the same reciprocal SVM approach, we found little evidence that sub-family
389 influences the molecular processes regulating caste. Using the seven polistines as
390 the training set, queens and workers in the two vespines classified with 100%
391 confidence (Supplementary Table 6 [tab Vespula_Polistinae + Vespa_Polistinae]; the
392 reverse of this test could not be performed due to low sample sizes for a Vespine
393 training set). Despite the small number of vespines sampled, these results suggest
394 that the genes regulating castes are shared across these two subfamilies.
395
396 Life-history traits may influence the genes regulating caste
397 A key life-history trait that varies in social wasps is their mode of colony founding.
398 Some species form new colonies alone (‘independent founders’) whilst others form
399 new colonies as a swarm containing queens and workers (‘swarm founders’).
400 Founding behaviour is independent of level of complexity: e.g. the species with the
401 simplest (e.g. Polistes and Mischocyttarus) and the most complex (Vespa and
402 Vespula) societies are all independent founders; equally the swarm-founders
403 (Angiopolybia, Metapolybia, Agelaia, and Polybia) exhibit a spectrum of social
404 complexity (see Fig. 1). It is possible therefore, that the molecular basis of caste
405 reflects this key life-history trait, and not just social complexity. We found some
406 evidence to support this. The gene set obtained from an SVM trained to classify
407 castes in the swarm-founders was very successful in classifying castes for the
408 independent founders, with classifications between 0.7-1 for all test species.
409 Conversely, the reciprocal analysis performed poorly: the gene set obtained from an
410 SVM trained on the independent-founders was unable to predict castes at >0.7 for
411 three of the four species (Supplementary Figure 4; Supplementary Table 6
412 [Swarm/Independent]). The influence of colony founding behaviour on molecular
413 processes was further supported when comparing the rates of protein evolution in
414 the swarm-founders relative to the independent founding species (Supplementary
415 Table 7 - ST12): the rates of protein evolution are on average higher among the
416 swarm-founders (mean = 0.1461955) than among the species with complex societies but
417 lower than among the species with simple societies (Wilcoxon tests W = 384912, p-value = 1.847e-07, and
418 W = 126972, p-value = 0.0001657 respectively). These results suggest that there
419 may be additional caste-related molecular processes involved in the evolution of
420 innovations in life-history traits (in this case, mode of colony founding) that are
421 independent of social complexity.
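The W values reported above are consistent with a two-sample Wilcoxon rank-sum test, although the text does not state the exact call used. As a minimal sketch of what such a statistic measures, assuming the unpaired form of the test, W can be computed as the number of cross-group pairs in which the first group has the larger value (ties count one half). The rates below are hypothetical illustration values, not data from this study.

```python
def rank_sum_w(x, y):
    """Two-sample Wilcoxon rank-sum W (equivalently Mann-Whitney U) for x vs y.

    Counts the (xi, yj) pairs with xi > yj; tied pairs contribute 0.5.
    """
    w = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                w += 1.0
            elif xi == yj:
                w += 0.5
    return w


# Hypothetical per-gene evolutionary rates, for illustration only.
swarm_founders = [0.15, 0.12, 0.20, 0.18]
complex_societies = [0.10, 0.11, 0.16]

w = rank_sum_w(swarm_founders, complex_societies)
max_w = len(swarm_founders) * len(complex_societies)
print(f"W = {w} out of a possible {max_w}")
```

A W close to its maximum (the product of the two sample sizes) indicates that values in the first group tend to exceed those in the second; the p-value then depends on the null distribution of W, which this sketch does not compute.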
422
423
424 Discussion
425 Major transitions in evolution provide a conceptual framework for understanding the
426 emergence of biological complexity. Discerning the processes by which such
427 transitions arise provides us with critical insights into the origins and elaboration of
428 the complexity of life. In this study we explored the evidence for two key hypotheses
429 on the molecular bases of social evolution by analysing caste transcription in nine
430 species of wasps. As predicted, we find evidence of a shared genetic toolkit across
431 the spectrum of social complexity in wasps; importantly, using machine learning we
432 reveal that this toolkit likely consists of many hundreds of genes of small effect (Fig.
433 4). However, in sub-setting the data by level of social complexity and by a key life-
434 history trait (colony founding), several important new insights are revealed. Firstly,
435 there appears to be a putative toolkit for castes in the simpler societies that largely
436 persists across the major transition, through to superorganismality. Secondly,
437 different (additional) processes appear to become important at more complex levels
438 of sociality. Further sampling is required to determine the extent to which the role of
439 these additional processes is driven by the evolution of superorganismality, and
440 specific innovations that have been key in the major transition to sociality. Thirdly,
441 the so-called toolkit for sociality is also influenced by the mode of colony founding;
442 the influence of ecology and life-history is largely overlooked in the current literature
443 on the molecular basis of sociality. Finally, we discuss apparent differences among
444 lineages, highlighting the importance of wider taxonomic consideration in the quest
445 to understand the molecular basis of sociality and the nature of the major transition.
446 Taken together, we suggest that the concept of a shared molecular toolkit for
447 sociality may be too simplistic and that future studies should consider the
448 evolutionary impact of other traits on molecular processes, specifically life-history
449 and ecology, alongside castes.
450
451 The first important finding is that we identified a substantial set of genes that
452 consistently classify caste across most of the species, irrespective of the level of
453 social complexity. The taxonomic range of samples used meant we were able to
454 confirm that specific genes are consistently differentially expressed, with respect to
455 caste, across the species. These patterns would be difficult to detect if only looking
456 at a few species within a lineage, or species across several lineages, or species
457 representing only a limited selection of social complexities. In addition to typical
458 caste-biased molecular processes, we also identified that genes related to synaptic
459 vesicles are different between castes; this is interesting as the regulation of synaptic
460 vesicles affects learning and memory in insects52. To our knowledge, this is the first
461 evidence of what may be a conserved genetic toolkit for sociality, from the simplest
462 stages of social living through to true superorganismality, including intermediate
463 stages of complexity in wasps. Greater taxonomic sampling will help recover the full
464 spectrum of genes involved that are regulated across the major transition and may
465 have been important in the evolution of sociality.
466 The underlying assumption, based on the conserved toolkit hypothesis, has
467 been that whatever processes regulate castes in complex societies must also
468 regulate castes in simpler societies6,23,53. Unexpectedly, our analyses suggest that
469 different molecular processes underpinning castes come into play at the more
470 complex levels of sociality compared to species with more simple societies. The
471 predictive gene set identified in the SVM trained on more complex species
472 performed less well in classifying caste than the predictive toolkit derived from the
473 simpler species. There may be fundamental differences discriminating (near)
474 superorganismal societies from non-superorganismal societies54. This highlights the
475 importance of examining different stages in the major transition when attempting to
476 elucidate its patterns and processes8,10,33.
477 Vespinae and Polistinae display a wide diversity in traits other than social
478 complexity. One of these in the species we sampled was the mode of colony
479 foundation. Some species start new colonies alone, or with a small group of
480 reproductively potent females (independent founders) whilst others form a new
481 colony as a group of queen(s) and workers (swarm founders). Among the Polistinae,
482 mode of colony founding has a phylogenetic basis, with the Polistini and
483 Mischocyttarini tribes being independent founders, whilst the Epiponini tribe are all
484 swarm founders (see Fig. 1)55,56. Within the Polistinae, therefore, there is a tendency
485 to consider swarm founding as a trait that is associated with social complexity.
486 However, the Vespinae sampled in this study are also independent founders, and yet
487 display the most complex forms of sociality. As such, our sampling afforded the
488 opportunity to determine whether this important life-history trait influenced brain
489 transcription among social phenotypes, independently of the form of social
490 complexity. Consistent with the idea that independent founding is an ancestral trait,
491 we detected additional brain transcriptomic processes at play among castes of the
492 swarm founders that did not distinguish castes among the independent
493 founders. Clearly, wider taxonomic sampling is required to explore this further, but
494 these analyses bring into focus how the transcriptomic basis of castes is not a simple
495 reflection of social complexity; ecology and life-history deserve more attention in
496 explaining variation in the molecular bases of social phenotypes57. This is likely to be
497 especially important when seeking to compare patterns of caste transcription across
498 lineages with contrasting life-histories and ecologies.
499 Phenotypic evolution is often accompanied by increased rates of gene
500 evolution. Indeed, high rates of protein evolution have been detected in worker-
501 biased genes of more complex societies like the honeybee, and some ants16–19. This
502 makes sense as the worker caste (with altruism and partial/complete sterility) is the
503 dominant phenotypic innovation in social evolution and has led to the suggestion that
504 rapid gene evolution is expected in the most complex forms of sociality, where
505 phenotypic specialisations are most pronounced17,20. Our dataset afforded a first test
506 of this idea in the wasps. We were surprised to find that the highest rates of gene
507 evolution were detected in the wasp species with the simpler societies rather than
508 the more complex societies; these results are in stark contrast with the findings of a
509 recent study on a range of social complexity in bees8 where (in keeping with
510 previous studies), rapid gene evolution was most apparent among species with the
511 most complex societies. There are several interpretations of these contrasting
512 results. First, the level of social complexity displayed by the simplest species in Shell
513 et al 2021 represents a much more rudimentary form of sociality compared to the
514 simplest societies in our study, including four solitary species, one facultatively social
515 species (e.g. females of Ceratina australensis can nest either alone or in a small
516 group) and some barely exhibiting any reproductive division of labour (e.g. Ceratina
517 calcarata). Facultative sociality is rare in the vespid wasps, being largely limited to a
518 few species of Stenogastrinae; the simplest Polistinae are represented in our sample
519 and all are obligately social (meaning they always nest in a group). It is possible
520 therefore, that the rapid gene evolution occurs only when a species is committed to
521 group living, with a clear division of reproductive labour. An alternative explanation
522 lies in ecology and life-history: bees and wasps differ fundamentally in how they
523 provision brood, with wasps hunting prey and bees collecting pollen. These findings
524 add to the complexities of explaining the molecular basis of social evolution, and
525 highlight a greater need to emphasise and account for variation in natural history,
526 ecology and life-history as well as social behaviour per se10,24,33,54.
527 Despite being able to extract consistent SVM predictions, our models are only
528 as good as the initial data used to train them. Our study suffers from a few
529 limitations. Firstly, the sample size (number of species) is good for a sociogenomics
530 study, but relatively low for machine learning studies; SVMs are generally used on
531 very large datasets such as clinical trials in the medical sciences58. Although our
532 models did perform well, the analyses would be more robust by using more species
533 in the training datasets. Indeed, we observed reduced performance in our model
534 predictions when fewer species were included in the training set. Secondly, by
535 comparing across multiple species, we can only train our model on genes that have
536 a single representative isoform per species in each separate test. This reduces the
537 numbers of genes we can test in each SVM model, especially where more distantly
538 related species are included. We overcame this limitation by merging gene isoforms
539 within the same orthogroup (potential gene duplications); yet this comes with some
540 additional costs as some genes are discarded in this process. Finally, genomes are
541 not available for most of the species we tested; our measurements are based on de
542 novo sequenced transcriptomes, which potentially contain misassembled transcripts,
543 which could reduce the ability to find single copy orthologs across species. For these
544 reasons, the numbers of genes detected in our putative toolkit for sociality is likely to
545 be conservative and modest (potentially by several fold). These limitations are likely
546 to apply to many similar studies, due to the difficulty and expense of obtaining high
547 quality genomic data for specific phenotypes for non-model organisms. Our study
548 illustrates the power of SVMs in detecting large suites of genes with small effects,
549 which largely differ from those identified from conventional differential expression
550 analysis43. We advocate the use of the two methods in parallel: our conventional
551 analyses suggested that metabolic genes appear to be responsible for the
552 differences between castes, whereas the SVM genes were mostly enriched in neural
553 vesicle transportation genes, which have not previously been connected with caste
554 evolution. SVMs may therefore reveal new target genes involved in the evolution
555 of sociality. We anticipate that machine learning approaches, as demonstrated here,
556 may become a useful tool in a wide range of ecological and evolutionary studies on
557 the molecular basis of phenotypic diversity.
558
559 In conclusion, our analyses of brain transcriptomes for castes of social wasps
560 suggest that the molecular processes underpinning sociality are conserved
561 throughout the major transition to superorganismality. However, different (putatively
562 additional) processes may come into play in more complex societies. There may be
563 fundamental differences in the molecular machinery that discriminates key points of
564 innovation in the major transition; for example, the evolution of irreversible caste
565 commitment (in superorganisms) may require a fundamental shift in the underlying
566 regulatory molecular machinery. Such shifts may be apparent in the evolution of
567 sociality at other levels of biological organisation, such as the evolution of
568 multicellularity, taking us a step closer to determining whether there is a unified
569 process underpinning the major transitions in evolution.
570
571 Acknowledgements
572 We would like to thank Laura Butters for her help with the morphometric analyses of
573 Brachygastra, Robin Southon, Daniel Fabbro, Liam Crowley and Sam Morris for their
574 assistance in the field, James Carpenter at the American Museum of Natural History
575 and Christopher K. Starr for confirming species identification, and the Bristol
576 Genomics Facility for their assistance with library preparations and sequencing. This
577 work was conducted under collection and export permits for Trinidad (Forestry
578 Division Ministry of Agriculture: #001162) and Panama (Autoridad nacional del
579 ambiente (ANAM) SE/A-55-13). It was funded by the Natural Environment Research
580 Council (NE/M012913/2; NE/K011316/1) awarded to SS, and a NERC studentship
581 and Smithsonian Tropical Research Institute pre-doctoral fellowship awarded to
582 E.F.B.
583 Methods
584 Study Species
585 Nine species of vespid wasps were chosen to represent different levels of social
586 complexity across the major transition (Fig. 1). The simplest societies in our study
587 are represented by Mischocyttarus basimacula basimacula (Cameron) and Polistes
588 canadensis (Linnaeus); wasps in these two genera are all independent nest founders
589 and lack morphological castes (defined as allometric differences in body shape,
590 rather than overall size) or any documented form of life-time caste-role
591 commitment59–62. They live in small family groups of reproductively totipotent
592 females, one of whom usually dominates reproduction (the queen); if the queen dies
593 she is succeeded by a previously-working individual21. As such, these societies
594 represent some of the earliest stages in the major transition, where caste roles are
595 least well defined, and where individual-level plasticity is advantageous for
596 maximising inclusive fitness.
597 The Neotropical swarm-founding wasps (Hymenoptera: Vespidae; Epiponini) include
598 over 20 genera with at least 229 species, exhibiting a range of social complexity
599 measures, from complete absence of morphological caste (pre-imaginal)
600 determination to colony-stage specific morphological differentiation, through to
601 permanent morphological queen-worker differentiation56. As examples of species for
602 which there is little or no evidence of developmental (morphological) caste
603 determination, we chose Angiopolybia pallens (Lepeletier) which is phylogenetically
604 basal in the Epiponines55,63 and Metapolybia cingulata (Fabricius)55,63. We confirmed
605 the lack of clear caste allometric differences in M. cingulata as data were lacking
606 (see Morphometrics methods (below) and Supplementary Table S8).
607 As examples of species showing subtle, colony-stage-specific caste allometry, we
608 chose a species of Polybia. The social organisation of Polybia spp is highly variable,
609 ranging from complete absence of morphological queen-worker differentiation64.
610 Polybia quadricincta (Saussure) is a relatively rare (and little studied) epiponine
611 wasp which can be found across Bolivia, Brazil, Colombia, French Guiana, Guyana,
612 Peru, Suriname and Trinidad (Richards, 1978). Our morphometric analyses found
613 some evidence of subtle allometric morphological differentiation in this species, but
614 with variation through the colony cycle (Supplementary Table S8); this suggests it is a
615 representative species for the evolution of the first signs of pre-imaginal caste
616 differentiation.
617 Many species of the genera Agelaia and Brachygastra appear to show pre-imaginal
618 caste determination with allometric morphological differences between adult queens
619 and workers55. We chose one species from each of these genera as representatives
620 of the most socially complex Polistine wasps. Although no morphological data were
621 available for Agelaia cajennensis (Fabricius) all species of Agelaia studied show
622 some level of preimaginal caste determination55,65. Brachygastra exhibit a diversity of
623 caste differentiation55,66; our morphological analysis of caste differentiation in B.
624 mellifica (Say) confirms that this species is highly socially complex, with large colony
625 sizes55 and pre-imaginal caste determination resulting in allometric caste differences
626 (Supplementary Table S8).
627 All species of Vespines are independent nest founders and superorganisms, with a
628 single mated queen establishing a new colony alone and with morphological castes
629 that are determined during development. However, some species exhibit derived
630 superorganismal traits, such as multiple mating67, which have likely evolved under
631 different selection pressures to the major transition itself68. The European hornet,
632 Vespa crabro (Linnaeus), exhibits the hallmarks of superorganismality (see Fig. 1)
633 but little evidence of more derived superorganismal traits, such as high levels of
634 multiple mating. Conversely, multiple mating is common in Vespula species,
635 including V. vulgaris (Linnaeus) with larger colony sizes than Vespa67, suggesting a
636 more complex level of social organisation.
637
638 Sample collection
639 Where possible, we sampled from colonies representing different stages in the
640 colony cycle, as caste differentiation can vary as the colony matures in some species
641 (Supplementary Table S1). Metapolybia cingulata (6 colonies), Polistes canadensis (3
642 colonies), Agelaia cajennensis (1 colony) and Mischocyttarus
643 basimacula basimacula (3 colonies) were collected from wild populations in Panama
644 in June 2013. Brachygastra mellifica (4 colonies) were collected from populations in
645 Texas, USA in June 2013. Angiopolybia pallens (2 colonies) and Polybia
646 quadricincta (2 colonies) were collected from Arima Valley, Trinidad in July 2015.
647 Vespa crabro (4 colonies) and Vespula vulgaris (4 colonies) were collected from
648 various locations in South East England, UK in 2017. Queens and workers were
649 collected directly from their nests during the daytime, placed immediately into
650 RNAlater (Ambion, Invitrogen) and stored at -20°C until further use. Samples were
651 ultimately pooled within castes for all analyses, such that each pool consisted of 3-6
652 individual brains from wasps sampled across 2-4 colonies to capture individual-level
653 and colony-level variation in gene expression (see Supplementary Table S1).
654 Samples of M. cingulata, A. cajennensis, M. basimacula and B. mellifica were sent to
655 James Carpenter at the American Museum of Natural History for species verification.
656 A. pallens and P. quadricincta were identified by Christopher K. Starr, at University of
657 West Indies, Trinidad and Tobago.
658 Morphometrics
659 Data on morphological differentiation among colony members (and thus information
660 on whether pre-imaginal (developmental) caste determination was present) was
661 lacking for M. cingulata, P. quadricincta and B. mellifica; therefore, we conducted
662 morphometric analyses on these three species in order to ascertain the level of
663 social complexity. Morphometric analyses were carried out using GXCAM-1.3 and
664 GXCapture V8.0 (GT Vision) to provide images for assessing morphology. We
665 measured 7 morphological characters using ImageJ v1.49 for queens and workers
666 for each species. The body parts measured were: head length (HL), head width
667 (HW), minimum interorbital distance (MID), mesoscutum length (MSL), mesoscutum
668 width (MSW), mesosoma height (MSH) and alitrunk length (AL) (for measurement
669 details, see 69). Abdominal measurements were not recorded as ovary development
670 could alter the size of abdominal measurements, therefore biasing the results. The
671 morphological data were analysed to determine whether the phenotypic
672 classification, as determined from reproductive status, could be explained by
673 morphological differences. ANOVA was used to determine size differences between
674 castes for each morphological characteristic. A linear discriminant analysis was also
675 employed to see if combinations of characters were helpful in discriminating between
676 castes. The significance of Wilks’ lambda values was tested to determine which
677 morphological characters were the most important for caste prediction. All statistical
678 analyses were carried out using SPSS v23.0 or XLSTAT 2018. Data and analyses
679 are given in Supplementary Table S8.
680
681 Dissections & RNA extractions
682 Individual heads were stored in RNAlater for brain dissections; abdomens were
683 removed and dissected to determine reproductive status. Ovary development was
684 scored according to 33,54 and the presence/absence of sperm in the spermathecae
685 was identified to determine insemination. Inseminated females with developed
686 ovaries were scored as ‘queens’; non-inseminated females with undeveloped ovaries
687 were scored as workers. Brains were dissected directly into RNAlater; RNA was
688 extracted from individuals and then pooled after extraction into caste-specific pools;
689 pooling after RNA extraction allowed for elimination of any samples with low quality
690 RNA. Pooling individuals was generally necessary to ensure sufficient RNA for
691 analyses, as well as accounting for individual variation to ensure expression
692 differences are due to caste or species, and not dependent on colony or random
693 differences between individuals. One exception to this was the V. vulgaris samples
694 which were sequenced as individual brains and pooled bioinformatically after
695 sequencing. Individual sample sizes per species are given in Supplementary Table
696 S1.
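The caste-scoring rule described in this section can be written as a small decision function. This is a minimal sketch in Python rather than the authors' own code; the function name and the handling of intermediate individuals (returned as None) are assumptions, since the text does not say how females that fit neither definition were treated.

```python
def score_caste(inseminated, ovaries_developed):
    """Assign a caste label from dissection results.

    Queens: inseminated with developed ovaries.
    Workers: non-inseminated with undeveloped ovaries.
    Anything else is returned as None (an assumption of this sketch).
    """
    if inseminated and ovaries_developed:
        return "queen"
    if not inseminated and not ovaries_developed:
        return "worker"
    return None


print(score_caste(inseminated=True, ovaries_developed=True))
print(score_caste(inseminated=False, ovaries_developed=False))
```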
697
698 Total RNA was extracted using the RNeasy Universal Plus Mini kit (Qiagen,
699 #73404), according to the manufacturer’s instructions, with an extra freeze-thawing
700 step after homogenization to ensure complete lysis of tissue, as well as an additional
701 elution step to increase RNA concentration. RNA yield was determined using a
702 NanoDrop ND-8000 (Thermo Fisher Scientific); all samples showed A260/A280
703 values between 1.9 and 2.1. An Agilent 2100 Bioanalyser was used to determine
704 RNA integrity. Samples of sufficient quality and concentration were pooled and sent
705 for sequencing. Libraries were prepared using Illumina TruSeq RNA-seq sample
706 prep kit at the University of Bristol Genomics Facility. Five samples were pooled per
707 lane to give ~50M reads per sample. Paired-end libraries were sequenced using an
708 Illumina HiSeq 2000. Descriptions of pooling of individuals and pooled sets into
709 single representatives of caste are shown in Supplementary Table S1. Raw reads
710 are available on SRA/GEO (GSE159973).
711
712 Preparation of de novo transcriptomes
713 Transcriptomes of Agelaia, Angiopolybia, Metapolybia, Brachygastra, Polybia,
714 Polistes and Mischocyttarus were assembled using the following steps. First, reads
715 were filtered for rRNA contaminants using tools from the BBTools
716 (version:BBMap_38) software suite (https://jgi.doe.gov/data-and-tools/bbtools/). We
717 then used Trimmomatic v0.3970 to trim reads containing adapters and low-quality
718 regions. Using these filtered RNA sequences, we could assemble a de novo
719 transcriptome for each species (merging queen and worker samples) using Trinity
720 v2.871 and filter protein coding genes to retain a single transcript (most expressed)
721 for each gene and transcript per million value (TPM), which we use for the rest of the
722 analyses.
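Retaining the most expressed transcript per gene, as described above, can be sketched as follows. The sketch assumes Trinity-style transcript identifiers in which the gene identifier is the part before the final "_i" isoform suffix; the identifiers and TPM values shown are hypothetical, and this is not the authors' script.

```python
def most_expressed_isoforms(tpm_by_transcript):
    """Map each gene to its single most expressed transcript.

    tpm_by_transcript: dict of transcript id -> TPM.
    The gene id is taken as everything before the last "_i" suffix.
    """
    best = {}
    for transcript_id, tpm in tpm_by_transcript.items():
        gene_id = transcript_id.rsplit("_i", 1)[0]
        if gene_id not in best or tpm > best[gene_id][1]:
            best[gene_id] = (transcript_id, tpm)
    return {gene: transcript for gene, (transcript, _) in best.items()}


# Hypothetical identifiers and TPM values, for illustration only.
example = {
    "TRINITY_DN1_c0_g1_i1": 2.0,
    "TRINITY_DN1_c0_g1_i2": 5.5,
    "TRINITY_DN2_c0_g1_i1": 0.3,
}
print(most_expressed_isoforms(example))
```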
723 For Vespula and Vespa, reads from both queen and worker samples were
724 assembled into de novo transcriptomes using a Nextflow pipeline
725 (github:biocorecrg/transcriptome_assembly). This involved read adapter trimming
726 with Skewer72, de-novo transcriptome assembly with Trinity v2.8.471 and use of
727 TransDecoder v5.5.071 to identify likely protein-coding transcripts, and retain all
728 translated transcripts. These were further filtered to retain the largest open reading
729 frame-containing transcript, which we listed as the major isoform of each protein.
730 Trinity assembly statistics are shown in Supplementary Table S2.
731
732 Identification of orthologs
733 To identify gene-level orthologs, we used Orthofinder v.2.2.736 with diamond blast73,
734 multiple sequence alignment program MSA74 and tree inference using FastTree
735 v2.1.1075, for our nine species (Supplementary Fig. 1). The largest spliced isoform
736 per gene (from Trinity) was designated the representative sequence for each gene.
737 In addition, for specific iterations, we allowed the existence of multiple gene isoforms
738 belonging to the same species in a single orthogroup (potential duplications); using
739 up to 3 gene isoforms, only keeping the gene most highly expressed. This strategy
740 allows us to test a greater number of orthogroups, where a single duplication in one
741 species would otherwise remove an entire orthogroup from the study. The drawback
742 is that some gene isoforms are removed from the study that could have been
743 informative. Similarly, we were able to include orthogroups that had missing gene
744 information (NAs) for up to two species, filling in these missing genes with read
745 counts of 10 for both queen and worker, thereby allowing us to test orthogroups
746 where some species data are missing. Missing gene orthologs in a single species
747 could be due to a true absence (likely gene loss) in a species, or due to assembly
748 weaknesses, where the use of non-genome guided RNA-Seq assemblies from just
749 brain tissue, could lead to missing orthologs. Scripts to perform these steps are
750 included in the Github (https://github.com/Sumner-
751 lab/Multispecies_paper_ML/tree/master/Scripts_to_improve_orthogroups).
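The orthogroup rules described above (collapsing up to three copies per species to the most expressed one, and filling up to two missing species with counts of 10) can be illustrated with a short sketch. This is not the code from the linked repository: the function name, the data layout, and the choice to rank copies by summed queen and worker counts are assumptions made for illustration.

```python
PLACEHOLDER_COUNT = 10  # count used for queen and worker when a species is missing
MAX_COPIES = 3          # most copies per species that are collapsed
MAX_MISSING = 2         # most species allowed to be absent from an orthogroup


def resolve_orthogroup(members, species):
    """Reduce one orthogroup to a single (queen, worker) count pair per species.

    members: dict of species -> list of (gene_id, queen_count, worker_count).
    Returns a dict of species -> (queen_count, worker_count), or None if the
    orthogroup has too many missing species or too many copies in a species.
    """
    missing = [s for s in species if not members.get(s)]
    if len(missing) > MAX_MISSING:
        return None
    resolved = {}
    for s in species:
        copies = members.get(s, [])
        if len(copies) > MAX_COPIES:
            return None
        if not copies:
            resolved[s] = (PLACEHOLDER_COUNT, PLACEHOLDER_COUNT)
        else:
            # Assumption: "most highly expressed" means largest summed count.
            _, queen, worker = max(copies, key=lambda c: c[1] + c[2])
            resolved[s] = (queen, worker)
    return resolved


# Hypothetical orthogroup with one duplicated and one missing species.
species_list = ["A", "B", "C"]
orthogroup = {"A": [("a1", 50, 70)], "B": [("b1", 5, 5), ("b2", 40, 10)]}
print(resolve_orthogroup(orthogroup, species_list))
```

Using equal placeholder counts for both castes means a filled-in species contributes no queen-worker difference for that gene, which keeps the orthogroup testable without introducing a spurious caste signal.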
752 Orthofinder was also used to create the phylogenetic tree in Supplementary Figure
753 1, using the single-copy orthogroups only and with the addition of Apis mellifera
754 (GCF_000002195.4), Nomia melanderi (GCA_003710045.1), Dinoponera
755 quadriceps (GCA_001313825.1) and Solenopsis invicta (GCF_016802725.1).
756
757 Gene expression and differential expression
758 We calculated abundances of transcripts within queen and worker samples using
759 “align_and_estimate_abundance.pl” within Trinity, using estimation method RSEM v1.3.174,
760 “trinity_mode” and bowtie276 aligner. We then used edgeR v3.26.5 77 (R version 3.6.0) to
761 compare gene expression between queens and workers. Given that we were comparing a
762 single sequencing pool of several individuals per caste, we used a hard-coded dispersion of
763 0.1 and the robust parameter set to true to account for n = 1. Raw P values for each gene
764 were corrected for multiple testing using a false discovery rate (FDR) cut-off value of 0.05.
765 We did not take advantage of genome data (where available), as only two of the species had
766 published genomes at the time of analysis; using transcriptome-only analyses makes the
767 analysis more consistent across species. Trinity assemblies and RSEM counts are available
768 on GEO/SRA (GSE159973). To compare differential expression across the nine species
769 (Figure 3), we used the orthology list based on three splice isoforms represented as the
770 most expressed and using only orthogroups with a single entry in all nine species after the
771 merging of three isoforms.
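The multiple-testing correction mentioned in this section can be illustrated with the Benjamini-Hochberg procedure. The text states only that an FDR cut-off of 0.05 was used; that the correction is Benjamini-Hochberg is an assumption of this sketch, and the p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(fdr) if q < 0.05]
print(fdr, significant)
```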
772
773 Building a normalised gene expression matrix
774 Focused on our set of one-to-one orthologs (merging three or fewer isoforms per species),
775 we began by computing log transformed TPMs (transcripts per million reads) for
776 each gene in each sample from the raw counts, followed by quantile normalisation.
777 Next, we normalised for species, using an approach that is comparable to calculating
778 a species-specific z-score for each sample. Specifically, we transformed the
779 expression scores calculated above by subtracting the species mean and dividing by
780 the species mean for each sample within a species. This calculation has two
781 important effects. Firstly, subtracting the species mean from each sample within a
782 species centres the mean expression of each species on zero, making the units of
783 expression more comparable across species. Secondly, dividing by the species
784 mean from each sample standardises the expression scores, producing a measure
785 that is independent of the units of measurement, so that the magnitude of difference
786 between queens and workers in each sample is no longer important. The
787 transformed expression score thus allows us to focus on the relative expression in
788 queens versus workers across species. Finally, we removed orthogroups where the
789 counts per million were below 10 (with some deviations in specific iteration; listed
790 with run plots) in both Queen and Worker samples of each species, to remove lowly
791 expressed genes that may contribute noise to subsequent analyses. We then
792 performed principal component analysis (PCA) in R on the raw TPM values and
793 those with the normalisation steps outlined above (see: https://github.com/Sumner-
794 lab/Multispecies_paper_ML/blob/master/Scripts_to_run_svm/template_scripts/getEx
795 pressionAllOrthofinder.noNameReplace.R for full R code of this section).
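The normalisation steps in this section (log transformation, quantile normalisation, species scaling) can be sketched in a few lines. This is an illustrative Python version, not the R script linked above, and it rests on several assumptions: a pseudocount of 1 before taking log2, scaling applied per gene across the samples of one species, and division by the standard deviation as in a conventional z-score (the text describes the divisor as the species mean, so the linked script should be treated as authoritative).

```python
import math
import statistics


def log_tpm(tpm, pseudocount=1.0):
    """log2-transform a genes x samples TPM matrix (pseudocount is assumed)."""
    return [[math.log2(v + pseudocount) for v in row] for row in tpm]


def quantile_normalise(matrix):
    """Give every sample (column) the same distribution of values."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean across samples at each rank position.
    rank_means = [sum(col[r] for col in sorted_cols) / n_samples
                  for r in range(n_genes)]
    out = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for r, g in enumerate(order):
            out[g][s] = rank_means[r]
    return out


def species_scale(matrix, sample_species):
    """Centre and scale each gene within each species (z-score-like)."""
    out = [row[:] for row in matrix]
    for species in set(sample_species):
        idx = [i for i, sp in enumerate(sample_species) if sp == species]
        for g, row in enumerate(matrix):
            values = [row[i] for i in idx]
            centre = statistics.mean(values)
            spread = statistics.pstdev(values)
            for i in idx:
                out[g][i] = (row[i] - centre) / spread if spread > 0 else 0.0
    return out


# Hypothetical 3 genes x 2 samples (queen, worker) from a single species.
normalised = quantile_normalise([[5, 4], [2, 1], [3, 6]])
scaled = species_scale(normalised, ["sp1", "sp1"])
print(normalised, scaled)
```

With one queen and one worker pool per species, this kind of scaling reduces each gene to the direction of the caste difference within a species, which matches the stated aim of making the magnitude of the difference unimportant.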
796
797 Machine learning (support vector machines)
798 Support vector machine (SVM) classification was used to classify caste across the
799 species. In brief, this analysis involved taking species-scaled, logged and normalised
800 gene expression data from the nine species, filtering the lowly expressed genes,
801 splitting into training and testing datasets (8 training, 1 testing), creating an SVM
802 classifier that best separates caste in the training species (find best gamma and C
803 tuning variables) and testing the one species left out using that classifier. The
804 code to run these steps is shown on github (https://github.com/Sumner-
805 lab/Multispecies_paper_ML), and is executed in perl using subsidiary R scripts. The
806 full details of these steps are outlined below.
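The train/test design summarised above (train on eight species, test on the one left out, repeated for each species) is a leave-one-group-out scheme. The following minimal sketch shows only the splitting logic; the tuple layout and species names are hypothetical and the classifier itself is not included.

```python
def leave_one_species_out(samples):
    """Yield (held_out_species, train, test) for each species in turn.

    samples: list of (species, caste, features) tuples.
    """
    species = sorted({s for s, _, _ in samples})
    for held_out in species:
        train = [x for x in samples if x[0] != held_out]
        test = [x for x in samples if x[0] == held_out]
        yield held_out, train, test


# Hypothetical data: three species, one queen and one worker pool each.
samples = [
    (sp, caste, [0.0])
    for sp in ("A", "B", "C")
    for caste in ("queen", "worker")
]
folds = list(leave_one_species_out(samples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Holding out whole species, rather than individual samples, means the reported accuracy reflects how well a caste signature transfers to a species the classifier has never seen.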
The first step is to normalise the expression matrix, given that the SVM classifier
expects fully normalised and scaled data. This has been outlined in the previous
section, and involves orthologous TPMs being logged, quantile-normalised and
species-scaled, as well as filtering of lowly expressed genes. Second, data were split
into test and train sets, so that each species was tested in turn, with queen and
worker samples identified from sample names (e.g. caste <- grepl("Worker ",
colnames(data.train))) and converted into a binary format. Third, an SVM classifier
was tuned using the training data (8 of 9 species) with the R package e1071 (ref. 78),
with optional grid search ranges of gamma = 10^(-8:-3) and cost = 2^(1:10). These
ranges allow the tuning step to search for the best gamma and cost settings to use in
the model. Fourth, using the best model from tuning (always the radial model, with
variable C and gamma values), we predicted the likelihood of each sample from the
ninth species being a queen or a worker. To explore which kernel was most
appropriate, we ran the script using radial, polynomial, linear and sigmoidal kernels
(Supplementary Figure 2), which showed consistent changes in prediction certainty
among the radial, polynomial and sigmoidal kernels but more erratic predictions with
the linear kernel. We decided to continue with the radial kernel for the rest of our
predictions, as it has reasonable prediction accuracy once the dataset is filtered and
improves progressively across the filtering spectrum (Supplementary Figure 2; top
left), unlike the linear kernel (bottom left).
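The leave-one-species-out design and the tuning grid quoted above can be laid out as follows. This is a minimal Python sketch rather than the authors' Perl/R pipeline: it only builds the folds and the grid of candidate (gamma, cost) pairs and does not fit an SVM. The helper names are hypothetical, the genus labels are taken from the figures, and the label helper matches "Worker" without the trailing space used in the original grepl pattern.

```python
from itertools import product

# Genus labels as used in the figures; one entry per species in the study.
SPECIES = [
    "Mischocyttarus", "Polistes", "Metapolybia", "Angiopolybia", "Polybia",
    "Agelaia", "Brachygastra", "Vespa", "Vespula",
]


def leave_one_species_out(species):
    """Yield (training_species, test_species), holding out each species in turn."""
    for held_out in species:
        yield [s for s in species if s != held_out], held_out


def is_worker(sample_name):
    """Binary caste label from a sample name (hypothetical naming scheme)."""
    return "Worker" in sample_name


# Grid quoted in the text: gamma = 10^(-8:-3), cost = 2^(1:10).
# R's -8:-3 and 1:10 include both end points.
GAMMAS = [10.0 ** e for e in range(-8, -2)]
COSTS = [2.0 ** e for e in range(1, 11)]
GRID = list(product(GAMMAS, COSTS))

folds = list(leave_one_species_out(SPECIES))
```

Each of the nine folds would then be tuned over the 60 candidate pairs in GRID using the eight training species, with the held-out species used only for prediction.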
Given that the majority of genes are unlikely to contribute to caste differentiation,
on top of the basic SVM test we performed feature selection, which is commonly
used in machine learning to increase the power of the classifier79. For this we used
linear regression of each gene on caste (lm(caste ~ expr, data)), using the training
data only. With regression beta coefficients per orthologous gene, we could then
rank genes by their statistical association with caste (Supp. Table 4, using the
absolute values of the regression coefficients). This enabled us to measure how the
classification certainty changed as we filtered out genes statistically unassociated
with reproductive division of labour (Figure 4a).
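The regression-based ranking described above can be sketched in a few lines of Python. This is an illustration, not the authors' code: the function names, orthogroup names and toy values are hypothetical, and the slope of a simple least-squares fit stands in for the beta coefficient returned by lm(caste ~ expr, data).

```python
def ols_slope(x, y):
    """Slope of the least-squares line y ~ x (simple linear regression)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return 0.0  # a constant gene carries no information about caste
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def rank_genes(expr, caste):
    """Rank genes by |beta| from caste ~ expr, strongest association first.

    expr maps gene name -> expression values (training samples only);
    caste holds 0/1 labels in the same sample order.
    """
    betas = {gene: ols_slope(values, caste) for gene, values in expr.items()}
    return sorted(betas, key=lambda g: abs(betas[g]), reverse=True)


def keep_top_percent(ranked, percent):
    """Keep the top `percent` of a ranked gene list (at least one gene)."""
    n_keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:n_keep]


# Hypothetical toy data: 1 = queen, 0 = worker.
caste = [1, 0, 1, 0]
expr = {
    "OG_up_in_queens": [1.0, -1.0, 1.0, -1.0],
    "OG_up_in_workers": [-0.5, 0.5, -0.5, 0.5],
    "OG_unrelated": [1.0, 1.0, -1.0, -1.0],
}
ranked = rank_genes(expr, caste)
```

Repeating keep_top_percent over a descending series of percentages, and retraining the classifier on each reduced gene set, would give the progressive filtering series described in the text.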
A classification certainty of 0.5 would indicate that the SVM could not tell the
difference between the two castes (maximal uncertainty); a classification certainty of
0 would indicate maximal certainty of a worker, and 1 maximal certainty of a
queen.

dNdS scores
Using the single-copy orthologous genes for the nine species, we aligned nucleotide
sequences for each of the 1,971 orthogroups (single-copy, allowing for 3 isoforms)
that successfully went through PRANK (v.151120; ref. 80). For each species and
for each orthogroup, we then calculated dN/dS ratios (omega) with PAML codeml
(v. 4.8; ref. 81) under two kinds of model. Branch models (Supplementary Table S7
[ST2-ST16] and Supplementary Figure 3) compared a null hypothesis model
(model = 0, NSsites = 0, tree file without foreground) with an alternative hypothesis
model (model = 2, NSsites = 2, tree file with some taxa as foreground). Branch-site
models (ST17-ST26) compared a null hypothesis model (omega fixed at 1, model = 2,
NSsites = 2, tree file with one species branch as foreground) with an alternative
hypothesis model (omega not fixed, all other parameters set as in the null model).
We assessed log likelihoods using a chi-square test and reported the adjusted
p-values using R (version 3.6.3) and Benjamini and Hochberg adjustment82.
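The comparison of null and alternative models can be sketched as a likelihood-ratio test followed by Benjamini and Hochberg adjustment. The Python below is an illustration only: the function names and log-likelihood values are hypothetical, and the use of one degree of freedom is an assumption (it corresponds to one extra free omega parameter), not a detail stated in the text.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """Chi-square p-value for a likelihood-ratio test, assuming 1 degree of freedom.

    For 1 d.f. the chi-square upper tail equals erfc(sqrt(stat / 2)).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log likelihoods (null, alternative) for three orthogroups.
log_likelihoods = [(-1000.0, -995.0), (-2000.0, -1999.9), (-1500.0, -1497.0)]
raw = [lrt_pvalue(null, alt) for null, alt in log_likelihoods]
adjusted = benjamini_hochberg(raw)
```

The adjustment follows the same step-up rule as the "BH" method of R's p.adjust, so each adjusted value is at least as large as the corresponding raw p-value.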

GO enrichment and BLAST
To perform GO enrichment tests, we used the R package topGO v2.42.0 (ref. 83), with
a Bonferroni-corrected P value cut-off of <= 0.05. To assign gene ontology terms to
genes in our new species, we used our Orthofinder homology table with annotations
to Drosophila melanogaster (downloaded from Ensembl Biomart on 1.10.2019). Within
species, we calculated enrichment of each species’ genes against a background of
all the genes expressed above a mean of 1 TPM. When comparing across the
orthogroups (OG), we used Metapolybia GO annotations (derived from homology to
Drosophila), with a background of those genes that have a mean >1 TPM in all
species orthologues. GO results were similar when other species were used as the
database of gene-to-GO-term mappings.
Blast2GO v 1.4.4 (ref. 84) with default settings was used to annotate Metapolybia
genes.
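The logic of testing a gene list for over-represented GO terms against a background can be sketched with a one-tailed hypergeometric test and Bonferroni correction. The Python below is a generic illustration and does not reproduce topGO's own algorithms; the function names and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n_drawn, n_annotated, n_background):
    """P(X >= k) for a one-tailed hypergeometric enrichment test.

    n_drawn genes of interest are sampled without replacement from
    n_background genes, of which n_annotated carry the GO term;
    k is the number of annotated genes observed in the list of interest.
    """
    upper = min(n_drawn, n_annotated)
    total = comb(n_background, n_drawn)
    hits = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def bonferroni(pvalues):
    """Bonferroni-adjusted p-values (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


# Hypothetical example: 57 genes of interest from a background of 2,000,
# tested against three GO terms given as (annotated in background, annotated in list).
terms = [(40, 8), (100, 3), (15, 1)]
raw = [hypergeom_upper_tail(k, 57, annotated, 2000) for annotated, k in terms]
corrected = bonferroni(raw)
```

A term would then be reported as enriched when its corrected value falls at or below the 0.05 cut-off given in the text.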

Abbreviations
DOL = division of labour
ORF = open reading frame
MT = major transition
GO = gene ontology
SRA = sequence read archive
NCBI = National Center for Biotechnology Information
BUSCO = Benchmarking Universal Single-Copy Orthologs
SVM = support vector machine
PCA = principal component analysis
Blast = Basic local alignment search tool
nr = non-redundant

Authors’ contributions
SS conceived the study and supervised the project; SS, EL, EB and BT collected the
samples; DT, EB, BT and RB performed molecular lab work; DT & RB carried out the
morphological work; EF, MB & CW executed the bioinformatics pipelines and performed
the statistical analyses; CW & SS drafted the manuscript, with input from all authors.

References

1. Szathmáry, E. & Maynard Smith, J. The major evolutionary transitions. Nature
374, 227–232 (1995).
2. Kennedy, P. et al. Deconstructing Superorganisms and Societies to Address
Big Questions in Biology. Trends in Ecology and Evolution 32, 861–872
(2017).
3. Toth, A. L. & Robinson, G. E. Evo-devo and the evolution of social behavior.
Trends Genet. 23, 334–341 (2007).
4. Berens, A. J., Hunt, J. H. & Toth, A. L. Comparative Transcriptomics of
Convergent Evolution: Different Genes but Conserved Pathways Underlie
Caste Phenotypes across Lineages of Eusocial Insects. Mol. Biol. Evol. 32,
690–703 (2014).
5. Qiu, B. et al. Towards reconstructing the ancestral brain gene-network
regulating caste differentiation in ants. Nat. Ecol. Evol. 2, 1782 (2018).
6. Warner, M. R., Qiu, L., Holmes, M. J., Mikheyev, A. S. & Linksvayer, T. A.
Convergent eusocial evolution is based on a shared reproductive groundplan
plus lineage-specific plastic genes. Nat. Commun. 10, 1–11 (2019).
7. Patalano, S. et al. Molecular signatures of plastic phenotypes in two eusocial
insect species with simple societies. Proc. Natl. Acad. Sci. U. S. A. 112,
13970–5 (2015).
8. Shell, W. A. et al. Sociality sculpts similar patterns of molecular evolution in
two independently evolved lineages of eusocial bees. Commun. Biol. 4, 1–9
(2021).
9. West-Eberhard, M. J. Wasp societies as microcosms for the study of
development and evolution. in Natural history and evolution of paper wasps.
(eds. Turillazzi, S. & West-Eberhard, M. J.) 290–317 (Oxford University Press,
1996).
10. Toth, A. L. & Rehan, S. M. Molecular Evolution of Insect Sociality: An
Eco-Evo-Devo Perspective. Annual Review of Entomology 62, 419–442 (Annual
Reviews, 2017).
11. Boomsma, J. J. Lifetime monogamy and the evolution of eusociality. Philos.
Trans. R. Soc. Lond. B. Biol. Sci. 364, 3191–207 (2009).
12. Boomsma, J. J. & Gawne, R. Superorganismality and caste differentiation as
points of no return: how the major evolutionary transitions were lost in
translation. Biol. Rev. (2017). doi:10.1111/brv.12330
13. Ferreira, P. G. et al. Transcriptome analyses of primitively eusocial wasps
reveal novel insights into the evolution of sociality and the origin of alternative
phenotypes. Genome Biol. 14, R20 (2013).
14. Sumner, S. The importance of genomic novelty in social evolution. Mol. Ecol.
23, (2014).
15. Feldmeyer, B., Elsner, D. & Foitzik, S. Gene expression patterns associated
with caste and reproductive status in ants: Worker-specific genes are more
derived than queen-specific ones. Mol. Ecol. 23, 151–161 (2014).
16. Johnson, B. R. & Tsutsui, N. D. Taxonomically restricted genes are associated
with the evolution of sociality in the honey bee. BMC Genomics 12, 164
(2011).
17. Simola, D. F. et al. Social insect genomes exhibit dramatic evolution in gene
composition and regulation while preserving regulatory features linked to
sociality. Genome Res. 23, 1235–1247 (2013).
18. Harpur, B. A. et al. Population genomics of the honey bee reveals strong
signatures of positive selection on worker traits. Proc. Natl. Acad. Sci. U. S. A.
111, 2614–9 (2014).
19. Rubin, B. E. R., Jones, B. M., Hunt, B. G. & Kocher, S. D. Rate variation in the
evolution of non-coding DNA associated with social evolution in bees. Philos.
Trans. R. Soc. B Biol. Sci. 374, (2019).
20. Kapheim, K. M. Genomic sources of phenotypic novelty in the evolution of
eusociality in insects. Curr. Opin. Insect Sci. 1–9 (2015).
doi:10.1016/j.cois.2015.10.009
21. Dogantzis, K. A. et al. Insects with similar social complexity show convergent
patterns of adaptive molecular evolution. 1–8 (2018).
doi:10.1038/s41598-018-28489-5
22. Taylor, B. A., Reuter, M. & Sumner, S. Patterns of reproductive differentiation
and reproductive plasticity in the major evolutionary transition to
superorganismality. Curr. Opin. Insect Sci. (2019).
23. Toth, A. L. & Rehan, S. Climbing the social ladder: The molecular evolution
of sociality. (2015).
24. Branstetter, M. et al. Genomes of the Hymenoptera. Curr. Opin. Insect Sci. 25,
65–75 (2017).
25. Standage, D. S. et al. Genome, transcriptome and methylome sequencing of a
primitively eusocial wasp reveal a greatly reduced DNA methylation system in
a social insect. Mol. Ecol. 25, 1769–1784 (2016).
26. Toth, A. L. et al. Shared genes related to aggression, rather than chemical
communication, are associated with reproductive dominance in paper wasps
(Polistes metricus). BMC Genomics 15, 75 (2014).
27. Bluher, S. E., Miller, S. E. & Sheehan, M. J. Fine-scale population structure but
limited genetic differentiation in a cooperatively breeding paper wasp. Genome
Biol. Evol. (2020).
28. Rehan, S. M. et al. Conserved Genes Underlie Phenotypic Plasticity in an
Incipiently Social Bee. Genome Biol. Evol. 10, 2749–2758 (2018).
29. Kocher, S. D. et al. The genetic basis of a social polymorphism in halictid
bees. Nat. Commun. 9, (2018).
30. Shell, W. A. & Rehan, S. M. Social modularity: Conserved genes and
regulatory elements underlie caste-antecedent behavioural states in an
incipiently social bee. Proc. R. Soc. B Biol. Sci. (2019).
doi:10.1098/rspb.2019.1815
31. Kapheim, K. M. et al. Developmental plasticity shapes social traits and
selection in a facultatively eusocial bee. Proc. Natl. Acad. Sci. 202000344
(2020). doi:10.1073/pnas.2000344117
32. Linksvayer, T. A. & Johnson, B. R. Re-thinking the social ladder approach for
elucidating the evolution and molecular basis of insect societies. Curr. Opin.
Insect Sci. (2019). doi:10.1016/j.cois.2019.07.003
33. Taylor, D., Bentley, M. A. & Sumner, S. Social wasps as models to study the
major evolutionary transition to superorganismality. Curr. Opin. Insect Sci. 28,
26–32 (2018).
34. Branstetter, M. G. et al. Phylogenomic Insights into the Evolution of Stinging
Wasps and the Origins of Ants and Bees. Curr. Biol. 27, 1019–1025 (2017).
35. Ferreira, P. G. et al. Transcriptome analyses of primitively eusocial wasps
reveal novel insights into the evolution of sociality and the origin of alternative
phenotypes. Genome Biol. 14, R20 (2013).
36. Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole
genome comparisons dramatically improves orthogroup inference accuracy.
Genome Biol. (2015). doi:10.1186/s13059-015-0721-2
37. Oi, C. A. et al. Effects of juvenile hormone in fertility and fertility-signaling in
workers of the common wasp Vespula vulgaris. PLoS One 16, 1–13 (2021).
38. Kelstrup, H., Hartfelder, K., Nascimento, F. S. & Riddiford, L. M. The role of
juvenile hormone in dominance behavior, reproduction and cuticular pheromone
signaling in the caste-flexible epiponine wasp, Synoeca surinama. Front. Zool.
11, (2014).
39. De Smet, F. et al. Balancing false positives and false negatives for the
detection of differential expression in malignancies. Br. J. Cancer (2004).
doi:10.1038/sj.bjc.6602140
40. Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data.
Genome Biol. (2016). doi:10.1186/s13059-016-0888-1
41. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E.
ScPred: Accurate supervised method for cell-type classification from single-cell
RNA-seq data. Genome Biol. (2019). doi:10.1186/s13059-019-1862-5
42. Furey, T. S. et al. Support vector machine classification and validation of
cancer tissue samples using microarray expression data. Bioinformatics
(2000). doi:10.1093/bioinformatics/16.10.906
43. Taylor, B. A., Cini, A., Wyatt, C. D. R., Reuter, M. & Sumner, S. The molecular
basis of socially mediated phenotypic plasticity in a eusocial paper wasp. Nat.
Commun. 2020.07.15.203943 (2020). doi:10.1101/2020.07.15.203943
44. Vizan, P., Di Croce, L. & Aranda, S. Functional and pathological roles of
AHCY. Front. Cell Dev. Biol. 9, 654344 (2021).
45. Hoffmann, K., Gowin, J., Hartfelder, K. & Korb, J. The scent of royalty: A P450
gene signals reproductive status in a social insect. Mol. Biol. Evol. 31,
2689–2696 (2014).
46. Sheeja, C. C., Thushara, V. V. & Divya, L. Caste-Specific Expression of
Na+/K+-ATPase in the Asian Weaver Ant, Oecophylla smaragdina (Fabricius,
1775). Neotrop. Entomol. (2018). doi:10.1007/s13744-018-0598-3
47. Iovinella, I. et al. Antennal protein profile in honeybees: Caste and task matter
more than age. Front. Physiol. (2018). doi:10.3389/fphys.2018.00748
48. Sumner, S. The importance of genomic novelty in social evolution. Mol. Ecol.
23, (2014).
49. Simões, J. M. The social brain: how social stimuli are translated into
neuroendocrine signals. (University of Lisboa, 2014).
50. Mariano, D. O. C. et al. Bottom-up proteomic analysis of polypeptide venom
components of the giant ant Dinoponera quadriceps. Toxins (Basel). (2019).
doi:10.3390/toxins11080448
51. Sasa, C., Miyazaki, S., Higashi, S. & Miura, T. Gene expressions for the
sexually-dimorphic antennae in a ponerine ant. in (IUSSI).
52. Yanay, C., Morpurgo, N. & Linial, M. Evolution of insect proteomes: Insights
into synapse organization and synaptic vesicle life cycle. Genome Biol. (2008).
doi:10.1186/gb-2008-9-2-r27
53. Rehan, S. M., Berens, A. J. & Toth, A. L. At the brink of eusociality:
transcriptomic correlates of worker behaviour in a small carpenter bee. BMC
Evol. Biol. 14, 260 (2014).
54. Kronauer, D. J. & Libbrecht, R. Back to the roots: the importance of using
simple insect societies to understand the molecular basis of complex social
life. Curr. Opin. Insect Sci. 28, 33–39 (2018).
55. Noll, F. B., Wenzel, J. W. & Zucchi, R. Evolution of Caste in Neotropical
Swarm-Founding Wasps (Hymenoptera: Vespidae; Epiponini). Am. Museum
Novit. 3467, 1–24 (2004).
56. Richards, O. W. The social wasps of the Americas. (1978).
57. Sumner, S., Bell, E. & Taylor, D. A molecular concept of caste in insect
societies. Curr. Opin. Insect Sci. 42–50 (2017). doi:10.1016/j.cois.2017.11.010
58. Kohannim, O. et al. Boosting power for clinical trials using classifiers based on
multiple biomarkers. Neurobiol. Aging (2010).
doi:10.1016/j.neurobiolaging.2010.04.022
59. Montagna, T. S. & Antonialli, W. F. Morphological differences between
reproductive and non-reproductive females in the social wasp Mischocyttarus
consimilis Zikán (Hymenoptera: Vespidae). Sociobiology 63, 693–698 (2016).
60. Jeanne, R. L. Social Biology of the neotropical wasp, Mischocyttarus drewseni.
Bull. Museum Comp. Zool. 144, 63–150 (1972).
61. Murakami, A. S. N., Shima, S. N. & Desuó, I. C. More than one inseminated
female in colonies of the independent-founding wasp Mischocyttarus
cassununga von Ihering (Hymenoptera, Vespidae). Rev. Bras. Entomol. 53,
653–662 (2009).
62. Reeve, H. K. Polistes. in The Social Biology of Wasps (eds. Ross, K. G. &
Matthews, R. W.) 99–148 (Cornell University Press, 1991).
63. Gelin, L. F. F. et al. Morphological Caste Studies In The Neotropical
Swarm-Founding Polistinae Wasp Angiopolybia pallens (Lepeletier)
(Hymenoptera: Vespidae). Neotrop. Entomol. 37, 691–701 (2008).
64. West-Eberhard, M. J. Temporary Queens in Metapolybia Wasps:
Nonreproductive Helpers Without Altruism? Science 200, 441–443
(1978).
65. Sakagami, S. F., Zucchi, R., Yamane, S., Noll, F. B. & Camargo, J. M. P.
Morphological caste differences in Agelaia vicina, the neotropical
swarm-founding polistine wasp with the largest colony size among social wasps
(Hymenoptera: Vespidae). Sociobiology 28, 207–223 (1996).
66. V., B. M., Noll, F. B. & Zucchi, R. Morphological Caste Differences and
Non-Sterility of Workers in Brachygastra augusti (Hymenoptera, Vespidae,
Epiponini), a Neotropical Swarm-Founding Wasp. J. New York
Entomol. Soc. 111, 242–252 (2003).
67. Loope, K. J., Chien, C. & Juhl, M. Colony size is linked to paternity frequency
and paternity skew in yellowjacket wasps and hornets. BMC Evol. Biol. 14,
1–12 (2014).
68. Hastings, M. D., Queller, D. C., Eischen, F. & Strassmann, J. E. Kin selection,
relatedness, and worker control of reproduction in a large-colony epiponine
wasp, Brachygastra mellifica. Behav. Ecol. 9, 573–581 (1998).
69. Gobbi, N., Noll, F. B. & Penna, M. A. H. ‘Winter’ aggregations, colony cycle,
and seasonal phenotypic change in the paper wasp Polistes versicolor in
subtropical Brazil. Naturwissenschaften (2006).
doi:10.1007/s00114-006-0140-z
70. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for
Illumina sequence data. Bioinformatics (2014).
doi:10.1093/bioinformatics/btu170
71. Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq
using the Trinity platform for reference generation and analysis. Nat. Protoc. 8,
1494 (2013).
72. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows.
Nature Biotechnology (2017). doi:10.1038/nbt.3820
73. Buchfink, B., Reuter, K. & Drost, H. G. Sensitive protein alignments at
tree-of-life scale using DIAMOND. Nat. Methods (2021).
doi:10.1038/s41592-021-01101-x
74. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using
DIAMOND. Nat. Methods 12, 59 (2014).
75. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 - Approximately
maximum-likelihood trees for large alignments. PLoS One (2010).
doi:10.1371/journal.pone.0009490
76. Li, B. & Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq
data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
77. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2.
Nat. Methods (2012). doi:10.1038/nmeth.1923
78. Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D. & Weingessel, A. e1071: Misc
Functions of the Department of Statistics (e1071), TU Wien. R package
version (2011).
79. Karagiannopoulos, M., Anyfantis, D., Kotsiantis, S. B. & Pintelas, P. E. Feature
Selection for Regression Problems. 8th Hell. Eur. Res. Comput. Math. its Appl.
HERCMA 2007 (2007). doi:10.1109/ICDM.2014.63
80. Löytynoja, A. Phylogeny-aware alignment with PRANK. Methods Mol. Biol.
(2014). doi:10.1007/978-1-62703-646-7_10
81. Yang, Z. Paml: A program package for phylogenetic analysis by maximum
likelihood. Bioinformatics (1997). doi:10.1093/bioinformatics/13.5.555
82. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical
and powerful approach to multiple testing. J. R. Stat. Soc. 57, 289–300 (1995).
83. Alexa, A. & Rahnenführer, J. Gene set enrichment analysis with topGO.
Bioconductor Improv. (2007).
84. Götz, S. et al. High-throughput functional annotation and data mining with the
Blast2GO suite. Nucleic Acids Res. 36, 3420–35 (2008).
85. Piekarski, P. K., Carpenter, J. M., Lemmon, A. R., Lemmon, E. M. &
Sharanowski, B. J. Phylogenomic evidence overturns current conceptions of
social evolution in wasps (Vespidae). Mol. Biol. Evol. 35, 2097–2109 (2018).
86. O’Donnell, S. Reproductive caste determination in eusocial wasps
(Hymenoptera: Vespidae). Annu. Rev. Entomol. 43, 323–46 (1998).
87. Matsuura, M. & Yamane, S. Biology of the vespine wasps. (Springer Verlag,
1990).
88. Menezes, R. S. T., Lloyd, M. W. & Brady, S. G. Phylogenomics indicates
Amazonia as the major source of Neotropical swarm-founding social wasp
diversity. Proc. R. Soc. B (2020).

Main Figures

Fig. 1 | Social wasps as a model group. The nine species of social wasps used in
this study, and their characteristics of social complexity. The Polistinae and Vespinae
are two subfamilies comprising 1100+ and 67 species of social wasp respectively, all
of which share the same common non-social ancestor, a eumenid-like solitary
wasp85. The Polistinae are an especially useful subfamily for studying the process of
the major transition as they include species that exhibit simple group living
in small groups (<10 individuals) of totipotent relatives, as well as species
with varying degrees of more complex forms of sociality, with different colony sizes,
levels of caste commitment and reproductive totipotency86. The Vespinae include the
yellow-jackets and hornets, and are all superorganismal, meaning caste is
determined during development in caste-specific brood cells; they also show
species-level variation in complexity, in terms of colony size and other
superorganismal traits (e.g. multiple mating, worker policing)87. Ranked in order of
increasing levels of social complexity, from simple to more complex, these species
are: Mischocyttarus basimacula basimacula, Polistes canadensis, Metapolybia
cingulata, Angiopolybia pallens, Polybia quadricincta, Agelaia cajennensis,
Brachygastra mellifica, Vespa crabro and Vespula vulgaris (see Supplementary
Methods for further details of species choice). Where evidence of morphological
castes was not available from the literature, we conducted
morphometric analyses of representative queens and workers from several colonies
per species (see Supplementary Methods; Supplementary Table S8). Image credits:
M. basimacula (Stephen Cresswell); A. cajennensis (Gionorossi; Creative Commons);
V. vulgaris (Donald Hobern; Creative Commons); V. crabro (Patrick Kennedy);
P. canadensis, M. cingulata, A. pallens and P. quadricincta (Seirian Sumner);
B. mellifica (Amante Darmanin; Creative Commons).

[Fig. 2 image. Panels: a, Pre normalisation; b, Post normalisation. Axis labels as
extracted: PC1 (28.9%) and PC1 (36.85%); PC2 (16.02%) and PC2 (15.64%). Points are
labelled by genus; legend: Queen upregulated, Worker upregulated.]
Fig. 2 | Principal component analyses of orthologous gene expression before
and after between-species normalisation. a) Principal component analysis
performed using log2 transcript per million (TPM) gene expression values. This
analysis used single-copy orthologs (identified using Orthofinder), allowing up to
three gene isoforms in a single species to be present, whereby we took the most
highly expressed to represent the orthogroup, and filtering out orthogroups with
expression below 10 counts per million. b) Principal component analysis of the
species-normalised and scaled TPM gene expression values using the same filters as
(a). Caste denoted by purple (queen) or blue (worker). Each species has a different
shape/symbol.

Fig. 3 | Overlap of differential caste-biased genes (queen vs worker) and their
functions. a) Heatmap showing the differential genes that are caste-biased in at
least four species (identified using edgeR) using the orthologous genes present in
the nine species. In cases where three isoforms exist for a single species
orthogroup, only the isoform with greatest expression was considered, and up to 2
species could have missing gene data for each orthogroup. Listed for each species
(in brackets) is the total number of orthologous differentially expressed genes per
species. Metapolybia Blast hits are listed along with orthogroup name. b) Gene
ontology histogram of overrepresented terms of genes found differentially expressed
in at least four out of the nine species (n=57 genes; in either queen or worker [not
both]), using a background of all genes tested across the nine species. P values are
single-tailed and Bonferroni corrected. c) Gene ontology for orthogroups with at least
two species with caste-biased gene expression in the same direction (n=353 genes).

Fig. 4 | A genetic toolkit for social behaviours across eusocial wasps. a)
Change in certainty of correct caste classifications through progressive feature
selection using Support Vector Machine (SVM) approaches. Models were trained on
eight species and tested on the ninth species. Features (i.e. orthologous genes)
were sorted by linear regression with regard to caste identity within the 8 training
species, beginning at 99%, where almost all genes were used for the predictions of
caste, and ending at 1%, where only the top one percent of genes from the linear
regression (sorted by P value) were used to train the model. ‘1’ equates to high queen
classification certainty. This test was repeated using various filters, showing: LEFT:
use of only 1-to-1 orthogroups, where no gene isoforms or missing values were
accepted into the model; MIDDLE: allowing three gene isoforms (using the most
highly expressed as the representative); and RIGHT: allowing three gene isoforms
and missing values (NAs; up to 2 species with no ortholog), filling in with non-caste-
biased gene values for queen and worker. Horizontal guide markers are placed at
0.4, 0.5 and 0.6, indicating the confidence score for classifying queens,
and the vertical dotted line marks the percentage at which the regression P value
becomes < 0.05. b) Histograms indicate the Bonferroni-corrected enriched gene
ontology (GO) terms for the top orthologous genes from the corresponding model
above (linear regression P value < 0.05; shown as dotted line in upper plots), tested
against a background of all genes used in the corresponding SVM model (i.e. with a
single gene representative for all species in the test). P values are single-tailed. c)
Heatmap of the top 50 genes after sorting by regression, in the SVM with 3 merged
isoforms and 2 NAs, showing species-normalised gene expression levels in the nine
species for queen (Q) and worker (W) samples (where red indicates down-regulation
and blue up-regulation; Row Z-score), orthogroup name and top Metapolybia BLAST hit.

Fig. 5 | Testing for the presence of a distinct ‘simple society toolkit’ and a
‘complex society toolkit’. a) Using either the four species with the simpler
societies or the four with the more complex societies, we trained SVM models, using
progressive filtering of genes (based on the linear regression); we then tested how
well these models classified caste for the opposite set of species. UPPER panel:
training set was the four simpler species, and the test set was the four more complex
species. LOWER panel: training set was the four more complex species, and the test
set was the four simpler species. Likelihood of being a queen from 0 to 1 is plotted
for each species across the progressively filtered sets. The total number of
orthologous genes used in each SVM model is shown (bottom left of each panel);
the total number of these genes that were significant predictors of caste (p value <
0.05) is listed at the dotted line. For each test (pair of Queen/Worker in each
species), the SVM model was run using genes with only 1 homologous gene copy
per species. b) Overlap of significant genes among the three putative toolkits,
comparing those genes for ‘simpler species’ (Fig 5a UPPER panel), ‘more complex
species’ (Fig 5a LOWER panel), and ‘full spectrum’ (Fig 4a, LEFT) toolkits. For each
experiment, the number of genes (orthogroups) tested and the total number that
were significant after linear regression are given outside of the Venn diagram. Inside
the Venn diagram are the numbers of these significant genes shared (or not)
between the three groups. Significant overlap is shown using hypergeometric tests
(one-tailed). The colours represent the genes that predict caste in: the more complex
societies (blue), the simpler societies (grey) and across the full spectrum (pink).
There is significant overlap between the simpler and the more complex genes with
the full spectrum set, but not between the simpler and complex societies, suggesting
that there may be distinct differences in the genes regulating castes at different
levels of social complexity. c) Enriched gene ontology terms (TopGO) using a
background of all genes tested in each individual experiment, using uncorrected
single-tailed p values.

Supplementary Figures

Supp. Figure 1 | Phylogenetic tree of the social wasps used in this study and
related hymenopterans, generated using Orthofinder (SpeciesTree_rooted.txt).
Drosophila was chosen as the root of the species tree (not shown), with two
representative ants (Dinoponera quadriceps and Solenopsis invicta) and two bees
(Apis mellifera and Nomia melanderi). Colours show groupings of ants, bees and
wasps (Vespinae or Polistinae). For the nine wasp species (this study) we have
sequenced adult caste-specific brain transcriptomic data (queen and worker). As
expected: Vespinae are clearly separated from the Polistinae; Angiopolybia is basal
to the other Epiponini; independent-founding, non-superorganismal wasps (Polistes
and Mischocyttarus) are basal to the Polistinae85,88.

[Supp. Figure 2 image. Four panels: Radial, Polynomial, Linear, Sigmoidal. Axes:
Classification confidence values (0.0 to 1.0) against Percentage kept after filtering
(99 to 2). One line per queen sample: V_cra, V_vul, A_caj, P_can, A_pal, M_cin,
M_bas, P_qua, B_mel.]
Supp. Figure 2 | SVM kernel type iterations. For each of the four kernel types, a
filtered SVM plot shows the leave-one-out experiment. Each coloured line represents
the prediction for the queen sample at successive regression filters, from 99% data
inclusion down to 1% inclusion (only the top genes in the regression analysis).
Species are abbreviated as the genus initial (capital letter) followed by the first
three lower-case letters of the species name.
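For readers unfamiliar with the four kernel types compared here, their standard textbook definitions can be sketched in plain Python. The parameter names `gamma`, `coef0` and `degree` follow common SVM library conventions, and the default values below are illustrative assumptions; the settings used in this study are not stated in this caption.

```python
from math import exp, tanh
from typing import Sequence

Vector = Sequence[float]


def _dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def linear(u: Vector, v: Vector) -> float:
    """K(u, v) = u . v"""
    return _dot(u, v)


def polynomial(u: Vector, v: Vector, gamma: float = 1.0,
               coef0: float = 0.0, degree: int = 3) -> float:
    """K(u, v) = (gamma * u . v + coef0) ** degree"""
    return (gamma * _dot(u, v) + coef0) ** degree


def radial(u: Vector, v: Vector, gamma: float = 1.0) -> float:
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * squared_distance)


def sigmoidal(u: Vector, v: Vector, gamma: float = 1.0,
              coef0: float = 0.0) -> float:
    """K(u, v) = tanh(gamma * u . v + coef0)"""
    return tanh(gamma * _dot(u, v) + coef0)


# Toy expression vectors (hypothetical values, for illustration only).
sample_a = [0.5, 1.0, -0.2]
sample_b = [0.4, 0.8, 0.1]
for kernel in (linear, polynomial, radial, sigmoidal):
    print(kernel.__name__, round(kernel(sample_a, sample_b), 4))
```

Each function measures the similarity of two samples' expression profiles; the SVM differs between panels only in which of these similarity measures it uses.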
Supp. Figure 3 | dN/dS. a) All orthogroups that had at least one gene significantly
under selection are listed (branch-site models). b) Overlap between positively
selected genes and various toolkit gene lists. Top: overlap with the differential
gene list. Middle: overlap with the SVM gene list (<0.05) for singleton orthogroups.
Bottom: overlap with the three gene isoform merge and 2 NA gene list. The number of
overlapping genes is shown in the middle, and each circle shows the total number of
genes compared for each list (given that the two backgrounds do not overlap). A
mutual background is also listed, giving the total number of genes that could have
been significant in both of the compared lists.
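The mutual-background comparison described in panel b can be sketched with set operations. This is one assumed reading of the caption, not the authors' pipeline: the function name and the toy orthogroup identifiers are hypothetical, and restricting each significant list to the mutual background is an assumption.

```python
from typing import Dict, Iterable


def overlap_in_mutual_background(
    tested_a: Iterable[str],
    significant_a: Iterable[str],
    tested_b: Iterable[str],
    significant_b: Iterable[str],
) -> Dict[str, int]:
    """Count the overlap of two significant gene lists within the genes tested in both."""
    # Only genes tested in both analyses could have been significant in both.
    mutual = set(tested_a) & set(tested_b)
    sig_a = set(significant_a) & mutual
    sig_b = set(significant_b) & mutual
    return {
        "mutual_background": len(mutual),
        "list_a": len(sig_a),
        "list_b": len(sig_b),
        "overlap": len(sig_a & sig_b),
    }


# Toy orthogroup identifiers (hypothetical, for illustration only).
counts = overlap_in_mutual_background(
    tested_a={"OG1", "OG2", "OG3", "OG4"},
    significant_a={"OG1", "OG2"},
    tested_b={"OG2", "OG3", "OG4", "OG5"},
    significant_b={"OG2", "OG5"},
)
print(counts)
```

The four counts returned are the quantities a one-tailed overlap test would need as input.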
[Supp. Figure 4 plots: panel a, "Training with Swarm founding species ONLY" (species listed to the left: Angiopolybia, Metapolybia, Agelaia, Polybia; plots: Mischocyttarus, Polistes, Vespa and Vespula queens); panel b, "Training with Independent founding species ONLY" (species listed to the left: Vespula, Vespa, Mischocyttarus, Polistes; plots: Agelaia, Angiopolybia, Metapolybia and Polybia queens); panel c, "Training with Polistinae species ONLY" (species listed to the left: Mischocyttarus, Polistes, Angiopolybia, Metapolybia, Agelaia, Polybia; plots: Vespa and Vespula queens). In every plot the x-axis is "Percentage kept after filtering" (99 to 2) and the y-axis is "Classification confidence values" (0.0 to 1.0).]
Supp. Figure 4 | SVM predictions using founding behaviour and phylogeny to
subset the training data. SVM predictions are given for each single species,
after training on the species listed to the left, showing the prediction after
progressive feature selection from 95 to 1% of genes remaining. Numbers of
genes in each test are indicated in the lower left part of each plot. a) Training
with swarm-founders only, tested on the four independent-founders (see Figure 1 for
species). b) Training with independent-founders only, tested on the four
swarm-founders. c) Training with Polistine species only, tested on the Vespines
(Vespa and Vespula). The reverse was not possible, as the SVM requires a minimum
number of samples to work.
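The way the training data are subset by founding behaviour or phylogeny can be sketched as a simple partition of species by trait. The trait assignments below are read from the panel titles of this figure; the data structure and the function name `split_by_trait` are hypothetical, and the SVM fitting itself is omitted.

```python
from typing import Dict, List, Tuple

# Trait table for the eight genera shown in the figure panels.
SPECIES_TRAITS: Dict[str, Dict[str, str]] = {
    "Angiopolybia": {"founding": "swarm", "subfamily": "Polistinae"},
    "Metapolybia": {"founding": "swarm", "subfamily": "Polistinae"},
    "Agelaia": {"founding": "swarm", "subfamily": "Polistinae"},
    "Polybia": {"founding": "swarm", "subfamily": "Polistinae"},
    "Mischocyttarus": {"founding": "independent", "subfamily": "Polistinae"},
    "Polistes": {"founding": "independent", "subfamily": "Polistinae"},
    "Vespa": {"founding": "independent", "subfamily": "Vespinae"},
    "Vespula": {"founding": "independent", "subfamily": "Vespinae"},
}


def split_by_trait(
    traits: Dict[str, Dict[str, str]], key: str, train_value: str
) -> Tuple[List[str], List[str]]:
    """Return (training species, test species) for one trait value.

    Species whose trait `key` equals `train_value` are used for training;
    all remaining species are held out for testing.
    """
    train = sorted(s for s, t in traits.items() if t[key] == train_value)
    test = sorted(s for s, t in traits.items() if t[key] != train_value)
    return train, test


# Print one split for each training subset named in the panel titles.
for key, value in [("founding", "swarm"),
                   ("founding", "independent"),
                   ("subfamily", "Polistinae")]:
    train, test = split_by_trait(SPECIES_TRAITS, key, value)
    print(f"train on {value}: {train} -> test on {test}")
```

Training on Vespinae alone would leave only two training species, which matches the caption's note that the reverse comparison was not possible.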
Supplementary Files
This is a list of supplementary files associated with this preprint.
Supp.Table.S1RNAseqsamples.xlsx
Supp.Table.S2.OrthologyplusMerge.xlsx
Supp.Table.S3.OrthogroupDEGs.v2.xlsx
Supp.Table.S4.NormalisedSVMandresult.xlsx
Supp.Table.S5.SVMSuperNotSuper.xlsx
Supp.Table.S6.SVMNestTypePolistinae.xlsx
Supp.Table.S7.DnDs.xlsx
Supp.Table.S8Morphometricdataresults.xlsx