The 4-Celled Tetrabaena socialis Nuclear Genome Reveals the Essential Components for Genetic Control of Cell Number at the Origin of Multicellularity in the Volvocine Lineage

Item Type Article

Authors Featherston, Jonathan; Arakaki, Yoko; Hanschen, Erik R; Ferris, Patrick J; Michod, Richard E; Olson, Bradley J S C; Nozaki, Hisayoshi; Durand, Pierre M

Citation Jonathan Featherston, Yoko Arakaki, Erik R Hanschen, Patrick J Ferris, Richard E Michod, Bradley J S C Olson, Hisayoshi Nozaki, Pierre M Durand; The 4-Celled Tetrabaena socialis Nuclear Genome Reveals the Essential Components for Genetic Control of Cell Number at the Origin of Multicellularity in the Volvocine Lineage, Molecular Biology and Evolution, Volume 35, Issue 4, 1 April 2018, Pages 855–870, https://doi.org/10.1093/molbev/ msx332

DOI 10.1093/molbev/msx332

Publisher OXFORD UNIV PRESS

Journal MOLECULAR BIOLOGY AND EVOLUTION

Rights © The Author(s) 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved.

Download date 08/10/2021 09:07:22

Item License http://rightsstatements.org/vocab/InC/1.0/ Version Final accepted manuscript

Link to Item http://hdl.handle.net/10150/628254 Jonathan Featherston*1,2, Yoko Arakaki3, Erik R. Hanschen4, Patrick J. Ferris4, Richard E. Michod4, Bradley J.S.C. Olson5, Hisayoshi Nozaki3, Pierre M. Durand1

Email addresses: Jonathan Featherston [email protected] Yoko Arakaki [email protected] Erik R. Hanschen [email protected] Patrick J. Ferris [email protected] Richard E. Michod [email protected] Bradley J. S. C. Olson [email protected] Hisayoshi Nozaki [email protected] Pierre M. Durand [email protected]

Affiliations: 1 Evolutionary Studies Institute, University of the Witwatersrand, South Africa. 2Agricultural Research Council, Biotechnology Platform, Pretoria 0040, South Africa 3Department of Biological Sciences, Graduate School of Science, University of Tokyo, Hongo, Bunkyo-ku, Tokyo, Japan. 4 Department of Ecology and Evolutionary Biology, University of Arizona, United States of America. 5 Division of Biology, Kansas State University, United States of America. #Correspondence to: [email protected]

The 4-celled Tetrabaena socialis nuclear genome reveals the essential components for genetic control of cell number at the origin of multicellularity in the volvocine lineage.

Abstract Multicellularity is the premier example of a major evolutionary transition in individuality and was a foundational event in the evolution of macroscopic biodiversity. The volvocine chlorophyte lineage is well suited for studying this process. Extant members span unicellular, simple colonial, and obligate multicellular taxa with germ-soma differentiation. Here, we report the nuclear genome sequence of one the simplest and most deeply rooted organisms in the lineage – the 4-celled colonial Tetrabaena socialis and compare this to the three other complete volvocine nuclear genomes. Using conservative estimates of gene family expansions a minimal set of expanded gene families was identified that associate with the origin of multicellularity. These families are rich in genes related to developmental processes. Multiple points of evidence associate modifications to the ubiquitin proteasomal pathway (UPP) with the beginning of coloniality. Genes undergoing positive or accelerating selection in the multicellular volvocines were found to be enriched in components of the UPP and gene families gained at the origin of multicellularity include components of the UPP. A defining feature of colonial / multicellular life- cycles is the genetic control of cell number. The genomic data presented here, which includes diversification of cell cycle genes and modifications to the UPP, align the genetic components with the evolution of this trait.

Introduction The evolution of multicellular organisms from unicellular ancestors is an example of a transition in the unit of selection and evolution, in this case, from the level of the cell to the level of the multicellular group. Explaining such a major evolutionary transition (sensu Maynard Smith and Szathmary (Maynard Smith and Szathmáry 1995)) on evolutionary principles is a challenge, because explanatory concepts such as fitness and individuality are themselves changing during the transition. Critical in this endeavour to explain the origin of multicellular life is understanding the initial steps involved in the transition from a unicellular life cycle to a life cycle involving groups of cells (Maliet, et al. 2015; Rashidi, et al. 2015). The volvocine green algae provide a compelling model system to deconstruct these early steps (Starr 1968; Kirk and Harper 1986; Kirk 1999, 2005; Herron and Michod 2008; Herron 2009). The value of the volvocines as a model system is due to (i) the relatively recent evolution of multicellularity in the lineage (~234 million years ago (MYA)) (Herron, et al. 2009) when compared to lineages such as land plants and metazoa (600-1000 MYA)(Sharpe, et al. 2015), and (ii) the availability of extant taxa with varying levels of organismal complexity and (iii) with phylogenetic relationships sufficiently resolved for meaningful comparative analyses (Larson, et al. 1992; Coleman 1999; Nozaki, et al. 2000; Nakada, et al. 2008; Herron, et al. 2009; Nozaki, et al. 2014; Nozaki, et al. 2016). The volvocine algae form a monophyletic lineage within the reinhardtinia- clade of the order (Nakada, et al. 2008). Members include single-celled species (e.g., Chlamydomonas reinhardtii), simple colonial forms where multicellularity can be facultative (e.g. the 4-celled Tetrabaena socialis and the 16- celled Gonium pectorale) to complex multicellular taxa with germ-soma differentiation (e.g. Volvox carteri) (Nishii and Miller 2010; Coleman 2012). Multicellular members of the lineage are divided into three families: the Tetrabaenaceae (e.g. T. socialis), the Goniaceae (e.g. G. pectorale) and the Volvocaceae (e.g. Pleodorina starrii and V. carteri) (Coleman 1999; Nozaki, et al. 2000). Comparative genomic analyses of C. reinhardtii (Merchant, et al. 2007), G. pectorale (Hanschen, et al. 2016) and V. carteri f. nagariensis (hereafter V. carteri) (Prochnik, et al. 2010) have revealed protein-coding potential to be similar across the lineage. Consequently, research into volvocine developmental complexity has largely focused on gene exaptation/co-option rather than protein-coding novelty (Miller and Kirk 1999; Kirk 2001; Nishii, et al. 2003; Duncan, et al. 2006; Hanschen, et al. 2014; Hanschen, et al. 2016; Olson and Nedelcu 2016; Grochau-Wright, et al. 2017). Genome comparisons have furthered our understanding of some of the important developmental traits. These include: the association between expanded extracellular matrix (ECM) genes and morphological expansion of the ECM in V. carteri, (Prochnik, et al. 2010; Hanschen, et al. 2016) and the master regulator of germ-soma differentiation in V. carteri regA (Kirk, et al. 1999) and closely related regA-cluster genes (Duncan, et al. 2006; Hanschen, et al. 2014) originated near the split between the Goniaceae and Volvocaceae (Hanschen, et al. 2014; Hanschen, et al. 2016; Grochau-Wright, et al. 2017). In accordance with the expectation that the transition to multicellular life involves construction of a life cycle at the level of the cell group (Maliet, et al. 2015), previous work, including genomic analyses, in the volvocine algae has shown that cell cycle genes are among the earliest diversifying gene families (Prochnik, et al. 2010; Hanschen, et al. 2016). These include the expansion of CYCD1 genes relative to C. reinhardtii in G. pectorale and V. carteri (Prochnik, et al. 2010; Hanschen, et al. 2016) and (iv) a role for the retinoblastoma pathway in cell cycle modifications associated with multicellularity (Hanschen, et al. 2016). The palintomic multiple- fission life cycle of the unicellular C. reinhardtii includes a brief phase as a group where division has occurred but separation is incomplete. This brief phase is rapidly followed by separation during which there is no growth. In contrast, cell growth in T. socialis occurs in the group context. The order of cell cycle events has changed in T. socialis such that the separation stage occurs after cell growth and before division in the group life cycle instead of before growth in the unicellular cycle (Maliet, et al. 2015). Another major difference between the multicellular program of division and the unicellular program is that the maximum number of divisions (n) is genetically determined in the multicellular taxa and not in C. reinhardtii. To date, sequenced colonial and multicellular volvocines are from the Goniaceae and Volvocaceae families.. The best-studied member of the most deeply rooted Tetrabaenaceae family (fig. 1) is the 4-celled T. socialis (supplementary fig. S1), which is amongst the “simplest” of known colonial multicellular eukaryotes (Nozaki 1986; Arakaki, et al. 2013). The nuclear genome of T. socialis is analysed here (organelle genomes have been described elsewhere (Featherston, et al. 2016)). Our analyses confirm the limited differences in protein coding potential between the unicellular C. reinhardtii and the colonial/multicellular taxa. A small set of orthologous gene families were identified that arose at the evolution of colonial living and an even smaller set of families showed expansion in colonial/multicellular taxa. Although few families were gained or expanded, these include a number of genes with putative functions related to DNA repair, surveillance of DNA damage or are homologous to metazoan oncogenes. We show here that this diversification of cell cycle genes began at the earliest possible stage with the evolution of a 4-celled Tetrabaena colony and that the genetic determination of colony size was likely one of the earliest group properties to evolve. Methods to detect positive and accelerating selection, as well as other findings, identified numerous genes involved at most points in the ubiquitin proteasomal pathway (UPP) and we discuss a role for the UPP in the genetic control of cell number.

Results and Discussion Genome Assembly A final Tetrabaena socialis (NIES-571) genome assembly was generated after combining Allpaths-LG (Gnerre, et al. 2011) and SPAdes (Bankevich, et al. 2012) assemblies from coverage in excess of 400X. The combined meta-assembly produced 20,363 contigs, 5,856 scaffolds and totalled 133MB in length. The N50 values for contigs and scaffolds were 7.4KB and 145.9KB respectively (table 1). The T. socialis genome is highly GC-skewed with a GC content of 66%. Kmer analyses with Jellyfish (v2.0.0) (Marcais and Kingsford 2011) and GenomeScope (Vurture, et al. 2017) (e.g. supplementary fig. S2), Allpaths-LG and BBtools (http://jgi.doe.gov/data- and-tools/bbtools/) did not converge on the same estimated genome size or repeat content (supplementary table 1). The estimated genome is likely to be at least ~120MB with repeat content in excess of 20%, which is similar to C. reinhardtii (19%) and V. carteri (20%) (Prochnik, et al. 2010).

Transcriptome Assembly and Annotation A de novo transcriptome was generated using Bridger De Novo (r2014-12-01) (Chang, et al. 2015) from 75 million paired RNAseq reads (v4 SBS 2x125bp chemistry). TransRate (1.0) (Smith-Unna, et al. 2016) was used to evaluate assembly quality and to filter contigs (n=23,559). The TransRate Optimal Assembly Score was 0.362. By comparison, 50% of 155 publically available de novo transcriptomes have been shown to have a TransRate Optimal Assembly Score of ≤0.35 (Smith-Unna, et al. 2016). A total of 21,823 (92%) assembled transcripts mapped to the genome assembly (90% mapped with >95% of contig length). Computationally derived annotations were generated using the MAKER2 pipeline (v2.31.3) (Holt and Yandell 2011) with evidence for computational predictions derived from de novo assembled transcripts, protein sequences from publically available volvocine genomes (from C. reinhardtii, G. pectorale and V. carteri, supplementary table 2) as well as the SWISS- PROT database (Boutet, et al. 2016) of curated proteins. Automated annotations produced 15,513 protein-coding loci, which is a value similar to other volvocines (table 1). A total of 65% of proteins modelled by MAKER2 were in possession of at least 1 Pfam-A domain (Finn, et al. 2014) and 93.8% of Eukaryotic Clusters of Orthologous Genes (KOG) domains (Parra, et al. 2007) could be identified in the predicted proteome. Non-redundant homologs for at least 91% of C. reinhardtii KOG proteins (n=406) (Parra, et al. 2007) were identified in the T. socialis predicted proteome, indicating that at least 90% of genes were modelled. Manual annotation of gene families (e.g. pherophorins or supplementary table 24 orthologues) showed similar levels of model coverage. The average intron length in T. socialis is greater than those of other volvocines (table 1). This is likely due to a mixture of the highly repetitive nature of the genome and due to assembly artefacts that are the result of inflated average mate-pair insert distances spanning exons. Using OrthoMCL (v2.09) and OrthoVenn orthologous proteins shared by all volvocines numbered 5199 (supplementary fig. 3). BLASTP searches to volvocine proteins as well as UNIPROT databases identified significant matches (E=10-7) for 90% of T. socialis predicted proteins (supplementary table 3) and the percentage of reciprocal BLAST matches to proteins from model organism C. reinhardtii was comparable for T. socialis when compared to G. pectorale and V. carteri (supplementary table 4).

Domain content Comparing domains in C. reinhardtii with those of the multicellular taxa (T. socialis, G. pectorale, and V. carteri) in a 2-way comparison identified only 61 domains as significantly enriched (p=0.05). Of these only 18 were enriched in the multicellular taxa (the other 43 were enriched in C. reinhardtii). A further 30 domains were identified as present in T. socialis, G. pectorale and V. carteri but absent from the proteome of C. reinhardtii (supplementary table 5). Domains that were significantly enriched (n=61) and domains unique to the multicellular taxa (n=30) were combined (total n=91) and examined for their presence in selected eukaryotic model species (supplementary table 2, supplementary fig. S4). The majority of these domains were not enriched in volvocine or indeed chlorophyte specific domains (supplementary fig. S5) but rather the majority (n=59) were of a more universal eukaryotic origin (supplementary fig. S5). These findings are consistent with previous volvocine genomes, which suggested that extensive domain novelty was not a major feature of the evolution of multicellularity in the volvocines (Prochnik, et al. 2010; Hanschen, et al. 2016). With the exception of the gametolysin domain (Hanschen, et al. 2016) enriched domains between C. reinhardtii and V. carteri (Prochnik, et al. 2010) mostly show loss in V. carteri (supplementary table 6) and this is particularly evident for ankyrin domains (supplementary table 7).

Phylogenetic reconstruction of gene family expansion and contraction OrthoMCL (2.09) (Li, et al. 2003) was used to generate gene family clusters from the proteomes of all available genome sequenced (n=13) (supplementary table 2)). Following Hanschen et al (2016), parsimony reconstruction of gene family expansion and contraction was performed using the Count software package (v10.04) (Csuros 2010) using Dollo and Wagner’s parsimony (symmetric and asymmetric). We focused on the results of Wagner’s symmetrical parsimony as this produced the most conservative set of gene families expanded at the origin of multicellularity (supplementary table 8). The largest expansion of gene families (n=2657) in the phylogeny was for the (at the split between the chlorophyta orders Chlorophyceae and Trebouxiophyceae), which points towards substantial protein- coding diversification and increasing complexity in this order since separating from other lineages (fig. 2) (Hanschen, et al. 2016). However, it remains unknown if this increased proteome complexity was important for the evolution of multicellularity. Using results from Wagner’s symmetric parsimony a set of 131 gene families was gained/originated at the origin of volvocine coloniality/multicellularity (fig. 2). For simplicity these genes were labelled as the “origin of multicellularity” (ORIMC) gene set (supplementary table 9). Argot2 (http://www.medcomp.medicina.unipd.it/Argot2/) (Falda, et al. 2012), InterProScan (Quevillon, et al. 2005) as well as BLASTP (Altschul, et al. 1997) homology to C. reinhardtii proteins were used to annotate the ORIMC gene set. A number of families that originate at the ORIMC are candidates for further exploration many of which have putative functions related to developmental or regulatory processes. Of particular interest may be those homologous to DNA repair, DNA damage surveillance and cancer genes, including: breast cancer susceptibility1-like (BRCA1- like) gene, colon cancer-associated protein Mic1-like gene, RECQ-helicases, structural maintenance of chromosome 3 proteins (with a recF/recN domain), MutL C-terminal dimerization domain containing proteins and a family of proliferating cell nuclear antigen domain containing proteins (Strzalka and Ziemienowicz 2011). In relation to the Ubiquitin Proteasomal Pathway (UPP), a family of AAA-type ATPases with a possible role in proteosomal degradation was gained (Bar-Nun and Glickman 2012). Of the 131 ORIMC families, homology (E=10-10) to C. reinhardtii proteins was not identified for proteins from 30 families. Using TBLASTN to search the C. reinhardtii genome with proteins from these 30 families it was found that 16 gene families had no homology at E=10-5 while a further 5 showed no homology at E=10- 10. There were 2 instances where alignments indicated that an orthologue is present in C. reinhardtii while all other BLAST alignments (E<10-10) were partial or due to repetitive motifs. Overall, only 8 gene ontology (GO) terms could be assigned to the 30 families by Argot2 and, furthermore, BLASTP searches against the RefSeq non- redundant (O'Leary, et al. 2016) and UNIPROT-TREMBL databases (Schneider, et al. 2009) identified only 3 families where matches were not to volvocine proteins, suggesting that most of the 30 families were unlikely to have been gained by horizontal transfer events. Families with no homology to C. reinhardtii proteins were likely either lost in C. reinhardtii or arose de novo near the origin of multicellularity. Homology to C. reinhardtii proteins could be identified for proteins from the remaining ORIMC families (n=101), however, it was not determined for how many a C. reinhardtii orthologue might be present. The exclusion of C. reinhardtii proteins from these 101 OrthoMCL Markov family clusters is most likely accounted for by protein diversification, including paralogue divergence from gene duplication events but is not suggestive of ab initio gene formation. Gene families expanded in the multicellular taxa relative to C. reinhardtii numbered only 27 (supplementary table 10). Amongst expanded families was a family of MutL-related genes and, furthermore, enrichment analysis of expanded families using C. reinhardtii v5 annotations (Lopez, et al. 2011; Blaby, et al. 2014) identified the KEGG pathway “Mismatch repair”. In terms of gene function, families of protein kinases and DNA-repair related genes were identified both amongst families gained and expanded at the ORIMC. As molecules with diverse functions it is unsurprising to identify diversification of protein kinases (Francino 2005) and protein kinase expansions have previously been associated with the evolution of multicellularity or increasing organismal complexity in other lineages (Rensing, et al. 2008; Stajich, et al. 2010; de Mendoza and Ruiz- Trillo 2011; Miller 2012; Schultheiss, et al. 2012; Glockner, et al. 2016). The expansion or diversification of DNA repair genes (Rad51-type) has been identified in the multicellular brown alga Ectocarpus siliculosus (Cock, et al. 2010), the importance of which is not known and the significance of gained/expanded families of DNA repair-related genes in the multicellular volvocines requires further investigation. There is currently no evidence to suggest that replication fidelity is improved in multicellular volvocines and the general trend across life suggests that mutation rates rise with increased organismal complexity (Lynch, et al. 2016). This does not, however, preclude specific exceptions to this general trend. Based on the premise that increased organismal complexity potentially results in increased risk of harm from mutations and mutator genotypes (Lynch 2008) then a possibility for gains in DNA repair-related genes in the multicellular taxa might be related to increased redundancy in repair mechanisms. Both protein kinase signalling and DNA repair pathways may play a role in regulating the cell cycle (Haring, et al. 1995; Branzei and Foiani 2008).

Protein orthologues with evidence of codon-level positive selection Codeml (Yang 2007) site models of positive selection were applied within the Potion pipeline (v1.1.2) (Hongo, et al. 2015) to identify evidence for positive selection amongst orthologue families. Analysis was restricted to OrthoMCL families without paralogues with a single protein from each of the 4-volvocine taxa and 73 family clusters with evidence of positive selection were identified (q=0.05 and m7/8 models, supplementary table 11). It is expected that substantially more families with evidence of positive selection would be detected if paralogues were included. GO enrichment analysis applied to C. reinhardtii v5.5 annotations (Blaby, et al. 2014) from the 73 families identified 82 enriched terms. Most enriched terms were related to metabolic processes (supplementary table 12). Selection acting upon metabolic processes might be evidence of metabolic adaptations to group living since the initial step of group formation is stressful to organisms due to the challenges of dealing with dying cells and nutrient and waste exchange (Durand, et al. 2016; Sathe and Durand 2016). In addition, terms associated with developmental processes were amongst the enriched terms. A number of terms related to the UPP were enriched that ranged from targeting for proteolysis by ubiquitination to proteasomal components. In addition, the MapMan annotation term (Klie and Nikoloski 2012) “protein degradation ubiquitin” and the KEGG pathway “Proteosome” pathway were enriched. Genes from the 73 families with putative functions of interest include genes related to mRNA processing/splicing and polymerization, protein translation, protein shuttling, Ca2+ sensing and signalling, UPP-related genes, chromatin-remodelling and FIZZY-related kinase activity.

Genome-level scans for conservation and accelerating selection A whole-genome multiple sequence alignment of C. reinhardtii, T. socialis, G. pectorale, and V. carteri was generated, referenced to C. reinhardtii and examined for signature of selection using the PHAST suite of algorithms (v1.1.) (Hubisz, et al. 2011). Highly conserved non-coding elements identified with PhastCons in the volvocine genomes were remarkably sparse - only 101 elements of average length 58.9bp were identified (5 951bp) (supplementary table 13 and supplementary table 14). This was almost entirely due to the minimal alignment stemming from non- coding sequences, which amounted to only 57 898bp. The paucity of alignments stemming from non-coding regions can be explained by a high-degree of non-coding genome diversification, simple repeats masking (including removal of duplicate alignments) and the gap content of volvocine genome assemblies, which are highest in the T. socialis (27%) and G. pectorale (21%) assemblies. PhyloP (Pollard, et al. 2010) was used to identify features that, relative to C. reinhardtii, showed evidence of accelerating selection in the multicellular taxa. Features were divided according to C. reinhardtii v5.5 annotations into genes (whole genes including untranslated regions), CDS, 3’UTR, 5’UTR and intergenic regions. With false discovery rate correction (q=0.05) 129 genes were identified as accelerating in the multicellular taxa relative to C. reinhardtii (supplementary table 15) of which 24 were accelerating faster than the neutral rate (the remaining 105 genes were accelerating in the multicellular taxa relative to C. reinhardtii but not faster than the neutral rate). Amongst the 24 more rapidly evolving genes were the retinoblastoma (MAT3/RB) gene and an SPC97/SPC98 member of the spindle pole body component (also known as Gamma tubulin interacting protein 2 (GCP2)). GCP genes in plants are involved in microtubule cytoskeleton organisation and spindle integrity (Janski, et al. 2012). In addition, the tubulin beta chain 1 gene was accelerating relative to C. reinhardtii. Both genes may be of interest for further investigation because microtubule interactions are important for volvocine developmental traits such as embryo inversion in V. carteri (Nishii, et al. 2003). GO enrichment analysis of the 129 genes identified enriched annotations such as “generative cell differentiation”, “regulation of cell cycle” and “actin filament-based movement”. Enriched molecular functions included “cyclin binding” and “ubiquitin activating enzyme activity”. Enriched MapMan terms included “RNA transcription”, “cell organisation”, “protein degradation ubiquitin E1” and “cell division” (supplementary table 16). The term “accelerating selection” implies that loci were evolving more rapidly than the neutral rate or, for branch tests, acceleration indicates that loci are evolving more rapidly on a particular phylogenetic branch than for the rest of the tree. Acceleration might indicate positive selection but may also indicate relaxation of purifying selection and/or even ultimately pseudogenisation. Therefore, in order to identify if genes accelerating in the multicellular taxa showed evidence of positive selection Potion analysis was repeated specifically for families containing accelerating genes, but with slightly relaxed filters (see supplementary text) employed and with the inclusion of families containing paralogues. The 129 genes were associated with 124 orthoMCL families, of which 67 were valid within the Potion pipeline and 25 (37%) displayed evidence of positive selection (q-value=0.05). We considered this to be a substantial proportion because accelerating genes without evidence for positive selection might be undergoing pseudogenisation or acceleration may be occurring in non-coding parts of gene sequences rather than in CDS. A total of 42 CDS features were identified (from 38 genes –supplementary table 17) as accelerating in the multicellular taxa relative to C. reinhardtii with 16 accelerating faster than the neutral rate and 6 that were also identified as accelerating in the whole-gene analysis. Accelerating CDS features of interest include a putative chromatin remodelling gene and a cytoplasmic dynein heavy chain. Corroborating whole gene analysis and indicating that the coding regions of genes are under acceleration are the ubiquitin activating enzyme 2, ubiquitin ligase 1, GCP2, tubulin beta chain 2, a putative transcription regulator and a putative histone protein. GO enrichment analysis identified the cellular compartment terms “microtubule organising centre”, “SWI/SNF complex” and “SWI/SNF-type complex” and the molecular function term “ubiquitin activating enzyme activity” as enriched. As per the gene-level analysis, accelerating CDS regions were examined for positive selection using Potion for 34 OrthoMCL families, which were filtered to 14 and of which 11 were significant for positive selection (79%). This indicates a good correlation, albeit from a limited number of families, between phyloP results and protein-level Codeml analysis, suggesting that the majority of accelerating CDS are under positive selection rather than under-going pseudogenisation. Perhaps due to ~234MYA of separation and low-levels of alignment stemming from non-coding regions no 3’UTRs, 5’UTRs or intergenic regions were identified as accelerating significantly faster than the neutral rate of non-coding DNA evolution. This is similar to the non-coding regions of volvocine organelle genomes, which are replete with extensive tracts of non-coding regions intractable to alignment (Smith and Lee 2009; Smith, et al. 2013). Diversification of non-coding regions also suggests extensive divergence of regulatory elements and gene regulatory mechanisms (Kianianmomeni 2015), which in other lineages have been associated with the evolution of multicellularity (Sebe-Pedros, et al. 2016). As per findings from proteome-wide scans of positive selection amongst orthologues, accelerating genes and CDS with evidence of positive selection include genes with functions related to the UPP (e.g. ubiquitin- activating enzyme 2 and ubiquitin ligase 1), Ca2+ sensing and signalling (e.g. calmodulin 4) and protein shuttling (e.g. heat shock protein 70), which is further evidence of selection acting on these processes. Genes with these functions are likely to be important for basic cellular functions and regulatory processes and have been associated with cell cycle regulatory functions (Goldknopf, et al. 1980; Ciechanover, et al. 1984; Kao, et al. 1985; Goebl, et al. 1988; von Kampen and Wettern 1991; Cheng, et al. 2005; Choi and Husain 2006).

Cyclins and the retinoblastoma gene Theoretical models (Maliet, et al. 2015; Rashidi, et al. 2015) and observations of traits in extant taxa such as T. socialis (Arakaki, et al. 2013) suggest that alterations to the cell cycle was one of the primary and contingent steps to emerge at the origin of group living (Kirk 2005; Herron and Michod 2008). The underlying genetic components that control the cell cycle are therefore of special interest. The primary genetic components of the cell cycle regulation pathway are similar throughout the volvocine lineage (Bisova, et al. 2005; Merchant, et al. 2007; Olson, et al. 2010; Prochnik, et al. 2010; Cross and Umen 2015) and are generally similar to those found in most eukaryotes. However, the cyclin D1 (CYCD1) gene has expanded through tandem duplication from a single copy in C. reinhardtii to four copies in V. carteri and G. pectorale (Prochnik, et al. 2010; Hanschen, et al. 2016). The expansion of CYCD1 was identified in the T. socialis genome, three copies of CYCD1 were identified and confirmed using rapid amplification of cDNA (RACE) and Sanger sequencing (fig. 3, supplementary table 18 and supplementary table 19). Two of the CYCD1 copies are located in close proximity on the same scaffold while the third copy is found on a separate scaffold (supplementary fig. S6). The expansion of CYCD1 genes in T. socialis suggests that CYCD1 expansion traces back to the formation of colonial multicellularity and that an additional duplication event (resulting in four copies) most probably occurred later, after the G pectorale and V carteri lineage split from the Tetrabaena lineage. CYCD1 expansion in the simple colonial T. socialis supports the hypothesis that the expansion was integral for the development of colonial multicellularity. As per other volvocines (Merchant, et al. 2007; Prochnik, et al. 2010; Hanschen, et al. 2016) the repertoire of cyclin-dependent kinases is similar in T. socialis (supplementary fig. S7). In most eukaryotic lineages including the volvocines D-type cyclins interact with the retinoblastoma (RB) gene (a notable exception being yeast) (Muller, et al. 1994; Umen and Goodenough 2001; Olson, et al. 2010; Desvoyes, et al. 2014; Narasimha, et al. 2014; Li, et al. 2016). The volvocine RB homolog (also known as MAT3) is important for the formation of colonial multicellularity (Hanschen, et al. 2016) and may be involved in oogamous sexual reproduction in V. carteri (Kianianmomeni, et al. 2008; Ferris, et al. 2010; Hiraide, et al. 2013). Transformation of a MAT3/RB negative strain of C. reinhardtii with the G. pectorale RB gene was found to cause colony formation in the unicellular C. reinhardtii (Hanschen, et al. 2016). The transformative effect of the G. pectorale RB gene was proposed to be due to specific differences of the RB protein structure observed between C. reinhardtii RB and the G. pectorale and V. carteri RB proteins. Including T. socialis, G. pectorale, V. carteri and other published RB protein sequences from Yamagishiella, Eudorina, and Pleodorina (Hiraide, et al. 2013) the colonial/multicellular taxa were confirmed to possess a shortened linker region between the RB-A and RB-B domains (supplementary table 20 and fig. S8). Phylogenetic analysis of MAT3/RB proteins was in accordance with Hiraide et al. (2015) (supplementary fig. 9). Following Hanschen et al. (2016) cyclin-dependant kinase (CDK) phosphorylation targets within the linker region between RB-B and the carboxy (C)-terminal domain were identified in T. socialis, Eudorina spp and Yamagishiella spp (fig. 4). In addition, the PlantPhos server was used to computationally identify additional targets within this region in all examined taxa. Findings show that phosphorylation targets within this region are not an exclusive feature of the C. reinhardtii MAT3/RB protein and do not correlate with the evolution of multicellularity. Results from the PlantPhos server, in particular, show extensive variation in putative phosphorylation targets amongst MAT3/RB protein sequences and, therefore, altered MAT3/RB phosphorylation may be of significance. However, the absence of phosphorylation targets between the RB-B and RB-C domains cannot be associated with the emergence of multicellularity. To further investigate the RB gene, the exon/intron boundaries of RB genes from C. reinhardtii, T. socialis, G. pectorale, and V. carteri were examined (supplementary fig. S10). Aligned gene structures identified three introns present in T. socialis, G. pectorale, and V. carteri but are absent in C. reinhardtii. The exact intron positions and sequence composition in these regions are not identical and, thus, gene structures are not identical. However, the intron gains are at least indicative of a predisposition for intron gain at these specific locations and, of more interest, introns gained in these 3 regions are limited to the multicellular taxa. It is unknown if these introns are important for transcript regulation or if the intron gains have no influence on function. However, if the former is true, it might contribute towards the transformative ability of G. pectorale MAT3/RB to induce colonies in a strain of C. reinhardtii lacking MAT3/RB. From analysis of cell cycle genes in T. socialis we can conclude that the expansion of D1-type cyclins and alterations to the MAT3/RB gene (shortened RB-A to RB-B linker region) trace near to the origin of colonial of group living, which further supports a role for these genetic alterations for a multicellular program of cell division.

The Ubiquitin Proteasomal Pathway Although other classes of genes identified in the current investigation are of interest in relation to the genetic control of division number (e.g. protein kinases) the UPP is of particular interest because our various analyses identified genes related to most steps in this pathway (supplementary table 21). Targeted degradation of proteins via the UPP is integral to various cellular processes including but not limited to the removal of misfolded proteins, deflagellation and the regulation of cell cycle proteins (Teixeira and Reed 2013). The basic components of the pathway include a series of enzymatic steps to bind ubiquitins (or ubiquitin-like molecules) to proteins whereupon they are shuttled to the proteasome and degraded, releasing free ubiquitin molecules and degraded protein products. Specialised ligases (E3 ligases) that target specific proteins take the form of RING-U-box complexes, which vary but may consist of cullins, transducins, E2 conjugators and various other molecules (e.g. SKP1 in the Skp1- Cullin-F-box complex) (Bai, et al. 1996; Hershko 2005). Complexes such as the Skp1-Cullin-F-Box complex form the basis of key cell cycle checkpoints. Migration to the proteasome may be facilitated and regulated by chaperones such as heat-shock proteins (Noga, et al. 1997; Park, et al. 2007; Shiber, et al. 2013). Aspects of the UPP identified in our analyses (supplementary table 21, fig. 5) include: ubiquitin activation, conjugation and ligation enzymes (E1, E2 and E3), shuttling and regulation of ubiquitinated targets to the proteasome (HSP70), proteasomal subunits/ATPases as well as components of the cullin-RING ubiquitin ligase-type complexes. The scope and varying nature (e.g. positive selection versus gained families) of UPP-related modifications identified in our analyses render targeted functional characterisation challenging. Together, however, they point to systemic modifications of this pathway in the multicellular taxa. Regulation via the UPP is important for many aspects of cell biology and modifications of this pathway will have far-reaching and unpredictable consequences and hence it cannot be stated with certainty that UPP-related genes identified in our analyses have specifically contributed towards increased developmental complexity in the lineage. Based on of the well-known role in many organisms of the UPP in cell cycle regulation where it has been shown to be an integral means of regulating cell cycle progression (Goldknopf, et al. 1980; Ciechanover, et al. 1984; Pagano 1997; Nakayama and Nakayama 2005, 2006b, a; Tu, et al. 2012; Vlachostergios, et al. 2012; Teixeira and Reed 2013) the modifications of the UPP pathway will impact cell cycle regulation in the volvocines. The UPP is not well characterized in C. reinhardtii (von Kampen, et al. 1995; Vallentine, et al. 2014) although a relationship between the UPP and the C. reinhardtii cell cycle has been demonstrated (von Kampen and Wettern 1991; von Kampen, et al. 1995). Specifically, the expression of ubiquitin encoding genes is associated with stress responses in C. reinhardtii and their expression has furthermore been associated with progression through the cell cycle (von Kampen and Wettern 1991; von Kampen, et al. 1995). In C. reinhardtii the number of divisions is highly correlated with the size of the mother cell entering the division phase of the life cycle (Craigie and Cavalier-Smith 1982; Cross and Umen 2015). A key step that evolved early in the evolution of volvocine multicellularity was the modification of the genetic program to control the number of rounds of cell division (Kirk 2005; Herron and Michod 2008). Specifically, the maximum number of divisions is genetically controlled in all colonial volvocines and this feature is even present in the comparatively simple 4-celled T. socialis (Arakaki, et al. 2013). An intriguing feature of the multicellular volvocines is that the maximum number of cell divisions is a characteristic feature of each species e.g. 8-celled T. socialis colonies are rare (requiring 3 divisions). This suggests that the genetic control program is differentially regulated. Such regulation may occur through various mechanisms, not the least being alternative regulation of key cell cycle regulatory molecules such as RB, which has been shown to be a key regulatory molecule at specific points in the cell cycle (Umen and Goodenough 2001; Olson, et al. 2010; Hanschen, et al. 2016). Another molecule of interest is CDKG1, because CDKG1 concentration has been correlated with the number of cell divisions in C. reinhardtii, and after the final round of palintomic division in C. reinhardtii CDKG1 is undetectable (Li, et al. 2016). Evidence suggests that CDKG1 undergoes degradation between rounds of palintomic divisions (Li, et al. 2016) and hence serves as example of how targeted degradation may control cell division in the volvocines. Modification to the UPP identified in the current investigation associate with even the deeply-rooted T. socialis, therefore, although we do not present specific evidence these modifications are associated with a genetic program to control the number of divisions in the volvocines, we propose that because of the known role of the UPP in the cell cycle regulation, including in C. reinhardtii, that the modifications are likely to be important for this key evolutionary development. The role of the UPP might conceivably be through regulated degradation of other determinants of the number of divisions such as CDKG1 or MAT3/RB or other cell cycle genes such as the CYCD1 genes.

Additional analysis of volvocine families associated with developmental complexity Extracellular matrix proteins The two major protein families found in the extracellular matrix (ECM) of volvocines are the pherophorins and matrix metalloproteases (Godl, et al. 1995; Godl, et al. 1997; Sumper and Hallmann 1998; Hallmann 1999). A total of 36 pherophorin and 34 matrix metalloproteases (MMPs) genes were identified in T. socialis. Separate phylogenies of manually curated pherophorins and MMPs that included proteins from C. reinhardtii (Merchant, et al. 2007; Blaby, et al. 2014), T. socialis, G. pectorale (Hanschen, et al. 2016) and V. carteri (Prochnik, et al. 2010; Hanschen, et al. 2016) highlight orthologues shared throughout the lineage as well as species-specific expansions (supplementary fig. S11 and fig. S12). As per previous findings (Hanschen, et al. 2016), widespread expansion of pherophorins and MMP genes are not a feature of the colonial taxa such as T. socialis or G. pectorale but rather are limited to V. carteri and are, therefore, best associated with the expansion of the extracellular matrix rather than initial transformation of the cell wall into an extracellular matrix. The C. reinhardtii cell wall hydroxyproline-rich glycoprotein 1 (GP1) gene (Adair, et al. 1987) is absent from the genomes of T. socialis, G. pectorale and V. carteri (supplementary table 11). The significance of this is unknown, however, it might be important for understanding the initial transformation of cell wall components into an ECM, which is a feature that all multicellular volvocines share.

RegA-related sequences. The evolution of germ-soma division of labour is a key development in complex multicellular volvocines such as Pleodorina and Volvox spp. The somatic regenerator (regA) is a putative transcription factor that is expressed in Volvox carteri somatic cells where it is thought to suppress cell growth via suppression of chloroplast biogenesis. Homologues of regA (regA-like sequences (rls) genes) are found throughout the lineage (Harper, et al. 1987; Kirk 1994; Choi, et al. 1996; Kirk 1997; Meissner, et al. 1999; Duncan, et al. 2006; Duncan, et al. 2007; Hanschen, et al. 2014; Hanschen, et al. 2016; Grochau-Wright, et al. 2017). The genes rlsA, rlsB, rlsC and rlsN/O have been identified as syntenically linked to regA in a number of Volvocaceae and show a greater phylogenetic relationship to regA than other rls genes (Hanschen, et al. 2014; Grochau-Wright, et al. 2017). Together with regA these genes have been labelled as regA-cluster genes and are absent in C. reinhardtii and G. pectorale. A total of 8 RLS genes were identified in T. socialis, which is the same number as in G. pectorale but less than C. reinhardtii (n=12) and V. carteri (n=14). Phylogenetic relationships (supplementary fig. S13) of RLS proteins from C. reinhardtii, T. socialis, G. pectorale, and V. carteri f. nagariensis show that regA- cluster genes are absent in T. socialis and confirm the close relationship of RLS1/RlsD proteins to RegA-cluster proteins (Duncan, et al. 2006; Duncan, et al. 2007; Hanschen, et al. 2014)(supplementary fig S8), consistent with the hypothesis that the regA-cluster genes evolved from an ancestral RLS1/rlsD gene (Hanschen, et al. 2014; Hanschen, et al. 2016).

Fasciclin-domain containing proteins A V. carteri fasciclin-domain containing protein known as the algal-cell adhesion molecule (algal-CAM) has previously been shown to be required for V. carteri embryogenesis and embryo inversion (Huber and Sumper 1994). Protein phylogenetic relationships of fasciclin-domain containing proteins from C. reinhardtii (N=21), T. socialis (n=18), G. pectorale (n=17) and V. carteri (n=17) show that 2 V. carteri proteins (Jgi|Volca1|127393 and Jgi|Volca1|108097) cluster separately with the annotated V. carteri algal-CAM protein (Jgi|Volca1|127389) (fig. 6). Orthologues of algal-CAM were not found in C. reinhardtii, T. socialis, or G. pectorale. While G. pectorale undergoes a form of partial inversion V. carteri undergoes complete embryo inversion. The absence of algal-CAM orthologues in C. reinhardtii, T. socialis and G. pectorale supports that algal-CAMs are required for complete embryo inversion in V. carteri (Huber and Sumper 1994). As yet it is unknown how CAMs are involved in inversion. A possible role might be that CAMs are required for reconstitution of cellular connectivity after cellular inversion around cytoplasmic bridges and are candidates for further exploration as cell adhesion molecules, which may play an important role in volvocine developmental biology.

Conclusions We present findings from the nuclear genome of one of the simplest eukaryotic colonial multicellular organisms- the 4-celled volvocine T. socialis – as well as comparative genomic analyses with the three other available volvocine genomes. In keeping with previous findings, extensive increases in proteome complexity are not associated with the origin of coloniality in this lineage (Prochnik, et al. 2010; Hanschen, et al. 2016). However, gene families gained near the origin of coloniality were enriched in genes related to developmental processes such as DNA repair, protein kinase activity, cell adhesion and extracellular functions. In addition, over 200 genes with evidence of either positive or accelerating selection trace back to T. socialis, which suggests that the evolution of volvocine multicellularity has been shaped by numerous genetic loci. Diversification of cell cycle genes as well as modification to components of the UPP were identified that emerged at the beginning of colonial living and associate with the evolution of a genetic program for the control of cell number.

Materials and Methods Culturing, nucleic acid extraction and sequencing. An axenic preparation of T. socialis NIES-571 was cultured in tri-acetate-phosphate medium (Gorman and Levine 1965) as per Arakaki et al (2013). Total DNA was extracted using standard Volvox DNA isolation method (Miller, et al. 1993) and RNA was extracted 2hrs into the dark phase of the life cycle with Thermo Fisher Tri reagent (Waltham, MA, USA). Paired- end sequencing libraries for DNA were prepared using the Illumina TruSeq DNA library preparation kit (San Diego, CA, USA) and libraries with two insert sizes were generated (~180bp and ~450bp). A 6kb insert mate-pair library was generated using the Illumina Nextera Mate-Pair Library Kit (San Diego, CA, USA ) and automated pulse-field size selection on the Sage BioScience Blue-Pippen (0.75% agarose cassette, Beverley, MA, USA). An RNA-seq library was prepared with the TruSeq stranded-mRNA kit (cat # RS-122-2101). Sequencing was performed on an Illumina HiScanSQ (2x100bp), HiSeq 2500 (2x100bp and 2x125bp) and MiSeq (2x300bp).

Genome Assembly. An Allpaths-LG assembly (v48057) (Gnerre, et al. 2011) was generated using mate-pair data and paired-end data from short reads (2x100bp) as well as longer MiSeq reads that were trimmed to 150bp. A SPAdes (v3.1.1) assembly (Bankevich, et al. 2012) was generated using only MiSeq data (2x300bp), using the multi-Kmer approach (maximum kmer of 127) and subsequent scaffolding with mate- pair data was performed with Opera (v1) (Gao, et al. 2011). The Allpaths-LG and SPAdes assemblies were combined using MetaAssembler (Wences and Schatz 2015) with the Allpaths-LG assembly selected as the “master” assembly. Further improvements to the assembly in terms of gap content and accuracy were made using Gapfiller (Boetzer and Pirovano 2012) and Pilon (Walker, et al. 2014). Multi-kmer analysis was performed with Jellyfish (v2.0.0) (Vurture, et al. 2017), Allpaths-LG and BBtools (http://jgi.doe.gov/data-and-tools/bbtools/).

Transcriptome Assembly and De Novo gene prediction. Transcriptome assembly was performed from ~150 million reads using Bridger de novo (r2014-12-01) (Chang, et al. 2015) (k=25) and was examined for quality using TransRate (v1.0) (Smith- Unna, et al. 2016). Automated gene predictions were performed using MAKER2 (v2.31.3) (Holt and Yandell 2011) using protein homology data and and de novo transcripts generated by Bridger assembly. Consensus models were generated by MAKER2 from genes predicted by Augustus (v3.02) (Stanke, et al. 2004), Genemark (v4.10) (Borodovsky, et al. 2003) and SNAP (v2013-11-29) (Korf 2004). Functional annotations of proteins were derived using HMMER3 (v3.1b) (Zhang and Wood 2003), InterProScan v5 (Quevillon, et al. 2005), BLASTP (Altschul, et al. 1997) to the UniProt TREMBL database (Schneider, et al. 2009) as well as by identifying homology to proteins from sequenced volvocines.

Global Coding Content. Proteomes were collected for all available genome sequenced chlorophyta (supplementary table 2), additional transcriptome-derived proteomes from chlorophyta in the Chlamydomonadales order, yeast and selected model metazoans and plants. Domains were identified using hmmsearch (HMMER v3.1) against Pfam-A. Domain content was compared between C. reinhardtii and multicellular taxa with a χ2-test. Domains with significant difference in abundance (p=0.05) as well as domains shared by multicellular volvocines are absent in C. reinhardtii were identified for presence in opistokonta (yeast and selected metazoa), chlorophyta and plantae. Orthologue protein families for all available genome- sequenced chlorophyta were generated with OrthoMCL (v2.09) (Li, et al. 2003). An RAxML (v8.2.0) phylogeny (Stamatakis 2006) was first generated from 1:1 orthologue families (singletons, n=927). The expansion and contraction of Markov families were examined within the Count environment (v10.04) (Csuros 2010) using Dollo and Wagner’s parsimony (symmetric and asymmetric - gain cost of 2). Families gained in the multicellular volvocines and families with expanded protein numbers in the multicellular volvocines were further annotated with manual curation of outputs from BLAST, InterProScan and Argot2 (Falda, et al. 2012) results.

Positive selection in orthologue protein families. OrthoMCL family clustering was repeated except only volvocine proteomes were included. Identification of families with evidence of positive selection was performed using the Potion pipeline (v1.1.2) (Codeml with site-models) (Yang 2007; Hongo, et al. 2015). Only singleton families were examined. Identified families with evidence for positive selection (q=0.05) were examined for Gene Ontology, MapMan and KEGG pathway enrichment via the Algal Functional Annotation Tool (http://pathways.mcdb.ucla.edu/algal/index.html) (Lopez, et al. 2011).

Genomic elements with evidence of conservation or acceleration. Whole genome alignments for volvocine genomes were performed using LAST (Kielbasa, et al. 2011) and TBA-Multiz (Blanchette, et al. 2004) and multiple sequence alignments (MSA) were referenced to the C. reinhardtii genome thereby removing alignments that did not include C. reinhardtii sequences. The PHAST suite of software (v1.1.) was used to analyse elements within the MSA for evidence of conservation and acceleration. After the PhastCons (Pollard, et al. 2010) parameters Gamma and Omega were optimised (supplementary table 22) PhastCons was used to identify discreet highly conserved regions while gene (whole genes including untranslated regions), coding sequences, intergenic regions and untranslated gene regions (3’ and 5’ ends of genes) were examined in separate analyses for evidence of acceleration in the multicellular taxa using phyloP with Benjamini-Hochberg false discovery rate correction applied (Hochberg and Benjamini 1990). Enrichment analyses were performed as per families with evidence of positive selection (Lopez, et al. 2011). Potion was used to examine accelerating genes (identified by either gene-level analysis or CDS-level analysis) for evidence of selection, this time with families containing paralogues included.

Specific Gene Families. Tetrabaena socialis orthologues of cell cycle genes, matrix metalloproteases, pherophorins and regA-related genes were identified. D-type cyclins and the MAT3/RB gene were sequenced using capillary sequencing with primers designed from the genome sequence (supplementary table and supplementary table 23). Furthermore, fasciclin-domain containing proteins from all volvocines were identified. Protein phylogenies of these families were generated that included T. socialis proteins and where possible curated sets of proteins from C. reinhardtii, G. pectorale and V. carteri from Hanschen et al (2016) were used. Orthologues of the “Prochnik Table” genes (Prochnik, et al. 2010; Hanschen, et al. 2016) were identified in T. socialis (supplementary table 24).

References: Adair WS, Steinmetz SA, Mattson DM, Goodenough UW, Heuser JE. 1987. Nucleated assembly of Chlamydomonas and Volvox cell walls. J Cell Biol 105:2373-2382. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402. Arakaki Y, Kawai-Toyooka H, Hamamura Y, Higashiyama T, Noga A, Hirono M, Olson BJ, Nozaki H. 2013. The simplest integrated multicellular organism unveiled. PLoS One 8:e81641. Bai C, Sen P, Hofmann K, Ma L, Goebl M, Harper JW, Elledge SJ. 1996. SKP1 connects cell cycle regulators to the ubiquitin proteolysis machinery through a novel motif, the F-box. Cell 86:263-274. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455- 477. Bar-Nun S, Glickman MH. 2012. Proteasomal AAA-ATPases: structure and function. Biochim Biophys Acta 1823:67-82. Bisova K, Krylov DM, Umen JG. 2005. Genome-wide annotation and expression profiling of cell cycle regulatory genes in Chlamydomonas reinhardtii. Plant Physiol 137:475-491. Blaby IK, Blaby-Haas CE, Tourasse N, Hom EF, Lopez D, Aksoy M, Grossman A, Umen J, Dutcher S, Porter M, et al. 2014. The Chlamydomonas genome project: a decade on. Trends Plant Sci 19:672-680. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14:708-715. Boetzer M, Pirovano W. 2012. Toward almost closed genomes with GapFiller. Genome Biol 13:R56. Borodovsky M, Lomsadze A, Ivanov N, Mills R. 2003. Eukaryotic gene prediction using GeneMark.hmm. Curr Protoc Bioinformatics Chapter 4:Unit4 6. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, Poux S, Bougueleret L, Xenarios I. 2016. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. Methods Mol Biol 1374:23-54. Branzei D, Foiani M. 2008. Regulation of DNA repair throughout the cell cycle. Nature Reviews Molecular Cell Biology 9:297-308. Chang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, Cramer CL, Huang X. 2015. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol 16:30. Cheng Q, Pappas V, Hallmann A, Miller SM. 2005. Hsp70A and GlsA interact as partner chaperones to regulate asymmetric division in Volvox. Dev Biol 286:537- 548. Choi G, Przybylska M, Straus D. 1996. Three abundant germ line-specific transcripts in Volvox carteri encode photosynthetic proteins. Curr Genet 30:347- 355. Choi J, Husain M. 2006. Calmodulin-Mediated Cell Cycle Regulation: New Mechanisms for Old Observations. Cell Cycle 5:2183-2186. Ciechanover A, Finley D, Varshavsky A. 1984. Ubiquitin dependence of selective protein degradation demonstrated in the mammalian cell cycle mutant ts85. Cell 37:57-66. Cock JM, Sterck L, Rouze P, Scornet D, Allen AE, Amoutzias G, Anthouard V, Artiguenave F, Aury JM, Badger JH, et al. 2010. The Ectocarpus genome and the independent evolution of multicellularity in brown algae. Nature 465:617-621. Coleman AW. 2012. A Comparative Analysis of the Volvocaceae (Chlorophyta)(1). J Phycol 48:491-513. Coleman AW. 1999. Phylogenetic analysis of "Volvocacae" for comparative genetic studies. Proc Natl Acad Sci U S A 96:13892-13897. Craigie RA, Cavalier-Smith T. 1982. Cell Volume and the control of the Chlamydomonas Cell Cycle. J Cell Sci 54:173-191. Cross FR, Umen JG. 2015. The Chlamydomonas cell cycle. Plant J 82:370-392. Csuros M. 2010. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26:1910-1912. de Mendoza A, Ruiz-Trillo I. 2011. The mysterious evolutionary origin for the GNE gene and the root of bilateria. Mol Biol Evol 28:2987-2991. Desvoyes B, de Mendoza A, Ruiz-Trillo I, Gutierrez C. 2014. Novel roles of plant RETINOBLASTOMA-RELATED (RBR) protein in cell proliferation and asymmetric cell division. J Exp Bot 65:2657-2666. Duncan L, Nishii I, Harryman A, Buckley S, Howard A, Friedman NR, Miller SM. 2007. The VARL gene family and the evolutionary origins of the master cell-type regulatory gene, regA, in Volvox carteri. J Mol Evol 65:1-11. Duncan L, Nishii I, Howard A, Kirk D, Miller SM. 2006. Orthologs and paralogs of regA, a master cell-type regulatory gene in Volvox carteri. Curr Genet 50:61-72. Durand PM, Sym S, Michod RE. 2016. Programmed Cell Death and Complexity in Microbial Systems. Curr Biol 26:R587-593. Falda M, Toppo S, Pescarolo A, Lavezzo E, Di Camillo B, Facchinetti A, Cilia E, Velasco R, Fontana P. 2012. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC Bioinformatics 13 Suppl 4:S14. Featherston J, Arakaki Y, Nozaki H, Durand PM, Smith DR. 2016. Inflated organelle genomes and circular-mapping mtDNA probably existed at the origin of coloniality in volvocine green algae. European Journal of Phycology:1-9. Ferris P, Olson BJ, De Hoff PL, Douglass S, Casero D, Prochnik S, Geng S, Rai R, Grimwood J, Schmutz J, et al. 2010. Evolution of an expanded sex-determining locus in Volvox. Science 328:351-354. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res 42:D222-230. Francino MP. 2005. An adaptive radiation model for the origin of new gene functions. Nat Genet 37:573-577. Gao S, Sung WK, Nagarajan N. 2011. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 18:1681- 1691. Glockner G, Lawal HM, Felder M, Singh R, Singer G, Weijer CJ, Schaap P. 2016. The multicellularity genes of dictyostelid social amoebas. Nat Commun 7:12085. Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, et al. 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 108:1513-1518. Godl K, Hallmann A, Rappel A, Sumper M. 1995. Pherophorins: a family of extracellular matrix glycoproteins from Volvox structurally related to the sex- inducing pheromone. Planta 196:781-787. Godl K, Hallmann A, Wenzl S, Sumper M. 1997. Differential targeting of closely related ECM glycoproteins: the pherophorin family from Volvox. EMBO J 16:25- 34. Goebl MG, Yochem J, Jentsch S, McGrath JP, Varshavsky A, Byers B. 1988. The yeast cell cycle gene CDC34 encodes a ubiquitin-conjugating enzyme. Science 241:1331-1335. Goldknopf IL, Sudhakar S, Rosenbaum F, Busch H. 1980. Timing of ubiquitin synthesis and conjugation into protein A24 during the HeLa cell cycle. Biochem Biophys Res Commun 95:1253-1260. Gorman DS, Levine RP. 1965. Cytochrome f and plastocyanin: their sequence in the photosynthetic eletron transport chain of Chlamydomonas reinhardtii. Proc Natl Acad Sci U S A 54:1665-1669. Grochau-Wright ZI, Hanschen ER, Ferris PJ, Hamaji T, Nozaki H, Olson BJSC, Michod RE. 2017. Genetic Basis for Soma is Present in Undifferentiated Volvocine Green Algae. J Evol Biol doi: 10.1111/jeb.13100. Hallmann A. 1999. Enzymes in the extracellular matrix of Volvox: an inducible, calcium-dependent phosphatase with a modular composition. J Biol Chem 274:1691-1697. Hanschen ER, Ferris PJ, Michod RE. 2014. Early evolution of the genetic basis for soma in the volvocaceae. Evolution 68:2014-2025. Hanschen ER, Marriage TN, Ferris PJ, Hamaji T, Toyoda A, Fujiyama A, Neme R, Noguchi H, Minakuchi Y, Suzuki M, et al. 2016. The Gonium pectorale genome demonstrates co-option of cell cycle regulation during the evolution of multicellularity. Nat Commun 7:11370. Haring MA, Siderius M, Jonak C, Hirt H, Walton KM, Musgrave A. 1995. Tyrosine phosphatase signalling in a lower plant: cell-cycle and oxidative stress-regulated expression of the Chlamydomonas eugametos VH-PTP13 gene. Plant J 7:981-988. Harper JF, Huson KS, Kirk DL. 1987. Use of repetitive sequences to identify DNA polymorphisms linked to regA, a developmentally important locus in Volvox. Genes Dev 1:573-584. Herron MD. 2009. Many from one: Lessons from the volvocine algae on the evolution of multicellularity. Commun Integr Biol 2:368-370. Herron MD, Hackett JD, Aylward FO, Michod RE. 2009. Triassic origin and early radiation of multicellular volvocine algae. Proc Natl Acad Sci U S A 106:3254- 3258. Herron MD, Michod RE. 2008. Evolution of complexity in the volvocine algae: transitions in individuality through Darwin's eye. Evolution 62:436-451. Hershko A. 2005. The ubiquitin system for protein degradation and some of its roles in the control of the cell-division cycle (Nobel lecture). Angew Chem Int Ed Engl 44:5932-5943. Hiraide R, Kawai-Toyooka H, Hamaji T, Matsuzaki R, Kawafune K, Abe J, Sekimoto H, Umen J, Nozaki H. 2013. The evolution of male-female sexual dimorphism predates the gender-based divergence of the mating locus gene MAT3/RB. Mol Biol Evol 30:1038-1040. Hochberg Y, Benjamini Y. 1990. More powerful procedures for multiple significance testing. Stat Med 9:811-818. Holt C, Yandell M. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491. Hongo JA, de Castro GM, Cintra LC, Zerlotini A, Lobo FP. 2015. POTION: an end- to-end pipeline for positive Darwinian selection detection in genome-scale data through phylogenetic comparison of protein-coding genes. BMC Genomics 16:567. Huber O, Sumper M. 1994. Algal-CAMs: isoforms of a cell adhesion molecule in embryos of the alga Volvox with homology to Drosophila fasciclin I. EMBO J 13:4212-4222. Hubisz MJ, Pollard KS, Siepel A. 2011. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief Bioinform 12:41-51. Janski N, Masoud K, Batzenschlager M, Herzog E, Evrard JL, Houlne G, Bourge M, Chaboute ME, Schmit AC. 2012. The GCP3-interacting proteins GIP1 and GIP2 are required for gamma-tubulin complex protein localization, spindle integrity, and chromosomal stability. Plant Cell 24:1171-1187. Kao HT, Capasso O, Heintz N, Nevins JR. 1985. Cell cycle control of the human HSP70 gene: implications for the role of a cellular E1A-like function. Mol Cell Biol 5:628-633. Kianianmomeni A. 2015. Potential impact of gene regulatory mechanisms on the evolution of multicellularity in the volvocine algae. Commun Integr Biol 8:e1017175. Kianianmomeni A, Nematollahi G, Hallmann A. 2008. A gender-specific retinoblastoma-related protein in Volvox carteri implies a role for the retinoblastoma protein family in sexual development. Plant Cell 20:2399-2419. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. 2011. Adaptive seeds tame genomic sequence comparison. Genome Res 21:487-493. Kirk DL. 1999. Evolution of multicellularity in the volvocine algae. Curr Opin Plant Biol 2:496-501. Kirk DL. 1997. The genetic program for germ-soma differentiation in Volvox. Annu Rev Genet 31:359-380. Kirk DL. 1994. Germ cell specification in Volvox carteri. Ciba Found Symp 182:2- 15; discussion 15-30. Kirk DL. 2001. Germ-soma differentiation in volvox. Dev Biol 238:213-223. Kirk DL. 2005. A twelve-step program for evolving multicellularity and a division of labor. Bioessays 27:299-310. Kirk DL, Harper JF. 1986. Genetic, biochemical, and molecular approaches to Volvox development and evolution. Int Rev Cytol 99:217-293. Kirk MM, Stark K, Miller SM, Muller W, Taillon BE, Gruber H, Schmitt R, Kirk DL. 1999. regA, a Volvox gene that plays a central role in germ-soma differentiation, encodes a novel regulatory protein. Development 126:639-647. Klie S, Nikoloski Z. 2012. The Choice between MapMan and Gene Ontology for Automated Gene Function Prediction in Plant Science. Front Genet 3:115. Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59. Larson A, Kirk MM, Kirk DL. 1992. Molecular phylogeny of the volvocine flagellates. Mol Biol Evol 9:85-105. Li L, Stoeckert CJ, Jr., Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189. Li Y, Liu D, Lopez-Paz C, Olson BJSC, Umen JG. 2016. A new class of cyclin dependent kinase in Chlamydomonas is required for coupling cell size to cell division. Elife 10767. Lopez D, Casero D, Cokus SJ, Merchant SS, Pellegrini M. 2011. Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data. BMC Bioinformatics 12:282. Lynch M. 2008. The cellular, developmental and population-genetic determinants of mutation-rate evolution. Genetics 180:933-943. Lynch M, Ackerman MS, Gout JF, Long H, Sung W, Thomas WK, Foster PL. 2016. Genetic drift, selection and the evolution of the mutation rate. Nat Rev Genet 17:704-714. Maliet O, Shelton DE, Michod RE. 2015. A model for the origin of group reproduction during the evolutionary transition to multicellularity. Biol Lett 11:20150157. Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurences of k-mers. Bioinformatics 27:764-770. Maynard Smith J, Szathmáry E. 1995. The major transitions in evolution. Oxford ; New York: W.H. Freeman Spektrum. Meissner M, Stark K, Cresnar B, Kirk DL, Schmitt R. 1999. Volvox germline- specific genes that are putative targets of RegA repression encode chloroplast proteins. Curr Genet 36:363-370. Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman GB, Terry A, Salamov A, Fritz-Laylin LK, Marechal-Drouard L, et al. 2007. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318:245-250. Miller SM, Kirk DL. 1999. glsA, a Volvox gene required for asymmetric division and germ cell specification, encodes a chaperone-like protein. Development 126:649-658. Miller SM, Schmitt R, Kirk DL. 1993. Jordan, an active Volvox transposable element similar to higher plant transposons. Plant Cell 5:1125-1138. Miller WT. 2012. Tyrosine kinase signaling and the emergence of multicellularity. Biochim Biophys Acta 1823:1053-1057. Muller H, Lukas J, Schneider A, Warthoe P, Bartek J, Eilers M, Strauss M. 1994. Cyclin D1 expression is regulated by the retinoblastoma protein. Proc Natl Acad Sci U S A 91:2945-2949. Nakada T, Misawa K, Nozaki H. 2008. Molecular systematics of Volvocales (Chlorophyceae, Chlorophyta) based on exhaustive 18S rRNA phylogenetic analyses. Mol Phylogenet Evol 48:281-291. Nakayama KI, Nakayama K. 2005. Regulation of the cell cycle by SCF-type ubiquitin ligases. Semin Cell Dev Biol 16:323-333. Nakayama KI, Nakayama K. 2006a. Ubiquitin ligases: cell-cycle control and cancer. Nat Rev Cancer 6:369-381. Nakayama KI, Nakayama K. 2006b. [Ubiquitin system regulating G1 and S phases of cell cycle]. Tanpakushitsu Kakusan Koso 51:1362-1369. Narasimha AM, Kaulich M, Shapiro GS, Choi YJ, Sicinski P, Dowdy SF. 2014. Cyclin D activates the Rb tumor suppressor by mono-phosphorylation. Elife 3. Nishii I, Miller SM. 2010. Volvox: simple steps to developmental complexity? Curr Opin Plant Biol 13:646-653. Nishii I, Ogihara S, Kirk DL. 2003. A kinesin, invA, plays an essential role in volvox morphogenesis. Cell 113:743-753. Noga M, Hayashi T, Tanaka J. 1997. Gene expressions of ubiquitin and hsp70 following focal ischaemia in rat brain. Neuroreport 8:1239-1241. Nozaki H. 1986. Sexual reproduction in Gonium sociale (Chlorophta, Volvocales). Phycologia 25:29-35. Nozaki H, Misawa K, Kajita T, Kato M, Nohara S, Watanabe MM. 2000. Origin and evolution of the colonial volvocales (Chlorophyceae) as inferred from multiple, chloroplast gene sequences. Mol Phylogenet Evol 17:256-268. Nozaki H, Ueki N, Isaka N, Saigo T, Yamamoto K, Matsuzaki R, Takahashi F, Wakabayashi KI, Kawachi M. 2016. A New Morphological Type of Volvox from Japanese Large Lakes and Recent Divergence of this Type and V. ferrisii in Two Different Freshwater Habitats. PLoS One 11:e0167148. Nozaki H, Yamada TK, Takahashi F, Matsuzaki R, Nakada T. 2014. New "missing link" genus of the colonial volvocine green algae gives insights into the evolution of oogamy. BMC Evol Biol 14:37. O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733-745. Olson BJ, Nedelcu AM. 2016. Co-option during the evolution of multicellular and developmental complexity in the volvocine green algae. Curr Opin Genet Dev 39:107-115. Olson BJ, Oberholzer M, Li Y, Zones JM, Kohli HS, Bisova K, Fang SC, Meisenhelder J, Hunter T, Umen JG. 2010. Regulation of the Chlamydomonas cell cycle by a stable, chromatin-associated retinoblastoma tumor suppressor complex. Plant Cell 22:3331-3347. Pagano M. 1997. Cell cycle regulation by the ubiquitin pathway. FASEB J 11:1067-1075. Park SH, Bolender N, Eisele F, Kostova Z, Takeuchi J, Coffino P, Wolf DH. 2007. The cytoplasmic Hsp70 chaperone machinery subjects misfolded and endoplasmic reticulum import-incompetent proteins to degradation via the ubiquitin-proteasome system. Mol Biol Cell 18:153-165. Parra G, Bradnam K, Korf I. 2007. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23:1061-1067. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A. 2010. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res 20:110-121. Prochnik SE, Umen J, Nedelcu AM, Hallmann A, Miller SM, Nishii I, Ferris P, Kuo A, Mitros T, Fritz-Laylin LK, et al. 2010. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri. Science 329:223-226. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. 2005. InterProScan: protein domains identifier. Nucleic Acids Res 33:W116-120. Rashidi A, Shelton DE, Michod RE. 2015. A Darwinian approach to the origin of life cycles with group properties. Theor Popul Biol 102:76-84. Rensing SA, Lang D, Zimmer AD, Terry A, Salamov A, Shapiro H, Nishiyama T, Perroud PF, Lindquist EA, Kamisugi Y, et al. 2008. The Physcomitrella genome reveals evolutionary insights into the conquest of land by plants. Science 319:64- 69. Sathe S, Durand PM. 2016. Cellular aggregation in Chlamydomonas (Chlorophyceae) is chimaeric and depends on traits like cell size and motility. European Journal of Phycology 51:129-138. Schneider M, Lane L, Boutet E, Lieberherr D, Tognolli M, Bougueleret L, Bairoch A. 2009. The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program. J Proteomics 72:567-573. Schultheiss KP, Suga H, Ruiz-Trillo I, Miller WT. 2012. Lack of Csk-mediated negative regulation in a unicellular SRC kinase. Biochemistry 51:8267-8277. Sebe-Pedros A, Ballare C, Parra-Acero H, Chiva C, Tena JJ, Sabido E, Gomez- Skarmeta JL, Di Croce L, Ruiz-Trillo I. 2016. The Dynamic Regulatory Genome of Capsaspora and the Origin of Animal Multicellularity. Cell 165:1224-1237. Sharpe SC, Eme L, Brown MW, Roger AJ. 2015. Timing the Origins of Multicellular Eukaryotes Through Phylogenomics and Relaxed Molecular Clock Analysis. In: Trillo IR, Nedelcu AM, editors. Evolutionary Transitions to Multicellular Life. Netherlands: Springer. p. 3-29. Shiber A, Breuer W, Brandeis M, Ravid T. 2013. Ubiquitin conjugation triggers misfolded protein sequestration into quality control foci when Hsp70 chaperone levels are limiting. Mol Biol Cell 24:2076-2087. Smith DR, Hamaji T, Olson BJ, Durand PM, Ferris P, Michod RE, Featherston J, Nozaki H, Keeling PJ. 2013. Organelle genome complexity scales positively with organism size in volvocine green algae. Mol Biol Evol 30:793-797. Smith DR, Lee RW. 2009. The mitochondrial and plastid genomes of Volvox carteri: bloated molecules rich in repetitive DNA. BMC Genomics 10:132. Smith-Unna R, Boursnell C, Patro R, Hibberd JM, Kelly S. 2016. TransRate: reference-free quality assessment of de novo transcriptome assemblies. Genome Res 26:1134-1144. Stajich JE, Wilke SK, Ahren D, Au CH, Birren BW, Borodovsky M, Burns C, Canback B, Casselton LA, Cheng CK, et al. 2010. Insights into evolution of multicellular fungi from the assembled chromosomes of the mushroom Coprinopsis cinerea (Coprinus cinereus). Proc Natl Acad Sci U S A 107:11889-11894. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688- 2690. Stanke M, Steinkamp R, Waack S, Morgenstern B. 2004. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 32:W309-312. Starr RC. 1968. Cellular differentiation in Volvox. Proc Natl Acad Sci U S A 59:1082-1088. Strzalka W, Ziemienowicz A. 2011. Proliferating cell nuclear antigen (PCNA): a key factor in DNA replication and cell cycle regulation. Ann Bot 107:1127-1140. Sumper M, Hallmann A. 1998. Biochemistry of the extracellular matrix of Volvox. Int Rev Cytol 180:51-85. Teixeira LK, Reed SI. 2013. Ubiquitin ligases and cell cycle control. Annu Rev Biochem 82:387-414. Tu Y, Chen C, Pan J, Xu J, Zhou ZG, Wang CY. 2012. The Ubiquitin Proteasome Pathway (UPP) in the regulation of cell cycle control and DNA damage repair and its implication in tumorigenesis. Int J Clin Exp Pathol 5:726-738. Umen JG, Goodenough UW. 2001. Control of cell division by a retinoblastoma protein homolog in Chlamydomonas. Genes Dev 15:1652-1661. Vallentine P, Hung C, Xie J, van Hoewyk D. 2014. The ubiquitin-proteasome pathway protects Chlamydomonas reinhardtii against selenite toxicity, but is impaired as reactive oxygen species accumulate. AoB PLANTS 6. Vlachostergios PJ, Voutsadakis IA, Papandreou CN. 2012. The ubiquitin- proteasome system in glioma cell cycle control. Cell Div 7:18. von Kampen J, Nielander U, Wettern M. 1995. Expression of ubiquitin genes in Chlamydomonas reinhardtii: involvement in stress response and cell cycle. Planta 197:528-534. von Kampen J, Wettern M. 1991. Ubiquitin-encoding mRNA and mRNA recognized by genes encoding ubiquitin-conjugating enzymes are differentially expressed in division-synchronized cultures of Chlamydomonas reinhardtii. Eur J Cell Biochem 55:312-317. Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. 2017. GenomeScope: Fast reference-free genome profiling from short reads. Bioinformatics. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, et al. 2014. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9:e112963. Wences AH, Schatz MC. 2015. Metassembler: merging and optimizing de novo genome assemblies. Genome Biol 16:207. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586-1591. Zhang Z, Wood WI. 2003. A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics 19:307-308.

Tables Table 1: Summary Statistics of Volvocine Genomes Used in this Analysis Chlamydomonas Tetrabaena Gonium Volvox carteri reinhardtii socialis pectorale V1* Assembly 111.1MB 138MB 148.8MB 137.8MB Size Contig N50 90,6KB 7,4KB 16,2KB 43 981 Scaffold N50 6 617KB 146KB 1276KB 1 491KB Number of 88 5858 2373 1 265 Scaffolds GC Content 64.1 66.0 64.5 56.0 Number of 17 737 15 513 17 984 14 520 Coding Genes Average 279.17 635 349.83 496.67 Intron Length Genes with 10 889** 9 803 10 029 8 456 Pfam-A Domains *For this analysis Volvox carteri version 1 data were used (see supplementary text) ** Including alternative splice variants

Figures

Fig. 1. Traits relevant to the current investigation and phylogenetic relationships of genome-sequenced volvocines.

Fig. 2. Wagner’s symmetrical parsimony reconstruction of gene family history for whole genome sequenced chlorophytes. A) Gene families gained and lost. B) Gene familes expanded and contracted. Phylogeny based on 927 single orthologue protein families (100% bootstrap support for all branches).

100/100 CycA1 Volvox carteri 99/99 CYCA1 Chlamydomonas reinhardtii CycB1 Volvox carteri 99/89 100/100 CYCB1 Chlamydomonas reinhardtii CycAB1 Volvox carteri 100/100 CYCAB1 Chlamydomonas reinhardtii CYCD4 Chlamydomonas reinhardtii 96/97 CycD3 Volvox carteri 93/57 CYCD3 Gonium pectorale 100/99 99/89 CYCD3 Tetrabaena socialis CYCD3 Chlamydomonas reinhardtii 67/- CycD4 Volvox carteri CYCD4 Tetrabaena socialis CycD2 Volvox carteri 100/99 CYCD2 Gonium pectorale 94/- CYCD2 Tetrabaena socialis 94/68 CYCD2 Chlamydomonas reinhardtii CYCD1.2 Gonium pectorale CYCD1.1 Gonium pectorale 70/- CYCD1.4 Gonium pectorale 96/79 CYCD1.3 Gonium pectorale 98/74 CYCD5 Gonium pectorale 100/100 CYCD1.1 Tetrabaena socialis CYCD1.2 Tetrabaena socialis CYCD1.3 Tetrabaena socialis ML/NJ CYCD1 Chlamydomonas reinhardtii Bootstrap values CycD1.1 Volvox carteri 70/50 CycD1.2 Volvox carteri 0.5 74/96 CycD1.3 Volvox carteri 100/- CycD1.4 Volvox carteri

Fig. 3. Maximum likelihood and Neighbour-joining phylogeny of cyclins from Chlamydomonas reinhardtii (green), Tetrabaena socialis (magenta), Gonium pectorale (blue) and Volvox carteri (black). The phylogeny shows the expansion of CYCD1 from a single copy in Chlamydomonas reinhardtii to three copies in Tetrabaena socialis and further expansion to 4 copies in Gonium pectorale and Volvox carteri.

Fig. 4. Putative phosphorylation targets in volvocine retinoblastoma protein sequences. Putative phosphorylation targets were manually identified and identified using the PlantPhos server.

Fig. 5. The core aspects of ubiquitin proteasomal pathway (UPP) and components highlighted in the current study. E1 activators, E2 conjugators, E3 ligases and the chaperone HSP70, a 26S proteasomal regulatory subunit, 3 AAA-ATPase putative components of the proteasome and 2 transducins similar to Cullin-RING E3 ligase complex transducins showed evidence of positive/accelerating selection. Furthermore, a family of RING/U-box E3 type ligases, a family of RPM1 interacting proteins, and an FTSH AAA-type protease were gained at the origin of multicellularity. Together these data point towards modification to multiple aspects of the UPP in the multicellular volvocines relative Chlamydomonas reinhardtii.

Fig. 6. Phylogeny of fasciclin-domain containing volvocine proteins. Included in the phylogeny are proteins from Chlamydomonas reinhardtii (green), Tetrabaena socialis (magenta), Gonium pectorale (blue) and Volvox carteri (black). Highlighted in yellow is a branch identified by Count to have been gained at the origin of multicellularity. Highlighted in purple are Volvox algal-CAM proteins.

Supplementary Text

Table of Contents

Tetrabeana socialis

Methods

Culturing and nucleic acid extraction

Library preparation and sequencing

Genome Assembly

Allpaths-LG

SPAdes

Gapfilling, assembly merger and error correction

Assembly description and quality

Transcriptome sequencing and assembly

Volvox carteri annotations

Genome annotation

Functional annotation

Phylogenomics

Domain Analysis

Gene family expansions and contractions

Positive selection

Genome level evolution

Cell cycle genes

Cyclins

Cyclin-dependent kinases

Retinoblastoma homologue

Extracellular matrix genes and the cell wall Pherophorins

Matrix metalloprotease proteins

Cell wall proteins

RegA-related genes

Fasciclin-domain containing proteins

Prochnik table genes

Tetrabaena socialis

Tetrabaena socialis (Figure S1) is the “type-species” for the

Tetrabaenaceae family of volvocines. Other taxa from this family include the genus Basichlamys. The Tetrabaenaceae are the most deep-branching of multicellular volvocines and ancestors of this family gave rise to the

Goniaceae and the Volvocaceae. Tetrabaena socialis was initially classified as Gonium sociale (Nozaki 1986) and until recently an in-depth morphological description of the 4-celled T. socialis was lacking. Phylogenetic methods identified the Tetrabaenaceae as a separate family from the Goniaceae and the Volvocaceae (e.g. (Nozaki, et al. 2000)). Recently, Arakaki et al. (2013) provided an ultrastructural morphological description of T. socialis, which included an assessment of morphological traits associated with Kirk’s 12 steps (Kirk 2005) and identified at least 4 of Kirk’s steps present in the basal

T. socialis (Arakaki, et al. 2013). These are: “incomplete cytokinesis”, “rotation of basal bodies”, “transformation of cell walls into an ECM” and “genetic modulation of cell number”. Cytoplasmic bridges connecting T. socialis cells were confirmed using transition electron microscopy (TEM) and their formation within dividing T. socialis embryos were observed. Morphological and phylogenetic features place T. socialis as amongst the simplest of extant eukaryotic “integrated multicellular organisms” and key for understanding basal developments in the volvocine lineage (Arakaki, et al. 2013).

Methods

Culturing and nucleic acid extraction

An axenic culture of Tetrabaena socialis strain NIES-571 was generated through multiple rounds of isolating and transferring single cells onto sterile solid media (Pringsheim, 1946). The culturing conditions were the same as those followed by Arakaki et al. (2013) where an axenic culture of T. socialis was grown at 25°C in tris-acetate-phosphate medium (Gorman and

Levine 1965) with aeration. Cool-white fluorescent lamps were used for lighting with a 12:12 light-dark cycle. Total DNA was extracted using the standard Volvox CTAB extraction method (Miller, et al. 1993). RNA was isolated with Thermo Fisher (Waltham, MA, USA) TRI reagent from cultures that were two hours into the dark phase of the life cycle, when both dividing and non-dividing cells were observed.

Library preparation and sequencing

The Illumina TruSeq DNA sample preparation kit (v2 Illumina, San

Diego, CA, USA) was used to prepare paired-end (PE) libraries from total

DNA for Illumina sequencing. Two libraries with PE insert sizes of ~180bp and

~450bp were prepared. A ~6kb insert mate-pair (MP) library was prepared using the Illumina Nextera Mate-Pair library preparation kit and with automated pulse-field size-selection performed using the Sage Bioscience

Blue Pippen (Beverly, MA, USA, 0.75% agarose cassette). Preparation of total RNA for RNAseq was performed using the TruSeq Stranded mRNA library preparation kit (Illumina). Sequencing was variously performed on the

Illumina HiScanSQ (2x100bp), Illumina HiSeq 2500 (2x100bp and 2x25bp) or

Illumina MiSeq (2x300bp). The ~180bp and ~6kb MP insert libraries were sequenced using 2x100bp chemistry, RNAseq was performed using 2x125bp chemistry while the ~450bp library was sequenced using both 2x100bp and

2x300bp chemistries.

Genome assembly

Allpaths-LG

For Allpaths-LG assembly all data were trimmed only for adaptor presence with Trimmomatic (0.32) (Bolger, et al. 2014), as recommended by the Allpaths-LG developer (Gnerre, et al. 2011). In order to utilise MiSeq data for Allpaths-LG assembly a custom python script was used to extract overlapping reads that produced a merged fragment size of between 220-

270bp from MiSeq 2x300bp data. Thus, 2 libraries were used for Allpaths-LG

“fragment” data: a ~180bp PE insert library and MiSeq data that were overlapped using Flash (Magoc and Salzberg 2011), further size selected in silico (for a 220-270bp fragment size) insert size and were finally “hard trimmed” to 2x150bp. Allpaths-LG jump libraries were of three types: ~6kb insert MP data, 2x100bp PE data (~450bp insert size) and non-overlapping

MiSeq data (with an insert size greater than ~550bp). Allpaths-LG assembly was performed without modification with a ploidy of 1 defined as a parameter. For assembly Allpaths-LG calculated the read-separation distance of the 6kb library as 7,1kb, which may have resulted in inflated gap estimation.

SPAdes assembly

A contig-level SPAdes assembly (v3.1.1) (Bankevich, et al. 2012) was performed using only MiSeq 2x300bp data that were trimmed to remove adaptors and low quality sequence (quality filtering below Q20 and PCR- duplicate removal of 35bp) with Fastq-mcf

(https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMcf.md). The assembly pipeline was performed in “careful mode’ with error correction and with the multi-kmer assembly protocol optimised for MiSeq data (k of 21, 33,

55, 77, 99 and 127). Scaffolding using the ~6kp insert mate-pair data was performed after contig level assembly using Opera (v1) (Gao, et al. 2011) with bwa (v0.7.5a) mapping.

Gapfilling, assembly merger and error correction

Prior to assembly merger gap filling was performed for both assemblies with GapFiller (v1.11) (Boetzer and Pirovano 2012) with default paramaters (- m 30 -o 2 -r 0.7 -n 10 -d 50 -t 10 –i 10). To improve assembly contiguity and accuracy metassembler (v1) (Wences and Schatz 2015) was used to merge the two assemblies with the Allpaths-LG assembly as the “master” assembly and the SPAdes assembly as the “slave” assembly. After assembly merger 3 rounds of error correction and further gap filling were performed using Pilon

(Walker, et al. 2014).

Assembly description and quality

The final assembly was 66% GC, 135.9mb in length, half of the assembly size was made up of 263 scaffolds (the L50 value) and the largest scaffold was 697kb in length. The scaffold N50 was 145.9kb and the contig

N50 was 7.5kb. The C-value of T. socialis has not been established. In order to assess the genome size three kmer based approaches were used. Initially,

Allpaths-LG automatically calculated the estimated genome size and repeat content based on a 25mer profile. Allpaths-LG estimated the genome size to be 179,MB and 43% repetitive. These values are substantially larger than the assembly and the genomes of other volvocines. Therefore additional efforts were made to estimate the genome size using Jellyfish (Marcais and

Kingsford 2011) with GenomeScope (Vurture, et al. 2017) and with BBTools

(http://jgi.doe.gov/data-and-tools/bbtools/). Jellyfish was used to count kmers of various lengths (K=19, 21, 25, 33, 45, 55 and 77). BBTools was tested at a

K=31. Jellyfish with GenomeScope kmer analysis indicated that the genome size ranged from 101MB (K=19) to 118MB (K=77) while repeat length ranged from 20.7MB to 28.0MB (Supplementary table 1 and Figure S2). BBTools estimated the genome size to be 114,9MB with a 10% repeat content. These data are perhaps more in line with expectation than values derived by

Allpaths-LG and it is most likely that Allpaths-LG overestimated the genome size and may have over-predicted the distance between gaps. Improvements in the C. reinhardtii and V. carteri assemblies have resulted in the total genome size being reduced once a higher percentage of repeats have been filled or more greater scaffolding resolution has been performed. The same is likely to occur for T. socialis. On the contrary, initial C-value estimates for V. carteri were 180MB (Kirk and Harper 1986) while the v2 genome assembly is only 130MB and therefore another possibility is that some volvocine genomes are replete with large stretches of highly-repetitive heterochromatic regions that are not amenable to typical sequencing approaches. The repeat content of the genome as well as the high GC content were found to affect assembly contiguity and the assembly contains 27,8% of ambiguous sequence content

(denoted by N’s), which is greater than that of G. pectorale (20,9%).

Transcriptome sequencing and assembly

Approximately 75million paired-reads (2x125bp) were used to produce a de novo transcriptome for gene modelling. Quality and adaptor-content trimming was performed using Trimmomatic. Quality trimming was performed to exclude terminal sequence data that fell below a minimum quality of 5 and reads shorter than 36bp were removed. Bridger de novo (Chang, et al. 2015) was used for transcriptome assembly using a stranded library type and a kmer of 25. A total of 27 172 transcripts were produced with a N50 of 670bp.

TransRate (v1.0) (Smith-Unna, et al. 2016) was used to evaluate the quality of assembly as well as to filter transcripts with evidence of errors (4 153 transcripts were removed). The TransRate optimal assembly score was

0.362. Transcripts were aligned to the T. socialis genome and 92% (n=21

823) of transcripts aligned. All transcripts aligned with greater than 82% percentage identity and 90% mapped with > 95% percentage identity.

Volvox carteri annotations For all analyses, except for where proteins have been manually annotated or curated, v1 V. carteri proteins were used. This was due to previously reported inconsistencies regarding the domain content of the v2 predicted proteome (Hanschen, et al. 2016). For example, from version 2 proteins only 7795 Pfam-A domains could be identified whereas 13 160 were identified in version 1 (Hanschen, et al. 2016). A newer version of V. carteri annotations is now available (v2.1) at JGI-DOE Phytozome

(http://www.phytozome.net/), however, we considered it prudent to continue with v1 annotations for the purposes of comparing these data with previous publications (Hanschen, et al. 2016).

Genome annotation

Automated gene model annotations for T. socialis were generated using the MAKER2 pipeline (v2.31.3) (Holt and Yandell 2011). Evidence for modelling included the de novo transcriptome assembly, proteome data from

C. reinhardtii (v5.5) (Merchant, et al. 2007; Blaby, et al. 2014), G. pectorale

(Hanschen, et al. 2016) and V. carteri (v1) (Prochnik, et al. 2010) and the curated collection of UNIPROT-SWISSPROT (Boutet, et al. 2016) proteins

(accessed 25 May 2015). The T. socialis genome was masked for simple repeats and C. reinhardtii RepBase repeats selected. All repeats were soft masked, which allowed annotations to extend into repetitive regions but masked regions could not seed annotations. Soft masking was employed to mitigate excessive masking of genic region caused by GC bias, which has been reported in C. reinhardtii (Blaby, et al. 2014). Within the MAKER2 pipeline three gene finders were used: SNAP (v2013-11-29) (Korf 2004), GeneMark (v4.10) (Besemer and Borodovsky 2005) and Augustus (v3.02)

(Stanke, et al. 2006). For Augustus (v3.02) and GeneMark refined C. reinhardtii models were utilised without further training, whereas a T. socialis specific model was generated for SNAP through two rounds of training. The

MAKER pipeline for “gene rescue” was followed (Campbell, et al. 2014) such that models derived exclusively by ab initio predictions (gene models with annotation edit distance (AED) score of 1) were retained if they possessed a

Pfam-A domain (Finn, et al. 2014). Pfam-A domains were identified using

InterProScan 5 (Jones, et al. 2014). Single-exon gene models were retained if

>250bp and a maximum intron size of 10kb was selected. A total of 33 030 unfiltered (for AED or domains) gene models, were generated by MAKER2, of which, 10 734 (32,5%) had AED scores <1. A total of 15 513 gene models were retained with an AED less<1 or with a domain. A total of 93.8% of KOG domains (n=458) were identified within the MAKER2 derived T. socialis proteins (HMMER E-value threshold <1x10-5), which is less than but still comparable to the KOG content of C. reinhardtii (v5.5- 98.7%) and V. carteri

(v1- 96.3%). Complete and manually curated C. reinhardtii KOG orthologues

(n=406) were obtained from the CEGMA website

(http://korflab.ucdavis.edu/datasets/cegma/). Tetrabaena socialis proteins

(from MAKER2 annotations) were aligned to C. reinhardtii KOG proteins using

BLASTP (Altschul, et al. 1997) and subsequently sorted to retain only the highest scoring matching T. socialis homologues. A total of 382 proteins

(94%) were identified with significant homology (E=1x10-10) to C. reinhardtii

KOG proteins, of which, a total of 367 unique T. socialis proteins (90%) were identified. While 393 (96.7%) C. reinhardtii KOG proteins aligned to the T. socialis genome (E-value=1x10-10).

Functional annotation

Putative protein functions were assigned by using BLASTP to query T. socialis proteins against the UNIPROT-TREMBL database with an (E=10-10).

Pfam-A domains and GO terms were assigned using InterProScan (Jones, et al. 2014) while HMMER3 (Eddy 2011) was used for later analysis of domain content. InterProScan identified 8072 GO terms, 9803 proteins were found to contain PFAM-A domains and 14 041 total domains were identified of which

2998 were unique. BLASTP queries to the C. reinhardtii, G. pectorale and V. carteri proteomes were used for further analysis of specific genes and to inspect the coding content of T. socialis in relation to other volvocines. The outcome of homology searches against other volvocine proteomes as well as against the UNIPROT databases TREMBL and SWISSPROT are summarised in Supplementary Table 3. OrthoMCL (v2.09) (Li, et al. 2003) was used to generate family clusters in 2 separate analyses: firstly, including only volvocine proteomes and, secondly, including the proteomes of available genome-sequenced chlorophytes (n=13) from sources summarised in

Supplementary Table 2. For both analyses a markov clustering (MCL) inflation value of 1.5 and an E-value of 1x10-5 were used as OrthoMCL parameters. A

Venn diagram of volvocine-specific family clusters was generated and visualized using OrthoVENN (Wang, et al. 2015) (Figure S3). The Venn diagram showed fewer orthologs for T. socialis. However, reciprocal blast orthologues identified by OrthoMCL to the reference-quality C. reinhardtii proteome showed that the number of T. socialis orthologs was similar to that of G. pectorale and V. carteri (Supplementary Table 4). Results from functional annotations do suggest that some gene content is missing from the

T. socialis genome and/or annotations, although automated gene modelling most likely captured >90% of gene content.

Phylogenomics

A phylogenomic “supermatrix” protein phylogeny (Delsuc, et al. 2005) approach was followed to generate a phylogeny of whole-genome sequenced chlorophytes for down-stream analyses. A total of 927 “singleton” (clusters that contained 1 protein per species) were identified from OrthoMCL clusters of 13 genome-sequenced chlorophytes (see Functional Annotation, fig. 2).

Proteins from each family were separately aligned using MUSCLE (3.8.31)

(Edgar 2004), alignments were trimmed with trimal (1.3) (Capella-Gutierrez, et al. 2009) and concatenated using fastconcat (Kuck and Meusemann 2010).

After testing for the most appropriate phylogenetic model of amino acid substitution with ProtTest (Abascal, et al. 2005), RAxML (8.2.0) (Stamatakis

2014) was used to generate a phylogeny with 100 bootstraps and the model parameters of “gamma”, “invariable sites”, “WAG” and “empirical frequency estimates”. Bootstrap support of 100% was obtained for all nodes. The phylogeny is in agreement with previous phylogenies, which have variously been produced from whole proteome (Hanschen et al, 2016) or from phylogenetically informative genes (e.g. ITS2 (Buchheim, et al. 2011)).

Domain analysis

The development of novel domain architecture has been identified as a feature of the evolution of plant and animal multicellularity (Putnam, et al.

2007; Srivastava, et al. 2008; Srivastava, et al. 2010). Previous analyses that have compared the proteomes of V. carteri and G. pectorale with that of C. reinhardtii have not identified the same degree of domain novelty associated with the evolution of multicellularity in animals and plants (Prochnik, et al.

2010; Hanschen, et al. 2016). Here, domains identified in the proteomes of T. socialis, G. pectorale and V. carteri were compared to those of C. reinhardtii.

Other unicellular chlorophytes were not directly compared with those of the multicellular taxa for enrichment analysis because we were interested in exploring lineage specific differences. Pfam-A domains (Pfam-A 27.0) were identified in volvocine proteomes using HMMER (v3.1, E-value=1x10-5) (Eddy

2011). Domain content between the proteomes of the multicellular volvocines with that of C. reinhardtii proteome was tested for enrichment using a chi- squared distribution test (p=0.05). A total of 61 domains were significantly different at an alpha of 0.05, of which only 18 were enriched in the multicellular volvocines. At an alpha of 0.0001 only 16 domains were significantly different and the inclusion of the T. socialis proteome here is in agreement with previous findings, which found that increased domain complexity does not correlate with the evolution of multicellularity. A further 30 domains were identified as absent in C. reinhardtii but present in T. socialis,

G. pectorale and V. carteri (multicellular domains). The phylogenetic origin of enriched domains as well as multicellular domains was explored. These

“domains of interest” were combined into a list of 91 domains (Supplementary Table 5). As before, HMMER was used to identify Pfam-A domains in a broader collection of eukaryotic taxa (n=26) (Supplementary Table 2 and

Figure S4), including: all whole-genome sequenced chlorophytes, transcriptome derived proteomes of taxa from the Chlamydomonadales order, model animals, model plants and yeast (Supplementary Table 2 and Figure

S4). In order to assess if domains of interest were lineage specific or were common to most eukaryotes, the domains were categorised for shared presence with other major eukaryotic phyla. A total of 93 domains were identified that are unique to the volvocine lineage (C. reinhardtii, T. socialis, G. pectorale and V. carteri). Only 1 domain of interest (PF08707) was found amongst the 93-volvocine specific domains, showing that the domains of interest were not enriched in volvocine specific domains. Subsequently, domains identified in all taxa were classified in a three-way Venn diagram that compared domain content between (i) chlorophyta, (ii) plants and (iii) opisthokonta (here - metazoa and yeast) (Figure S5). Finally, domains of interest were classified according to/or compared with the results of the three- way Venn diagram, which revealed that the majority of significant domains

(n=59/91) are common to all examined phyla. Furthermore, besides 1 domain, all domains absent from C. reinhardtii but present in T. socialis, G. pectorale and V. carteri are present in other chloropytes, plants and opisthokonta, which suggests that the absence of these domains in C. reinhardtii may be due to the loss, either real or artefactual.

The 5 domains reported by Prochnik et al (2010) as differentially present between C. reinhardtii and V. carteri were examined in T. socialis.

Counts for C. reinhardtii, G. pectorale and V. carteri were obtained from Hanschen et al (2016) (Supplementary Table 6). The domain content of T. socialis was relatively similar to those of C. reinhardtii and G. pectorale reported by Hanschen et al (2016). The outlier in terms of organisms for these

5 domains remains V. carteri, which shows reduced content of histones

(PF00125) and the ankyrin repeat (PF00023) as well as expansion of the gametolysin domain (PF05548), which is found in extracellular matrix proteins. Relative to C. reinhardtii leucine-rich repeats are reduced in G. pectorale and V. carteri but not in T. socialis. Other ankyrin domains were examined and all showed reduction in V. carteri (Supplementary Table 7).

Gene family expansions and contractions

The expansion and contraction of gene families as well as the phylogenetic origin of new gene families was identified using the Count software package (v10.04) (Csuros 2010). Proteome data from 13 chlorophyte taxa were obtained from sources listed in Supplementary Table 2.

Numerical profiles of OrthoMCL (v2.09) clusters (n=24 949) generated for phylogenetic analysis were used together with the phylogeny generated from

927 singleton families as input for Count. Bias in the rate of gene loss or gain was assessed for branches and significant variations were not identified.

Dollo’s and Wagner’s parsimony (symmetric and asymmetric) were evaluated as alternative parsimony implementations (Supplementary Table 8).

Asymmetric parsimony analysis was performed with a gain cost of 2.

Symmetric Wagner’s parsimony yielded the most conservative estimate of gains tracing to the origin the multicellular volvocines and because this node was of particular interest a conservative collection of gene family gains was selected for further analysis. At this node a balance between gains and losses was obtained with symmetric Wagner’s parsimony while asymmetrica

Wagner’s parsimony and Dollo’s parsimony favoured gains. Technical differences between alternative parsimony methods as well as technical differences in clustering of gene families may affect the outcome of reconstruction efforts (Demuth and Hahn 2009). Here, we chose to focus on the most conservative results for genes gained at the origin of multicellularity

(fig. 2). Families gained in the branch leading to the colonial/multicellular volvocines (ORIMC families) were annotated using a combination of C. reinhardtii homology, V. carteri annotations, Argot2

(http://www.medcomp.medicina.unipd.it/Argot2/) (Falda, et al. 2012) and

InterProScan (v5) (Jones, et al. 2014). Where possible a consensus annotation was assigned. These are listed in Supplementary Table 9 and described in the main text.

Families gained at the Gonium/Volvox branch were examined and filtered to include only families present in both G. pectorale and V. carteri.

Amongst 1248 families gained at the Gonium/Volvox branch 274 families were gained in both G. pectorale and V. carteri the remaining families were species specific. These 274 families include protein kinases, ankyrin-repeat containing proteins, AAA-domain containing proteins (with domains: AAA,

AAA_2, AAA_5, AAA_11, AAA_15, AAA_16, AAA_17, AAA_18, AAA_19,

AAA21, AAA_22, AAA_23, AAA_29, AAA_30, AAA_33), EF-hand domain containing proteins, a family of dynein heavy chain related proteins, regulator of chromatin condensation domain proteins as well as proteins with possible

DNA interacting domains such as Myb DNA-binding, zinc finger (zf)-MYND, zf-C3HC4 and zf-MYND. Families of particular interest include a family of proteins with anaphase-promoting complex subunit 3 domains (the V. carteri protein is annotated as a kinesin), a family of histone H3-11 proteins and the

Kif4A-type kinesin protein. Furthermore, two families of Volvox metalloproteases and a family of pherophorin proteins were gained at this node.

Families expanded at the origin of multicellularity were identified from

Wagner’s asymmetrical parsimony outputs (fig. 2, Supplementary Table 10).

For expanded families a further filter was applied to ensure that only families with at least one protein member present in C. reinhardtii, V. carteri, G. pectorale and T. socialis were considered and furthermore expanded families were only considered if expanded in all colonial/multicellular taxa relative to C. reinhardtii. In addition, families expanded in G. pectorale and V. carteri relative to C. reinhardtii and T. socialis were examined. The annotation of expanded families was based on C. reinhardtii annotations. Only 27 families demonstrated expansion relative to C. reinhardtii while a further 53 families were expanded in G. pectorale and V. carteri relative to T. socialis and C. reinhardtii. Families contracted at the ORIMC numbered 8.

Positive selection

Family clusters generated with OrthoMCL were tested for evidence of positive selection using the POTION pipeline (Hongo, et al. 2015). OrthoMCL was repeated with the same methodology as described under the heading

“Phylogenetics”, except that only the proteomes of C. reinhardtii, T. socialis,

G. pectorale and V. carteri were included (12 614 families). OrthoMCL family clusters were filtered within the Potion package on the basis of a number of stipulated criteria. Orthologs groups/families that contained paralogues were removed and thus analysis was restricted to groups that consisted of only 1 protein per taxa (singleton families). Although a more permissive approach that included paralogues would have identified more families with evidence of positive selection the retention of only singleton families was performed in order to reduce errors associated with assigning orthologues from amongst families containing paralogues. Groups with a protein/s without valid stop codons, groups that contained proteins indivisible by 3 and groups that contained non-ACGT nucleotides (in CDS sequences) were removed. A minimum group identity of 70% was applied.

PhiPack (Bruen, et al. 2006) was used to identify families with significant evidence (q=0.05) of gene recombination and these families were removed.

After filtering 3 610 families were retained. Multiple sequence alignments of proteins were generated with MUSCLE (v3.8.31) for each ortholog family, which were used to inform codon alignments of CDS sequences. Protein- informed CDS alignments were generated, trimmed (using trimal) (Capella-

Gutierrez, et al. 2009) and used to create a phylogeny (using Phylip (Retief

2000)) for each orthologue group. Chlamydomonas reinhardtii was selected as the “reference” organism, which produced rooted trees for each family.

Potion utilized the PAML (Yang 2007) package with statistical testing and false-discovery rate correction with codeml site model evaluations of positive selection (Supplementary Table 11). Enrichment analyses of Gene ontology

(GO), KEGG pathways and MapMan for families undergoing positive selection was performed via the Algal Functional Annotation portal (http://pathways.mcdb.ucla.edu/algal/) using C. reinhardtii annotations (Lopez, et al. 2011). GO enrichment analysis identified 82 enriched GO terms

(Supplementary Table 12).

Genome-level evolution

In order to extend the analysis of selection beyond orthologue protein families signatures of selection were explored from whole-genome alignments using the PHAST software suite (Hubisz, et al. 2011). PHAST was used to identify genomic regions that were highly conserved as well those under accelerating selection from both coding and non-coding regions of the genome. Pair-wise alignments for the genome sequences of C. reinhardtii

(v5), T. socialis, G. pectorale and V. carteri (v1) were performed using LAST

(Kielbasa, et al. 2011). Simple repeat regions were masked (cR01 command option). Using the single coverage functionality of threaded-blockset aligner

(TBA) (Blanchette, et al. 2004) duplicate alignments were filtered such that a genomic region from one organism did not align to multiple regions in the other organism. Threaded-blockset aligner with multiz was used to create a multiple sequence alignment (MSA) in maf-format from pair-wise alignments

(Blanchette, et al. 2004). The MSA was subsequently mapped to C. reinhardtii

- thereby only retaining alignments that included a C. reinhardtii sequence.

To provide a guide tree for PhastCONS a simple phylogeny of C. reinhardtii, T. socialis, G. pectorale and V. carteri was generated with PhyML

(Guindon, et al. 2009) and 7 mitochondrial genes (cob, cox1, nad1, nad2, nad4, nad5 and nad6). Chlamydomonas reinhardtii annotations (v5.5.) in GFF format were used to define coding and non-coding regions of genome alignments. PhastConS (Siepel and Haussler 2004; Hubisz, et al. 2011) was used to generate a neutral model of substitution rate for non-coding regions while a model of substitution rates from conserved regions was generated from first codon positions using PhyloFit (part of the PHAST package). Using both models PhastCons was used to identify “highly-conserved” regions. The

PhastCONS parameters of “expected length of conserved regions” (gamma) and “target coverage” (omega) were first optimised to minimize the PIT value

(PIT- Phylogenetic Information Threshold, entropy of the phylogenetic model multiplied by the expected minimum number of conserved sites required to predict a conserved element), while increasing the coverage of conserved

CDS sequences. The consEntropy program of PHAST was used to calculate

PIT values at varying levels of omega and gamma. High-conserved sequences for each permutation of omega and gamma were intersected with

CDS annotations from C. reinhardtii chromsome 1 using bedtools (v2-2.20.1)

(Quinlan and Hall 2010; Quinlan 2014) intersect and the percentage of conserved CDS bases was calculated. The most successful parameters were omega of 12 and gamma of 25 (PIT of 11 and CDS coverage

~2.5%)(Supplementary Table 22). Conserved regions were subsequently categorised into annotation features using bedtools (Supplementary Table

13). Highly conserved intergenic regions were examined for proximity to coding regions using bedtools “nearest” (Supplementary Table 14).

Conserved non-coding elements nearby to the aforementioned genes may have been conserved for >200 MYA, which suggests a regulatory role for these elements. In order to identify accelerating selection the phyloP tool of the PHAST package was used (Hubisz, et al. 2011). Annotation features from C. reinhardtii v5.5 were divided per chromosome (PHAST algorithms examined one chromosome/scaffold at a time). Separate C. reinhardtii annotation features were tested for conservation and acceleration using phyloP in

CONservationACCeleration mode (CONACC). The features “genes”, “CDS”,

“3’UTR” and “5’UTR” were used in separate analyses. The log ratio test (LRT) in CONACC mode was used to identify features under accelerating selection within the multicellular lineage (the sub-tree or branch that included T. socialis, G. pectorale and V. carteri). Benjamini-Hochberg False-Discovery

Rate (FDR) correction was used to account for multiple testing (at an alpha of

0.05) (Hochberg and Benjamini 1990). Furthermore, regions accelerating relative to C. reinhardtii were tested for overall evidence of conservation by repeating the phyloP analysis without the “sub-tree” test. Regions accelerating relative to C. reinhardtii and that did not demonstrate evidence of conservation in the “whole tree” analysis were identified and may be considered to be more rapidly evolving (faster than the neutral rate of evolution) than features that are only accelerating relative to C. reinhardtii.

Results with and without this filtering step were inspected. A total of 129

(Supplementary Table 15) whole genes were identified as experiencing accelerating selection relative to C. reinhardtii (FDR corrected q-value <0.05), however, 105 were evolving more slowly than the neutral rate of evolution

(leaving 24 evolving faster than the neutral rate). Annotations were derived from C. reinhardtii v5.5 annotations. Whole genes and CDS regions undergoing acceleration were tested for enrichment of Gene ontology (GO) terms, KEGG pathways and MapMan annotation via the Algal Functional

Annotation website (http://pathways.mcdb.ucla.edu/algal/) using C. reinhardtii annotations (Lopez, et al. 2011)(Supplementary Table 16). A total of 42 CDS regions were identified (associated with 38 genes) as accelerating of which 16 were accelerating faster than the neutral rate (Supplementary Table 17). Of the 42 accelerating CDS, 6 CDS overlap with gene-level analysis.

Corroborating whole gene analysis and indicating that the coding regions of genes are under acceleration are the ubiquitin activating enzyme 2, ubiquitin ligase 1, Spc97/Spc 98 spindle body protein, tubulin beta chain 2, a transcription regulator and a histone superfamily protein. GO enrichment analysis identified the cellular compartment terms “microtubule organising centre”, “SWI/SNF complex” and “SWI/SNF-type complex” and the molecular function term “ubiquitin activating enzyme activity” as enriched. Enriched

MapMan terms and KEGG pathways respectively included “signalling phosphinositides” and “Cell adhesion molecules”.

The Potion pipeline was repeated to identify if accelerating whole- genes and CDS sequences showed evidence of positive selection. Altered parameters than what were used for the proteome-wide scan of positive selection were applied. OrthoMCL families that housed the accelerating gene or CDS were identified and further analysis with potion was limited to these families. Paralogues were included in this analysis and the minimum sequence identity, minimum group identity and the relative minimum and maximum sequence size were set to 10%, 20%, 0,1 and 3 respectively – thereby retaining more homologous clusters/families for evaluation than used for global analyses of positive selection in orthologous protein families. To identify acceleration in volvocine intergenic regions a number of approaches were assessed on how best to divide intergenic regions into smaller coordinates, thereby balancing statistical constraints. These approaches included: (i) dividing all C. reinhardtii intergenic regions into

100bp windows, (ii) applying phyloP to “highly-conserved elements” identified by PhastCONS and (iii) selecting flanking regions upstream and downstream of coding loci. However, the shortage of alignments stemming from intergenic regions hampered attempts to analyse intergenic regions. Ultimately, the lack of alignable regions from intergenic coordinates was used as the basis for creating coordinates for intergenic regions by simply extracting all aligned regions from the MSA that derived from C. reinhardtii intergenic regions (with bedtools). These coordinates were then directly inputted into phyloP and tested for evidence of acceleration. However, no intergenic regions were identified that were significant after Benjamini-Hochberg FDR correction was applied. The same was found for 3’UTRs and 5’UTRs.

Cell cycle genes

Cyclins

The observation of cyclin D1 expansion in G. pectorale and V. carteri relative to C. reinhardtii is important for understanding modifications to the cell cycle in the multicellular taxa because otherwise the repertoire of core cell cycle genes such as cyclins and cyclin-dependent kinases is similar between the multicellular taxa and C. reinhardtii. Tetrabaena socialis cyclins

(Supplementary Table 19, fig. 3) were explored to identify additional CYCD1 genes. OrthoMCL, BLAST, InterProScan, HMMER3 and protein phylogenies were used to identify and characterize cyclins (Bisova et al., 2005; Hanschen et al., 2016; Olson et al., 2010; Prochnik et al., 2010). Cyclin protein sequences from C. reinhardtii, G. pectorale and V. carteri were used as queries against the genome of T. socialis with TBLASTN and as queries against the T. socialis predicted proteins using BLASTP (Evalue=10-5). In addition to blast-based homology searches, HMMER3.1 was used to search for un-annotated cyclin D1 genes. A nucleotide sequence alignment of C. reinhardtii, G. pectorale and V. carteri cyclin D1 CDS sequences was generated with MAFFT (v7.045), which was used to create a cyclin D1 HMM that was subsequently scanned against the T. socialis genome with

HMMER3.1.

Duplicates of CYCD1 (3 copies) were identified from homology searches and were used to generate primer sequences (Supplementary Table

18). In order to produce complete sequences cDNA sequencing and rapid amplification of cDNA ends (RACE) with sequencing was performed on an

ABI sequencer (Thermo Fisher Scientific). Reverse transcribed (Superscript III reverse transcriptase Thermo Fisher Scientific) cDNA was used for PCR as follows: 94°C for 1 minute, 35 cycles of 94°C for 30 seconds, 60°C for 30 seconds, and 72°C for 2 minutes, followed by 72°C for 5 minutes with

TaKaRa LA-Taq with GC Buffer (Takara Bio Inc., Otsu, Japan). Amplicons were purified with the illustra GFX PCR DNA and Gel Band Purification Kit

(GE Healthcare, Buckinghamshire, UK), and additionally some samples were

TA subcloned with TOPO TA Cloning Kit for sequencing (Thermo Fisher

Scientific) according to the manufacture’s protocol. Sequencing was performed directly or after TA cloning on an ABI PRISM 3100 Genetic Analyzer (Thermo Fisher Scientific) using the BigDye Terminator cycle sequencing ready reaction kit, v.3.1 (Thermo Fisher Scientific). The 5’ and 3’ ends of cyclin D T. socialis genes were determined by RACE using the

GeneRacer kit (Thermo Fisher Scientific) according to the manufacturer’s protocol. RACE products were amplified with KOD FX Neo DNA polymerase

(TOYOBO, Osaka, Japan) and PCR was carried out as follows: 94°C for 2 minutes, 5 cycles of 98°C for 10 seconds and 74°C for 1.5 minutes, 5 cycles of 98°C for 10 seconds and 72°C for 1.5 minutes, 5 cycles of 98°C for 10 seconds and 70°C for 1.5 minutes, 15 cycles of 98°C for 10 seconds and

68°C for 1.5 minutes, and followed by 68°C for 7 minutes. Amplified DNA were purified and sequenced as described above. The amino acid sequences of CYC family from C. reinhardtii, G. pectorale and V. carteri were aligned with newly determined CYC proteins of T. socialis with MAFFT (v7) and low- alignment regions were manually trimmed. Phylogenetic analyses of the CYC family was performed using the maximum-likelihood (ML) and the neighbour joining (NJ) method with PhyML (Guindon, et al. 2009) and MEGA 5.2.2

(Tamura, et al. 2011) with WAG+I+G+F model selected by ProtTest (Abascal, et al. 2005). The bootstrap analyses were performed with 1000 replicates.

The sequences of CYCA1, CYCB1, and CYCAB1 were used as outgroup proteins based on previous CYC phylogenetic analyses (Prochnik, et al. 2010;

Hanschen, et al. 2016). Based on the present phylogenetic analyses, volvocine CYCD proteins (CYCD1-4) formed a robust monophyletic group with 99% and 89% bootstrap values in ML and NJ analysis, respectively (fig.

3). Within the CYCD clade, CYCD1 of C. reinhardtii, T. socialis, G. pectorale, and V. carteri formed a monophyletic group though it was supported with low bootstrap values (49% and 18% bootstrap values in ML and NJ analysis, respectively). Although supported with low bootstrap values, expanded

CYCD1s in each volvocine species formed a clade (fig. 3). These results were consistent with those of Hanschen et al. (2016). As described above and in the main text 3 CYCD1 genes were identified in T. socialis (at least 2 found as tandem repeats – Figure S6), which indicates that CYCD1 expansion traces to the beginning of multicellularity in the volvocines and that a further duplication event occurred after the Gonium/Volvox split from Tetrabaena.

Together with interactions with the retinoblastoma protein CYCD1 expansion correlates with a modified program of cell division in the multicellular volvocines. Although here we show that the initial expansion traces to the origin of multicellularity there remain a number of unresolved questions regarding CYCD1 expansion. A number of factors related to CYCD1 expansion require further consideration. As yet the functional consequences of this expansion have not been demonstrated and it remains unknown if the expansion is a form of functional divergence (e.g. subfunctionalization or neofunctionalization) or if the expansion is related to dosage. Previously, it was found that when compared to other cell cycle genes that D-type cyclins show elevated rates of non-synonymous over synonymous substitutions

(dN/dS) (Hanschen, et al. 2016). However, median dN/dS values were below

1, which is typically indicative of purifying or near-neutral selection and elevated rates relative to other cyclins may reflect a relaxation of purifying selection rather than positive selection (Yang and Bielawski 2000). The phylogeny of cyclins presented in this work (and the previous analysis by

Hanschen et al. (2016)) shows that D1-type cyclins cluster by species rather than by subtype, which might suggest that D1-type cyclin paralogues have not functionally diverged in a form that is conserved throughout the lineage. Better statistical support for a volvocine CYCD protein phylogeny is required, which we were not able to generate. A dosage effect may better explain the second duplication event leading to 4 copies rather than functional divergence because 3 copies is sufficient for enabling a multicellular program of division in T. socialis. Another aspect to consider relates to the retinoblastoma-like

(RB/MAT3) gene, which alone appears able to alter the program of division in

C. reinhardtii to activate a multicellular program even though only 1 copy of

CYCD1 is present in C. reinhardtii (Hanschen, et al. 2016).

Cyclin-dependent kinases

To identify putative T. socialis cyclin dependant kinases (CDK) and rhodanese/cell-cycle control protein orthologues, annotated proteins from C. reinhardtii, G. pectorale and V. carteri were searched against the T. socialis predicted proteome and genome using BLASTP and TBLASTN (E-value=10-

5). ProtTest was used to identify the most suitable model for phylogenetic analysis. A phylogeny was generated using RAxML with the WAG model, gamma distribution and 1000 bootstrap iterations. In addition, orthologs were confirmed by inspecting OrthoMCL families. As per previous analyses

(Prochnik, et al. 2010; Hanschen, et al. 2016) CDKs do not display extensive differences across the volvocine lineage (Figure S7).

Retinoblastoma homologue (MAT3/RB)

Retinoblastoma protein sequences from C. reinhardtii, G. pectorale and V. carteri were aligned to the T. socialis proteome using BLASTP and a computationally predicted orthologue was identified. Following the same methodology as for cyclins the model of the T. socialis MAT3/RB gene was determined through additional capillary sequencing with primers designed from the genome assembly (Supplementary Table 23). The putative binding partners of MAT3/RB – Elongation factor 2 (E2F) and dimerization protein

(DP) - were also identified in T. socialis (Supplementary Table 24).

Chlamydomonas reinhardtii, G. pectorale and V. carteri MAT3/RB proteins sequences as well as additional MAT3/RB protein sequences from Hiraide et al. (2015) were used to inspect features of MAT3/RB. Using MAFFT (v7) a protein alignment was generated from available MAT3/RB protein sequences

(Figure S8): C. reinhardtii (plus and minus strains), T. socialis (homothallic),

Yamagishiella unicocca (plus), Eudorina spp (female), Gonium pectorale

(plus), Pleodorina starrii (female strains), V. carteri f nagariensis (male and female strains). The RB/MAT3 sequences were aligned with newly determined TsRB/MAT3, based on the alignment of Hanschen et al. (2016).

The alignment of MAT3/RB proteins was inspected for the length of the region between the RB-A and RB-B domains as well as for putative serine/threonine kinase phosphorylation targets within the region linking the RB-B domain to the C-terminal domain (Hanschen et al., 2016). The alignment showed that all colonial/multicellular taxa examined possess a shortened linker region between the RB-A and RB-B domains (Supplementary Table 20). In addition to the linker length between the RB-A and RB-B domains the linker region between RB-B and the carboxy (C)-terminal domain was previously found to be devoid of putative cyclin dependent kinase phosphorylation targets in G. pectorale and V. carteri (male and female) whereas two such sites were identified in C. reinhardtii (Hanschen, et al. 2016). Within the linker region between the RB-B and C domains typical CDK phosphorylation targets (with the (S/T)PX(K/R) motif) were identified manually and, in addition, the

PlantPhos server (http://csb.yzu.edu.tw/PlantPhos) was used to identify additional targets (Lee, et al. 2011). The first C. reinhardtii MAT3 target within the RB-B/C terminal identified by Hanschen et al. (2016) possesses an SPRG motif, which is atypical (or lineage-specific), and the second target contains a more typical SPTK motif. Atypical targets and targets predicted by PlantPhos for phosphorylation were identified within the RB-B to C-terminal domain linker-region of MAT3/RB in all multicellular volvocines (fig 4). Typical phosphorylation targets were present in T. socialis, G. pectorale, V. africanus and V. carteri f. nagariensis (female - TPGK). The remaining putative phosphorylation targets were atypical (where the 4th amino acid position is not a K/R).

Phylogenetic analyses of MAT3/RB were performed using the ML and the NJ method with PhyML program and MEGA 5.2.2 program, respectively, with JTT+G+F model, selected by ProtTest optimised using

Akaike information criteria. As per Hiraide et al. (2010) the MAT3/RB proteins of Chlamydomonas globosa and C. reinhardtii were included as outgroups in a new MAFFT alignment. The bootstrap analyses were performed 1000 replicates. Broadly, phylogenetic relationships were in accordance with that of

Hiraide et al. (2010)(Figure S9). MAT3/RB gene structures for C. reinhardtii, T. socialis, G. pectorale and both sexes of V. carteri were generated and inspected using WebScipio

(Odronitz, et al. 2008) and Genepainter (Muhlhausen, et al. 2015). WebScipio employed blat (Kent 2002) to align proteins to genomic sequences and subsequently created gene structures in YAML file format. A protein multiple sequence alignment was generated using MAFFT (v7.045), which together with YAML gene structure files were submitted online to GenePainter

(http://www.motorprotein.de/genepainter/) to map aligned MAT3/RB proteins to gene structures. The aligned gene structure identified 3 regions where introns are present in T. socialis, G. pectorale and V. carteri but absent in C. reinhardtii (Figure S10)

Extracellular matrix genes and the cell wall

The biochemical and molecular characteristics of volvocine extracellular matrices (ECM) and cell walls have been investigated (Goodenough and

Heuser 1985; Kirk, et al. 1986; Ertl, et al. 1989; Woessner and Goodenough

1989; Woessner, et al. 1994; Godl, et al. 1995; Godl, et al. 1997; Hallmann, et al. 1998; Sumper and Hallmann 1998; Hallmann and Kirk 2000; Hallmann

2003; Ishida 2007), in particular for C. reinhardtii and V. carteri. Two important innovations in the lineage are related to the ECM and/or cell wall: (i) the transformation of the cell wall into an extracellular matrix and (ii) the expansion of the ECM (Kirk 2005). The ECM forms as little as 1% of a C. reinhardtii cell (Kirk 2005) and in smaller colonial volvocines (e.g. G. pectorale) the ECM contribution to organismal volume varies but overall contributes significantly less to organismal volume than in V. carteri and other larger taxa where the ECM can constitute >95% of organismal volume

(Coleman 2012). The major protein constituents of the volvocine ECM are pherophorin proteins and matrix metalloproteases (MMPs). Comparative genomic investigations of C. reinhardtii, G. pectorale and V. carteri have identified that pherophorin and MMPs gene expansion associates with the expansion of the ECM in V. carteri (Prochnik, et al. 2010; Hanschen, et al.

2016) but not in G. pectorale (Prochnik, et al. 2010; Hanschen, et al. 2016).

The pherophorins are hydroxy-proline rich glycoproteins that characteristically possess at least 1 pherophorin domain (PF12499) (Tschochner, et al. 1987;

Mages, et al. 1988; Godl, et al. 1995; Godl, et al. 1997; Merchant, et al. 2007;

Prochnik, et al. 2010; Hanschen, et al. 2016). Pherophorins cross-link and together with simple carbohydrates create nets that support the gelatinous

ECM and maintain cellular connectivity (Goodenough and Heuser 1985; Kirk, et al. 1986; Ender, et al. 2002). The expression of certain pherophorins has been shown to be upregulated in V. carteri by sex inducers and by nitrogen starvation in C. reinhardtii (Sumper, et al. 1993; Amon, et al. 1998; Rodriguez, et al. 1999). Pherophorin encoding genes are expanded in V. carteri (n=78) relative to C. reinhardtii (n=35) and G. pectorale (n=31) and thus the expansions are more appropriately associated with the expansion of the ECM than the transformation of the ECM into a cell wall (Prochnik, et al. 2010;

Hanschen, et al. 2016). The matrix metalloproteases (MMPs) are another major constituent of proteins found in the volvocine ECM (Hallmann, et al.

2001). Metalloprotease proteins typically possess a metalloprotease metal binding domain (for copper or zinc binding) and a proline rich domain (HR domain) (Heitzer and Hallmann 2002). The specific substrates of MMPs are not well characterised but have been shown to be involved in degrading ECM components of the C. reinhardtii cell wall during gamete release and hence are also referred to as gametolysins (Goodenough and Heuser 1985).

Wounding can also trigger MMP expression, which suggests a role in ECM remodelling response (Hallmann, et al. 2001). The timing of MMP expression possibly indicates that a major role of MMPs is to degrade pherophorins

(Hallmann, et al. 2001; Heitzer and Hallmann 2002), which may form part of a signal transduction cascade that activates sexual reproduction in V. carteri

(Sumper, et al. 1993). It has been suggested that specific MMPs might exist to target specific pherophorins, however, this remains unconfirmed. As per pherophorins, relative to C. reinhardtii and G. pectorale, the expansion of

MMP coding genes is an exclusive feature of V. carteri with 44 genes in C. reinhardtii, 36 genes in G. pectorale and 98 genes in V. carteri (Hanschen, et al. 2016).

The C. reinhardtii cell wall is composed of 3 layers (Goodenough and

Heuser 1985) and each cell of the simpler colonials such as T. socialis and G. pectorale is surrounded at least by a layer akin to the central layer of C. reinhardtii (the tripartite boundary layer) while in the larger Volvocaceae the central layer surrounds the colony as opposed to each cell (Kirk 2005;

Coleman 2012). In the smaller taxa such T. socialis and G. pectorale ECM material is deposited between cells surrounding each cell but they work together with other structures such as cytoplasmic bridges to maintain cellular connectivity (Nozaki 1990; Arakaki, et al. 2013). This form of ECM deposition and cell wall modifications that connect cells has been described as a form (or intermediate form) of cell wall transformation into an ECM (Kirk 2005). As yet the transformation of the cell wall into an ECM is not well understood at the molecular level. Proteins characterized from the outer layers of the C. reinhardtii cell wall include glycoproteins GP1, GP2 and GP3 (Adair, et al.

1987) and notably, orthologs of GP2 and GP3 are present in both V. carteri and G. pectorale but GP1 is absent from both taxa (Prochnik, et al. 2010;

Hanschen, et al. 2016) (Supplementary Table 24).

Pherophorins

Using C. reinhardtii pherophorin genes phC1 and GAS28, T. socialis gene models were searched (E-value=10-2) for pherophorins. The identified collection of proteins was examined for the presence of the DUF3707 domain

(the protein domain present in pherophorins, up to three domains per protein) using HMMER3 (with E-value=10-5). Manual modifications of computer annotations were made to improve the pherophorin domains flanking a proline-rich repeat. The intergenic region of the T. socialis genome was searched (E-value=10-3) and models of pherophorins were manually built. The presence of DUF3707 domains was verified using direct Pfam submission (E- value=10-5). A protein phylogenetic tree was constructed using the automatic pipeline (Hanschen et al. 2016) described for MMPs. A total of 36 pherophorin proteins were identified in T. socialis. RAxML (v8.0.20) phylogeny of curated pherophorins demonstrated both the close relationship of many pherophorins throughout the volvocine lineage as well as species-specific expansions

(Figure S11). The most prominent of species-specific expansions are found in

V. carteri. Pherophorins have been associated with the development of the expanded extracellular matrix of V. carteri. The deepest branching proteins in the phylogeny include the C. reinhardtii hydroxyproline-rich glycoproteins

(GAS proteins) that are upregulated during N-starvation and putative orthologs of GAS proteins are found in all taxa. As per findings by Hanschen et al. (2016) in describing G. pectorale pherophorins, extensive expansions of pherophorins are not a feature of the colonial T. socialis.

Matrix metalloprotease proteins

A curated and published collection of matrix metalloproteases (MMPs) from C. reinhardtii, G. pectorale and V. carteri as compiled by Hanschen et al

(2016) were used to search for T. socialis MMPs from amongst MAKER2 predicted proteins (E-value of 10-1). These proteins were queried for the

Peptidase M11 domain (PF05548- E-value=10-5) using HMMER3. In addition, the published collection of MMPs was searched against the genome of T. socialis using TBLASTN (E-value=10-5) to identify MMPs that were not automatically modelled. Four additional genes that contained the PF05548 domain were manually annotated bringing the total number of T. socialis MMP genes to 34. For comparison, C. reinhardtii has 44 MMP genes, G. pectorale has 36 MMP genes, and V. carteri has 98 MMP genes. A phylogenetic tree of

Peptidase M11 proteins was constructed using a using a custom pipeline

(Hanschen et al. 2016) of SATe version 2.2.7 (Liu et al. 2009) coupled with

RAxML (v8.0.20) (Stamatakis 2014). Full protein sequences were passed to

SATe using a FASTTREE tree estimation, an RAxML (Liu, et al. 2011) search after initial tree formation, a maximum limit of 10 iterations and the “longest” decomposition strategy. Bootstraps were made on the SATe output alignment and tree using RAxML with automatic model selection, a rapid hill climbing algorithm (-f d) and 100 bootstrap partitions. Bipartition information (-f a) was obtained using the SATe output tree and RAxML bootstrap tree. As is in the case for pherophorins, the phylogeny of MMPs shows lineage-wide orthologues as well as V. carteri-specific expansions (Figure S12). Included amongst V. carteri expansions are proteins that form numerous clades that are devoid of proteins from other taxa, such as, a branch of 46 proteins that consist only of V. carteri proteins. Related-to and sister-to the large clade of

46 V. carteri proteins is a clade of proteins from C. reinhardtii, T. socialis and

G. pectorale, which includes the MMP4 C. reinhardtii protein. The inclusion of

T. socialis proteins strengthens the correlation between the number of pherophorin and MMP proteins that has previously been reported (Hanschen, et al. 2016) and thus it may be concluded that genomic evidence to date does not contradict the hypothesis that specific MMPs exist to target specific pherophorins.

Cell wall proteins

As per G. pectorale and V. carteri the GP1 gene was not identified in T. socialis (Supplementary Table 24). It is unknown if GP1 is unique to C. reinhardtii or if the absence of GP1 in the multicellular volvocines was due to loss. If gene loss occurred then the loss likely occurred near the origin of volvocine multicellularity, which might prompt speculation of whether or not the loss of GP1 was somehow of consequence for cell wall modifications. To further explore the origin of the C. reinhardtii GP1 gene the GP1 protein was used as a query against available transcriptome-derived proteomes of other unicellular members of Chlamydomonadales order using BLASTP (Supplementary Table 2). No homologs were identified (Evalue=10-5) but as yet the evolutionary history of GP1 remains unknown. The GP2 gene was identified in T. socialis as a partial sequences but was not modelled.

Reg-A related genes

Reg-A related sequences (rls) genes (also known as the Volvocine

Algal RegA-Like (VARL) genes) were identified in T. socialis by following previous approaches (Hanschen et al. 2016). All published VARL gene sequences from C. reinhardtii, G. pectorale and V. carteri (Duncan et al. 2007;

Merchant et al. 2007; Hanschen et al. 2014, 2016) were searched against both the predicted genes and assembly of T. socialis. The presence of a

SAND domain in computationally predicted and manually constructed VARL proteins was verified using HMMER3 to query Pfam-A version 26.0 (Finn et al. 2010) (E-value of 10-5). There was one instance (RLS12 in C. reinhardtii) where a hypothesized SAND domain did not pass the E-value threshold. An alignment of RLS proteins from T. socialis, C. reinhardtii, G. pectorale and V. carteri was generated but was limited to the VARL domain because this is the only conserved region in RLS-proteins (Duncan et al. 2007; Hanschen et al.

2014)). MAFFT (v6.859b) with the L-INS-I option (Katoh et al. 2005) was used to generate a protein alignment. A phylogenetic tree was producing using

RAxML (v8.0.20) (Stamatakis 2014) with the Protein Gamma distribution and automatic model selection (Figure S13). Rapid bootstrapping analysis to search for the best-scoring maximum-likelihood (ML) tree was run with an automatically selected number of bootstraps. A total of 8 REGA-related (RLS) proteins were identified in T. socialis, which is the same number as G. pectorale. As is the case with G. pectorale - REGA-cluster proteins (regA, rlsA, rlsB, rlsC and rlsN/O) were not identified in T. socialis.

Fasciclin domain-containing proteins

The molecular mechanisms that underlie incomplete cytokinesis in the volvocines are as yet unknown, however, an important morphological feature of multicellular volvocines is the cytoplasmic bridges that maintain cellular connections. Cellular adhesion has been recognized as an important feature for the early development of multicellularity and adhesion molecules in a number of lineages have been characterized (e.g. pectins in plants, glycoproteins in fungi, phlorotannins in brown algae and cadherins in metazoa) (Niklas 2014). Besides ECM components that envelop cells and thereby maintain connectivity adhesive molecules are not well characterized in the volvocine lineage. One candidate cell adhesion molecule is the fasciclin-domain containing protein named algal-cell adhesion molecule

(Algal-CAM1) (Huber and Sumper 1994), which has to date only been identified in V. carteri and where it may play a role in cell adhesion as well as embryo inversion. Treating synchronised V. carteri cultures with antibodies targeting these proteins prior to embryogenesis resulted in the formation of misshapen “doughnut” shaped blastomers while later treatment with antibodies resulted in normal shaped blastomers except that embryological inversion was inhibited (Huber and Sumper 1994). Furthermore, a family of fasciclin-domain containing proteins was identified as gained at the ORIMC in the current investigation. To explore fasciclin-domain containing proteins throughout the volvocine lineage fasciclin domain (PF02469) containing proteins in C. reinhardtii, T. socialis, G. pectorale and V. carteri were identified

HMMER3 scans to Pfam-A. Proteins were aligned using MAFFT (v7.045), model testing was performed with ProtTest and RAxML (v8.2.0) (Blosum62, invariant sites, gamma distribution, empirical frequencies and 1000 bootstraps) was used for the generation of a phylogenetic tree (fig. 5). The V. carteri algal cell adhesion molecule (Algal-CAM) protein (XP_002958363.1) was identified in the V. carteri v1 proteome and was identified in protein phylogeny of fasciclin-domain containing proteins (Huber & Sumper, 1994).

The protein phylogeny identified 3 putative paralogues of V. carteri Algal-CAM

(fig. 5). Other fasciclin-domain containing proteins from C. reinhardtii, T. socialis, G. pectorale and V. carteri did not cluster with Algal-CAM paralogues. The family identified as originating at the ORIMC contains two T. socialis proteins (TSOC_001029-RA and TSOC_001027-RA) and one protein from both G. pectorale (scaffold00001.g603t.1) and V. carteri (96911) (fig. 5).

Prochnik table genes

The Prochnik et al (2010) genes “involved in processes that are associated with increased developmental complexity in V. carteri relate to C. reinhardtii” (hereafter Prochnik genes/proteins) were identified in T. socialis.

Orthologues of Prochnik genes have been identified in G. pectorale and revisions have been made that include orthologues of updated C. reinhardtii and V. carteri proteomes (Hanschen et al., 2016). Results from various informatics tools, were manually inspected to identify putative T. socialis orthologues of Prochnik table proteins (Supplementary Table 24). Protein sequences from C. reinhardtii, G. pectorale and V. carteri were used to identify T. socialis orthologues using homology searches (TBLASTN and

BLASTP). Furthermore, OrthoMCL protein-family clusters were inspected and where required protein phylogenies were generated to identify T. socialis orthologues. Additional efforts were taken to identify and classify matrix metalloproteases (MMPs), pherophorins and cell cycle proteins and are discussed separately below. Besides pherophorins and MMPs approximately

138 orthologues were identified from the predicted proteome of T. socialis. A further 4 genes were identified from TBLAST searches against the T. socialis genome assembly that were not generated by automated gene prediction.

This demonstrated that gene models expected to be present in the T. socialis genome/annotations were generally well captured.

Supplementary references

Abascal F, Zardoya R, Posada D. 2005. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21:2104-2105. Adair WS, Steinmetz SA, Mattson DM, Goodenough UW, Heuser JE. 1987. Nucleated assembly of Chlamydomonas and Volvox cell walls. J Cell Biol 105:2373-2382. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402. Amon P, Haas E, Sumper M. 1998. The sex-inducing pheromone and wounding trigger the same set of genes in the multicellular green alga Volvox. Plant Cell 10:781-789. Arakaki Y, Kawai-Toyooka H, Hamamura Y, Higashiyama T, Noga A, Hirono M, Olson BJ, Nozaki H. 2013. The simplest integrated multicellular organism unveiled. PLoS One 8:e81641. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, et al. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455- 477. Besemer J, Borodovsky M. 2005. GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses. Nucleic Acids Res 33:W451-454. Blaby IK, Blaby-Haas CE, Tourasse N, Hom EF, Lopez D, Aksoy M, Grossman A, Umen J, Dutcher S, Porter M, et al. 2014. The Chlamydomonas genome project: a decade on. Trends Plant Sci 19:672-680. Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, Rosenbloom K, Clawson H, Green ED, et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14:708-715. Boetzer M, Pirovano W. 2012. Toward almost closed genomes with GapFiller. Genome Biol 13:R56. Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114-2120. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bansal P, Bridge AJ, Poux S, Bougueleret L, Xenarios I. 2016. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. Methods Mol Biol 1374:23-54. Bruen TC, Philippe H, Bryant D. 2006. A simple and robust statistical test for detecting the presence of recombination. Genetics 172:2665-2681. Buchheim MA, Keller A, Koetschan C, Forster F, Merget B, Wolf M. 2011. Internal transcribed spacer 2 (nu ITS2 rRNA) sequence-structure phylogenetics: towards an automated reconstruction of the green algal tree of life. PLoS One 6:e16931. Campbell MS, Holt C, Moore B, Yandell M. 2014. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics 48:4 11 11-39. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972-1973. Chang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, Cramer CL, Huang X. 2015. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol 16:30. Coleman AW. 2012. A Comparative Analysis of the Volvocaceae (Chlorophyta)(1). J Phycol 48:491-513. Csuros M. 2010. Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics 26:1910-1912. Delsuc F, Brinkmann H, Philippe H. 2005. Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 6:361-375. Demuth JP, Hahn MW. 2009. The life and death of gene families. Bioessays 31:29- 39. Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS Comput Biol 7:e1002195. Edgar RC. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792-1797. Ender F, Godl K, Wenzl S, Sumper M. 2002. Evidence for autocatalytic cross- linking of hydroxyproline-rich glycoproteins during extracellular matrix assembly in Volvox. Plant Cell 14:1147-1160. Ertl H, Mengele R, Wenzl S, Engel J, Sumper M. 1989. The extracellular matrix of Volvox carteri: molecular structure of the cellular compartment. J Cell Biol 109:3493-3501. Falda M, Toppo S, Pescarolo A, Lavezzo E, Di Camillo B, Facchinetti A, Cilia E, Velasco R, Fontana P. 2012. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms. BMC Bioinformatics 13 Suppl 4:S14. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, et al. 2014. Pfam: the protein families database. Nucleic Acids Res 42:D222-230. Gao S, Sung WK, Nagarajan N. 2011. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. J Comput Biol 18:1681- 1691. Gnerre S, Maccallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, et al. 2011. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci U S A 108:1513-1518. Godl K, Hallmann A, Rappel A, Sumper M. 1995. Pherophorins: a family of extracellular matrix glycoproteins from Volvox structurally related to the sex- inducing pheromone. Planta 196:781-787. Godl K, Hallmann A, Wenzl S, Sumper M. 1997. Differential targeting of closely related ECM glycoproteins: the pherophorin family from Volvox. EMBO J 16:25- 34. Goodenough UW, Heuser JE. 1985. The Chlamydomonas cell wall and its constituent glycoproteins analyzed by the quick-freeze, deep-etch technique. J Cell Biol 101:1550-1568. Gorman DS, Levine RP. 1965. Cytochrome f and plastocyanin: their sequence in the photosynthetic eletron transport chain of Chlamydomonas reinhardtii. Proc Natl Acad Sci U S A 54:1665-1669. Guindon S, Delsuc F, Dufayard JF, Gascuel O. 2009. Estimating maximum likelihood phylogenies with PhyML. Methods Mol Biol 537:113-137. Hallmann A. 2003. Extracellular matrix and sex-inducing pheromone in Volvox. Int Rev Cytol 227:131-182. Hallmann A, Amon P, Godl K, Heitzer M, Sumper M. 2001. Transcriptional activation by the sexual pheromone and wounding: a new gene family from Volvox encoding modular proteins with (hydroxy)proline-rich and metalloproteinase homology domains. Plant J 26:583-593. Hallmann A, Godl K, Wenzl S, Sumper M. 1998. The highly efficient sex-inducing pheromone system of Volvox. Trends Microbiol 6:185-189. Hallmann A, Kirk DL. 2000. The developmentally regulated ECM glycoprotein ISG plays an essential role in organizing the ECM and orienting the cells of Volvox. J Cell Sci 113 Pt 24:4605-4617. Hanschen ER, Marriage TN, Ferris PJ, Hamaji T, Toyoda A, Fujiyama A, Neme R, Noguchi H, Minakuchi Y, Suzuki M, et al. 2016. The Gonium pectorale genome demonstrates co-option of cell cycle regulation during the evolution of multicellularity. Nat Commun 7:11370. Heitzer M, Hallmann A. 2002. An extracellular matrix-localized metalloproteinase with an exceptional QEXXH metal binding site prefers copper for catalytic activity. J Biol Chem 277:28280-28286. Hochberg Y, Benjamini Y. 1990. More powerful procedures for multiple significance testing. Stat Med 9:811-818. Holt C, Yandell M. 2011. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491. Hongo JA, de Castro GM, Cintra LC, Zerlotini A, Lobo FP. 2015. POTION: an end- to-end pipeline for positive Darwinian selection detection in genome-scale data through phylogenetic comparison of protein-coding genes. BMC Genomics 16:567. Huber O, Sumper M. 1994. Algal-CAMs: isoforms of a cell adhesion molecule in embryos of the alga Volvox with homology to Drosophila fasciclin I. EMBO J 13:4212-4222. Hubisz MJ, Pollard KS, Siepel A. 2011. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief Bioinform 12:41-51. Ishida K. 2007. Sexual pheromone induces diffusion of the pheromone- homologous polypeptide in the extracellular matrix of Volvox carteri. Eukaryot Cell 6:2157-2162. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, et al. 2014. InterProScan 5: genome-scale protein function classification. Bioinformatics 30:1236-1240. Kent WJ. 2002. BLAT--the BLAST-like alignment tool. Genome Res 12:656-664. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. 2011. Adaptive seeds tame genomic sequence comparison. Genome Res 21:487-493. Kirk DL. 2005. A twelve-step program for evolving multicellularity and a division of labor. Bioessays 27:299-310. Kirk DL, Birchem R, King N. 1986. The extracellular matrix of Volvox: a comparative study and proposed system of nomenclature. J Cell Sci 80:207-231. Kirk DL, Harper JF. 1986. Genetic, biochemical, and molecular approaches to Volvox development and evolution. Int Rev Cytol 99:217-293. Korf I. 2004. Gene finding in novel genomes. BMC Bioinformatics 5:59. Kuck P, Meusemann K. 2010. FASconCAT: Convenient handling of data matrices. Mol Phylogenet Evol 56:1115-1118. Lee TY, Bretana NA, Lu CT. 2011. PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity. BMC Bioinformatics 12:261. Li L, Stoeckert CJ, Jr., Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189. Liu K, Linder CR, Warnow T. 2011. RAxML and FastTree: comparing two methods for large-scale maximum likelihood phylogeny estimation. PLoS One 6:e27731. Lopez D, Casero D, Cokus SJ, Merchant SS, Pellegrini M. 2011. Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data. BMC Bioinformatics 12:282. Mages HW, Tschochner H, Sumper M. 1988. The sexual inducer of Volvox carteri. Primary structure deduced from cDNA sequence. FEBS Lett 234:407-410. Magoc T, Salzberg SL. 2011. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27:2957-2963. Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurences of k-mers. Bioinformatics 27:764-770. Merchant SS, Prochnik SE, Vallon O, Harris EH, Karpowicz SJ, Witman GB, Terry A, Salamov A, Fritz-Laylin LK, Marechal-Drouard L, et al. 2007. The Chlamydomonas genome reveals the evolution of key animal and plant functions. Science 318:245-250. Miller SM, Schmitt R, Kirk DL. 1993. Jordan, an active Volvox transposable element similar to higher plant transposons. Plant Cell 5:1125-1138. Muhlhausen S, Hellkamp M, Kollmar M. 2015. GenePainter v. 2.0 resolves the taxonomic distribution of intron positions. Bioinformatics 31:1302-1304. Niklas KJ. 2014. The evolutionary-developmental origins of multicellularity. Am J Bot 101:6-25. Nozaki H. 1986. Sexual reproduction in Gonium sociale (Chlorophta, Volvocales). Phycologia 25:29-35. Nozaki H. 1990. Ultrastructure of the extracellular matrix of Gonium (Volvocales, Chlorophyta). Phycologia 29:1-8. Nozaki H, Misawa K, Kajita T, Kato M, Nohara S, Watanabe MM. 2000. Origin and evolution of the colonial volvocales (Chlorophyceae) as inferred from multiple, chloroplast gene sequences. Mol Phylogenet Evol 17:256-268. Odronitz F, Pillmann H, Keller O, Waack S, Kollmar M. 2008. WebScipio: an online tool for the determination of gene structures using protein sequences. BMC Genomics 9:422. Prochnik SE, Umen J, Nedelcu AM, Hallmann A, Miller SM, Nishii I, Ferris P, Kuo A, Mitros T, Fritz-Laylin LK, et al. 2010. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri. Science 329:223-226. Putnam NH, Srivastava M, Hellsten U, Dirks B, Chapman J, Salamov A, Terry A, Shapiro H, Lindquist E, Kapitonov VV, et al. 2007. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317:86-94. Quinlan AR. 2014. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics 47:11 12 11-34. Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841-842. Retief JD. 2000. Phylogenetic analysis using PHYLIP. Methods Mol Biol 132:243- 258. Rodriguez H, Haring MA, Beck CF. 1999. Molecular characterization of two light- induced, gamete-specific genes from Chlamydomonas reinhardtii that encode hydroxyproline-rich proteins. Mol Gen Genet 261:267-274. Siepel A, Haussler D. 2004. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21:468-488. Smith-Unna R, Boursnell C, Patro R, Hibberd JM, Kelly S. 2016. TransRate: reference-free quality assessment of de novo transcriptome assemblies. Genome Res 26:1134-1144. Srivastava M, Begovic E, Chapman J, Putnam NH, Hellsten U, Kawashima T, Kuo A, Mitros T, Salamov A, Carpenter ML, et al. 2008. The Trichoplax genome and the nature of placozoans. Nature 454:955-960. Srivastava M, Simakov O, Chapman J, Fahey B, Gauthier ME, Mitros T, Richards GS, Conaco C, Dacre M, Hellsten U, et al. 2010. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466:720-726. Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics 30:1312-1313. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. 2006. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34:W435-439. Sumper M, Berg E, Wenzl S, Godl K. 1993. How a sex pheromone might act at a concentration below 10(-16) M. EMBO J 12:831-836. Sumper M, Hallmann A. 1998. Biochemistry of the extracellular matrix of Volvox. Int Rev Cytol 180:51-85. Tamura K, Peterson D, Peterson N, Stecher G, Nei M, Kumar S. 2011. MEGA5: Molecular Evolutionary Genetics Analysis Using Maximum Likelihood, Evolution Distance and Maximum Parsimony Methods. Mol Biol Evol 28:2731-2739. Tschochner H, Lottspeich F, Sumper M. 1987. The sexual inducer of Volvox carteri: purification, chemical characterization and identification of its gene. EMBO J 6:2203-2207. Vurture GW, Sedlazeck FJ, Nattestad M, Underwood CJ, Fang H, Gurtowski J, Schatz MC. 2017. GenomeScope: Fast reference-free genome profiling from short reads. Bioinformatics. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, et al. 2014. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9:e112963. Wang Y, Coleman-Derr D, Chen G, Gu YQ. 2015. OrthoVenn: a web server for genome wide comparison and annotation of orthologous clusters across multiple species. Nucleic Acids Res 43:W78-84. Wences AH, Schatz MC. 2015. Metassembler: merging and optimizing de novo genome assemblies. Genome Biol 16:207. Woessner JP, Goodenough UW. 1989. Molecular characterization of a zygote wall protein: an extensin-like molecule in Chlamydomonas reinhardtii. Plant Cell 1:901-911. Woessner JP, Molendijk AJ, van Egmond P, Klis FM, Goodenough UW, Haring MA. 1994. Domain conservation in several volvocalean cell wall proteins. Plant Mol Biol 26:947-960. Yang Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586-1591. Yang Z, Bielawski JP. 2000. Statisitical methods for detecting molecular adaption. Trends Ecol Evol 15:496-503.

Supplememntary Figures

Supplementary Figure 1: A light microscope image of Tetrabaena socialis.

The extracellular matrix surrounding each cell is faintly visible.

Supplementary Figure 2: Kmer frequency plot.

Kmer counting was performed using Jellyfish and the outputs were used to estimate genome size using GenomeScope. Presented here is a k of 77.

Supplementary Figure 3: Venn diagram of orthologous genes amongst volvocine algae. The abbreviations Crei, Tsoc, Gpec and Vcar denote

Chlamydomonas reinhardtii, Tetrabaena socialis, Gonium pectorale and

Volvox carteri. Image generated using OrthoVenn with OrthoMCL families as an input.

Supplementary Figure 4: The number of Pfam-A domains identified in the proteomes of each species examined using HMMER3.

Supplementary Figure 5. Multicellular and enriched domains as categorised by presence in selected chlorophytes, plants and opisthokonta. In the left

Venn diagram identified Pfam-A domains were sorted for presence amongst selected chlorophytes, plants and opisthokonta. In the right Venn diagram enriched domains between Chlamydomonas reinhardtii and the multicellular taxa as well as multicellular domains were sorted according to the distribution of domains amongst selected chlorophyta, plantae and opisthokonta as represented in the left Venn diagram.

Supplementary Figure 6: Synteny relationships of CYCD1 genes

Tetrabaena_RDP1

100 78 Volvox_RDP1 Gonium_RDP2 Gonium_RDP1

58 96 Chlamydomonas_RPD2 Tetrabaena_RDP2 62 Tetrabaena_RDP3 73 89 Chlamydomonas_RDP1 Chlamydomonas_RDP3 Volvox_RDP3 Gonium_RDP3 Volvox_CDKD1 90 Gonium_CDKD1 82 61 Tetrabaena_CDKD1 Chlamydomonas_CDKD1 Gonium_CDKI1 57 70 Volvox_CDKI1 100 99 Chlamydomonas_CDKI1 Tetrabaena_CDKI1 Chlamydomonas_CDKE1 100 72 Gonium_CDKE1 Volvox_CDKE1 53 83 Tetrabaena_CDKB1 60 Chlamydomonas_CDKB1 100 Gonium_CDKB1 Volvox_CDKB1 Gonium_CDKA1 58 86 99 Tetrabaena_CDKA1 98 60 Chlamydomonas_CDKA1 Volvox_CDKA1 Volvox_CDKA2 100 85 Chlamydomonas_CDKA2 Tetrabaena_CDKA2

64 Chlamydomonas_CDKG2 93 Volvox_CDKG1 94 Gonium_CDKG1

76 Chlamydomonas_CDKG1 Tetrabaena_CDKG1 Chlamydomonas_CDKH1 100 52 Volvox_CDKH1

76 Gonium_CDKH1 Gonium_CDKC1 92 Volvox_CDKC1 Tetrabaena_CDKC1 Chlamydomonas_CDKC1

1.0

Supplementary Figure 7: Mid-root phylogeny of cyclin-dependent kinases and rhodanese/cell-cycle phosphatases (RDPs). Cyclin-dependent kinases from

Chlamydomonas reinhardtii (green), Tetrabaena socialis (magenta), Gonium pectorale (blue) and Volvox carteri (black).

N1 N1 N2

N2

N2 N3

N3 RB-A

RB-A

RB-A

RB-B

RB-B

C1

C4 Conserved CDK phosphorylation sites indicated in Hanschen et al. 2016 Nat. Commun. Degenerate or species specific CDK phosphorylation sites indicated in Hanschen et al. 2016 Nat. Commun. CDK phosphorylation sites predicted by PlantPhos server.

Supplementary Figure 8: MAT3/RB alignment and putative cyclin-dependent kinase phosphorylation targets.

Supplementary Figure 9: Maximum-likelihood (ML) tree of RB/MAT3 of volvocine algae.

Bootstrap values (≥50%) for the ML and neighbor-joining (NJ) analyses are indicated left and right side, respectively. The scale bar corresponds to 0.1 amino acid substitutions per position.

Supplementary Figure 10: Retinoblastoma gene structures of genome sequenced volvocines. Black bars underline regions where introns have been gained in the colonial/multicellular taxa.

Supplementary Figure 11: Mid-pointed rooted phylogeny of pherophorins from

Chlamydomonas reinhardtii, Tetrabaena socialis, Gonium pectorale and

Volvox carteri.

Proteins were from Hanschen et al (2016) collection of curated pherophorins with the addition of Tetrabaena socialis proteins. Nodes where proteins from all taxa were present were collapsed in order to simplify the image.

Highlighted in green are nodes and branches that consist exclusively of

Volvox carteri proteins, in red are branches where Volvox carteri proteins are absent and in purple are branches that contain Chlamydomonas reinhardtii hydroxyproline-rich glycoproteins upregulated during N-starvation (GAS proteins) as well as homologues from Gonium pectorale, Tetrabaena socialis and Volvox carteri.

Supplementary Figure 12: Mid-pointed rooted phylogeny of Matrix

Metalloproteases from Chlamydomonas reinhardtii, Tetrabaena socialis,

Gonium pectorale and Volvox carteri.

Proteins were from Hanschen et al (2016) collection of curated MMPs with the addition of Tetrabaena socialis proteins. Nodes where proteins from three or more taxa were present were collapsed in order to simplify the image.

Highlighted in green are nodes and branches that consist exclusively of

Volvox carteri proteins, in red Chlamydomonas reinhardtii proteins and in blue a branch with only Chlamydomonas reinhardtii and Tetrabaena socialis proteins.

Supplementary Figure 13: Mid-root phylogeny of RegA-related sequences from Chlamydomonas reinhardtii, Tetrabaena socialis, Gonium pectorale and

Volvox carteri.

RegA-related sequences from Chlamydomonas reinhardtii (green),

Tetrabaena socialis (magenta), Gonium pectorale (blue) and Volvox carteri

(black). The phylogeny was generated from an alignment of only the conserved VARL domain region. The phylogeny confirms that the regA-cluster of proteins (denoted by red branches that include regA, rlsA, rlsB and rlsC) is not present in Tetrabaena socialis. RLS1/rlsD proteins are shown as sister to the regA-cluster but without significant support. Supplementary Table 1: Kmer analysis with Jellyfish, Allpaths-LG and

BBtools. A substantial variance was observed between different tools or kmer sizes and a consensus for genome size or repeat content was not obtained.

Kmer/algorithm Estimated Genome Repeat content (%)

Size (mb)

Jellyfish/GenomeScope

K=19 101,475 20,4%

K=21 102,175 17,9%

K=25 103,308 16,7%

K=33 105,230 16,1%

K=45 108,311 16,9%

K=55 111,586 18,9%

K=77 117,848 23,7%

Allpaths-LG

K=25 178,9 43%

BBtools

K=31 114,954 10,8%

Supplementary Table 2: Sources of data used in comparative genome analyses.

Organism Version Source Reference

Chlamydomona Version 5.5, http://phytozome.jgi.doe.gov/ (Blaby et s reinhardtii Version 5.5 al., 2014; S.

(Chlorophyte: S. Merchant

Chlamydomona et al., 2007) dales)

Volvox carteri Version 1, http://genome.jgi.doe.gov/ and (Prochnik et

(Chlorophyte: Version 2.0, http://phytozome.jgi.doe.gov/ al., 2010)

Chlamydomona Version 2.X dales)

Gonium Version 1 Volvocales Genome (Hanschen pectorale Consortium now available at et al., 2016)

(Chlorophyte: http://www.ncbi.nlm.nih.gov/ge

Chlamydomona nome/?term=gonium+pectorale dales)

Ostreococcus Version 2 http://genome.jgi.doe.gov/ (Hindle et tauri al., 2014)

(Chlorophyte)

Ostreococcus Version 2 http://genome.jgi.doe.gov/ (Palenik et

RCC809 al., 2007)

(Chlorophyte)

Ostreococcus Version 2 http://genome.jgi.doe.gov/ (Palenik et lucimarinus al., 2007)

(Chlorophyte)

Chlorella Version 1 http://genome.jgi.doe.gov/ (Blanc et viriabilis al., 2010)

(NC64) (Chlorophyte)

Micromonas Version 3 http://genome.jgi.doe.gov/ (Worden et pusilla al., 2009)

CCMP1545

(Chlorophyte)

Micromonas Version 3 http://genome.jgi.doe.gov/ (Worden et pusilla RCC299 al., 2009)

(Chlorophyte)

Coccomyxa Version 2 http://genome.jgi.doe.gov/ (Blanc et subellipsoidea al., 2012)

(Chlorophyte)

Dunaliella Version 1 http://marinemicroeukaryotes.o (Rismani- tertiolecta (Transcriptome) rg/ yazdi et al.,

(Chlorophyte: 2011)

Chlamydomona dales)

Polytomella Version 1 http://marinemicroeukaryotes.o (Smith & parva (Transcriptome) rg/ Lee, 2014)

(Chlorophyte:

Chlamydomona dales)

Chlamydomona Version 1 http://marinemicroeukaryotes.o (Keeling et s leiostraca (Transcriptome) rg/ al., 2014)

(Chlorophyte: Chlamydomona dales)

Chlamydomona Version 1 http://marinemicroeukaryotes.o (Keeling et s (Transcriptome) rg/ al., 2014) chlamydogama

(Chlorophyte:

Chlamydomona dales)

Chlamydomona Version 1 http://marinemicroeukaryotes.o (Keeling et s eurale (Transcriptome) rg/ al., 2014)

(Chlorophyte:

Chlamydomona dales)

Bathycoccus Version 1 http://bioinformatics.psb.ugent. X(Moreau prasinos be/ et al., 2012)

(Chlorophyte:

Chlamydomona dales)

Auxenochlorell Version 1 www.ncbi.nlm.nih.gov/) (C. Gao et a al., n.d.) protothecoides

(Chlorophyte)

Caenorhabditu Version 84.245 http://www.wormbase.org/ (Equence & s elegans Iology, (Metazoan) 2012)

Drosophila Version 6 http://www.flybase.net/ (Ketchum et melanogaster al., 2000)

(Metazoan)

Selaginella Version 1 http://phytozome.jgi.doe.gov/ (Banks et moellendorffii al., 2007)

(Plant)

Arabidopsis TAIR10 https://www.arabidopsis.org/do thaliana (Plant) wnload/index-

auto.jsp?dir=/download_files/Pr

oteins

Physcomitrella Version 3.3 http://phytozome.jgi.doe.gov/ (Shapiro et patens (Plant) al., 2008)

Monosiga Version 1 http://genome.jgi.doe.gov/ (King et al., brevicolis 2008)

(Metazoan)

Saccharomyce Version http://www.yeastgenome.org/ (Mewes et s cerevisea R64.2.1 al., 1997)

(Fungi)

Homo sapiens Version 84.20 http://www.ensembl.org/ (Venter et

(Metazoan) al., 2001)

Supplementary Table 3: BLAST results of Tetrabaena socialis proteins against UNIPROT and the proteomes of other volvocines. Note that results include only top blast hits (E-values <1x10-7).

Query Proteomes Tetrabaena socialis matches (15

153 proteins)

C. reinhardtii 12 954 (83.5%)

G. pectorale 12 740 (82.1%)

V. carteri 12 218 (78.7%)

Volvocines 13 712 (88.3)

UNIPROT-SWISSPROT 7 838 (50.5)

UNIPROT-TREMBL 13 091 (84.3%)

All Blast hits 13 982 (90%)

Supplementary Table 4: The number of Chlamydomonas reinhardtii orthologues for multicellular volvocines. Percentages reflect the percentage of the multicellular volvocine proteins with Chlamydomonas reinhardtii orthologues.

Chlamydomonas Proteins from multicelled taxa

reinhardtii with Chlamydomonas

Orthologues reinhardtii orthologues.

Tetrabaena socialis 8931 57,5%

Gonium pectorale 9384 55,8%

Volvox carteri 9384 64%

Supplementary Table 5: Domains that are significantly different (p<0,05) between Chlamydomonas reinhardtii and the multicellular volvocines.

Domains are classified for presence amongst major eukaryotic lineages examined. A (+) symbol in column 3 indicates that domains are enriched in multicellular volvocines. Domains with an (X) in column 2 indicate that they are absent from Chlamydomonas reinhardtii but present in the proteomes of all 3 multicellular volvocines (Tetrabaena socialis, Gonium pectorale and

Volvox carteri).

PFAM P-value Enriched Annotation of Domain

Domain Multicell-

ular (+/-)

Common to all examined Eukaryotic lineages

PF00331 X Glycosyl hydrolase family 10

PF00832 X Ribosomal L39 protein

PF02320 X Ubiquinol-cytochrome C reductase hinge protein

PF03671 X Ubiquitin fold modifier 1 protein

PF05557 X Mitotic checkpoint protein

PF06331 X Transcription factor TFIIH complex subunit Tfb5

PF06645 X Microsomal signal peptidase 12 kDa subunit

(SPC12)

PF07145 X Ataxin-2 C-terminal region

PF07720 X Tetratricopeptide repeat

PF09202 X Rio2, N-terminal PF10138 X vWA found in TerF C terminus

PF10537 X ATP-utilising chromatin assembly and

remodelling N-terminal

PF11559 X Afadin- and alpha -actinin-Binding

PF12627 X Probable RNA and SrmB- binding site of

polymerase A

PF13465 X Zinc-finger double domain

PF14811 X Protein of unknown function TPD sequence-

motif

PF14822 X Vasohibin

PF13857 1,11E-16 (+) Ankyrin repeats (many copies)

PF13637 1,11E-16 (+) Ankyrin repeats (many copies)

PF00078 2,55E-15 (-) Reverse transcriptase (RNA-dependent DNA

polymerase)

PF12796 1,40E-14 (+) Ankyrin repeats (3 copies)

PF13606 1,67E-10 (+) Ankyrin repeat

PF00023 9,29E-08 (+) Ankyrin repeat

PF01011 3,24E-05 (-) PQQ enzyme repeat

PF13360 1,44E-04 (-) PQQ-like domain

PF00652 3,63E-04 (+) Ricin-type beta-trefoil lectin domain

PF14291 4,11E-04 (-) Domain of unknown function (DUF4371)

PF01753 7,84E-04 (+) MYND finger

PF00010 9,45E-04 (-) Helix-loop-helix DNA-binding domain

PF00665 1,08E-03 (+) Integrase core domain PF05686 1,16E-03 (-) Glycosyl transferase family 90

PF14200 1,41E-03 (+) Ricin-type beta-trefoil lectin domain-like

PF07645 1,44E-03 (-) Calcium-binding EGF domain

PF01453 2,50E-03 (+) D-mannose binding lectin

PF00233 2,80E-03 (-) 3'5'-cyclic nucleotide phosphodiesterase

PF13570 3,71E-03 (-) PQQ-like domain

PF01663 4,69E-03 (-) Type I phosphodiesterase / nucleotide

pyrophosphatase

PF07731 6,19E-03 (-) Multicopper oxidase

PF13371 8,49E-03 (-) Tetratricopeptide repeat

PF00059 1,26E-02 (-) Lectin C-type domain

PF08373 1,86E-02 (-) RAP domain

PF00394 1,94E-02 (-) Multicopper oxidase

PF13563 1,94E-02 (-) 2'-5' RNA ligase superfamily

PF00704 2,14E-02 (-) Glycosyl hydrolases family 18

PF13738 2,51E-02 (-) Pyridine nucleotide-disulphide oxidoreductase

PF05729 2,53E-02 (-) NACHT domain

PF10275 2,53E-02 (-) Peptidase C65 Otubain

PF00884 2,60E-02 (-) Sulfatase

PF01541 2,63E-02 (-) GIY-YIG catalytic domain

PF07732 3,01E-02 (-) Multicopper oxidase

PF13405 3,03E-02 (-) EF-hand domain

PF00581 3,04E-02 (-) Rhodanese-like domain

PF13469 3,07E-02 (-) Sulfotransferase family PF13499 3,35E-02 (-) EF-hand domain pair

PF13401 3,49E-02 (-) AAA domain

PF07676 3,55E-02 (+) WD40-like Beta Propeller Repeat

PF00188 4,20E-02 (+) Cysteine-rich secretory protein family

PF01636 4,56E-02 (+) Phosphotransferase enzyme family

PF13481 4,65E-02 (-) AAA domain

Unique to Chlorophytes and Opistokonta

PF02607 X B12 binding domain

PF07221 X N-acylglucosamine 2-epimerase (GlcNAc 2-

epimerase)

PF09594 X Protein of unknown function (DUF2029)

PF12051 X Protein of unknown function (DUF3533)

PF13358 X DDE superfamily endonuclease

PF13688 X Metallo-peptidase family M12

PF14740 X Domain of unknown function (DUF4471)

PF14864 X Alkyl sulfatase C-terminal

PF05887 2,10E-06 (-) Procyclic acidic repetitive protein (PARP)

PF12851 1,19E-04 (-) Oxygenase domain of the 2OGFeDO

superfamily

PF16588 6,17E-03 (-) C2H2 zinc-finger

PF07699 1,20E-02 (-) GCC2 and GCC3

PF07562 1,49E-02 (-) Nine Cysteines Domain of family 3 GPCR

PF06701 1,73E-02 (+) Mib_herc2 PF03762 2,83E-02 (+) Vitelline membrane outer layer protein I (VOMI)

PF02494 3,13E-02 (+) HYR domain

PF01822 3,47E-02 (-) WSC domain

PF01607 4,49E-02 (-) Chitin binding Peritrophin-A domain

Unique to Chlorophytes and Plants

PF07123 X Photosystem II reaction centre W protein

(PsbW)

PF08706 X D5 N terminal like

PF07460 2,44E-07 (-) NUMOD3 motif (2 copies)

PF00759 1,41E-03 (+) Glycosyl hydrolase family 9

PF09118 1,09E-02 (-) Domain of unknown function (DUF1929)

PF07250 1,39E-02 (-) Glyoxal oxidase N-terminus

PF03924 2,32E-02 (-) CHASE domain

Unique to Chlorophytes

PF02415 X Chlamydia PMP repeat

PF04230 X Polysaccharide pyruvyl transferase

PF08707 X Primase C terminal 2 (PriCT-2)

PF07282 1,55E-15 (-) Putative transposase DNA-binding domain

PF12499 1,89E-02 (+) Pherophorin

PF06863 1,44E-03 (-) Protein of unknown function (DUF1254)

PF06742 1,44E-03 (-) Protein of unknown function (DUF1214)

Supplementary Table 6: Prochnik outlier domains identified in

Chlamydomonas reinhardtii, Tetrabaena socialis, Gonium pectorale and

Volvox carteri.

Domain Annotation Chlamydomonas Tetrabaena Gonium Volvox

reinhardtii socialis pectorale carteri

PF00125 Histones 124 170 134 56

PF00023 Ankyrin 80 135 143 31

Repeats

PF00112 Cysteine 24 17 18 16

Protease

PF05548 Gametolysin 50 38 41 109

PF00560 Leucine- 36 39 21 23

Rich

Repeats

Supplementary Table 7: HMMER3 detected Ankyrin repeat domains in

Chlamydomonas reinhardtii, Tetrabaena socialis, Gonium pectorale and

Volvox carteri.

Ankyrin Chlamydomonas Tetrabaena Gonium Volvox

Domains reinhardtii socialis pectorale carteri

PF00023 104 203 253 61

PF12796 208 416 518 111

PF13606 112 247 286 70 PF13637 158 383 476 99

PF13857 150 365 403 105

Supplementary Table 8: Evaluation of parsimony implementations at 3 nodes

in the phylogeny for family gain and loss

Wagner’s Wagner’s Dollo

symmetric (gain asymmetric

cost=1) (gain cost=2)

Gonium/Volvox 1248 gain 261 gain 234 gain

67 loss 317 loss 406 loss

Multicellular 131 gain 517 gain 467 gain

volvocines 137 loss 70 loss 111 gain

Chlorophyceae/ 2657 gain 3099 gain 3455 gain

Trebouxiophyceae 203 loss 389 loss 601 loss

Supplementary Table 9: Gene Families gained at the ORIMC. The most

common domain found is included as well as a description of the domain.

Consensus annotations derived from Chlamydomonas reinhardtii v5.5 and

Volvox carteri v2 annotations as well as from argot2 are provided for each

family.

Common Domain/s

Family ID Domain/s Description Description

Protein kinase

OGFLEX_1019 PF00069 domain STELAR K+ outward rectifier Cornifin (SPRR) Putative nucleic acid binding

OGFLEX_1034 PF02389 family family

Ankyrin repeats (3

OGFLEX_1035 PF12796 copies) Ankyrin repeat family protein

PF14295 PAN domain/

/ Calcium-binding Cysteine proteinases

OGFLEX_1046 PF07645 EGF domain superfamily protein

Cysteine proteinases

OGFLEX_1063 PF00008 EGF-like domain superfamily protein

OGFLEX_1079 Unknown

Adenylate and

Guanylate cyclase Unknown putative signalling

OGFLEX_1126 PF00211 catalytic domain functions

BTB-POZ and MATH domain

OGFLEX_1172 PF00651 BTB/POZ domain 6

OGFLEX_1215 PF03016 Exostosin family Exostosin family protein

OGFLEX_1256 Unknown

General control non-

OGFLEX_1271 PF00098 Zinc knuckle repressible

Cysteine-rich

secretory protein

OGFLEX_1382 PF00188 family Unknown

BTB And C-terminal

OGFLEX_1467 PF07707 Kelch Unknown

OGFLEX_1610 PF14295 PAN domain Cysteine proteinases superfamily

OGFLEX_1619 PF02892 BED zinc finger Unknown

OGFLEX_2177 Unknown

OGFLEX_4318 Unknown

DNA N-6-adenine-

methyltransferase Unknown putative

OGFLEX_4319 PF05869 (Dam) methyltransferase activity

Microsomal signal

peptidase

12 kDa subunit Putative proteolytic signal

OGFLEX_4334 PF06645 (SPC12) peptidase

OGFLEX_5276 Unknown

CAP (Cysteine-rich secretory

proteins, Antigen 5, and

Cysteine-rich Pathogenesis-related 1

secretory protein protein)

OGFLEX_5317 PF00188 family superfamily

Similar to bacterial putative

OGFLEX_5320 PF01549 ShK domain-like methyl transferase

Calcineurin-like Unknown hydrolase putative

OGFLEX_5717 PF00149 phosphoesterase activity

OGFLEX_5804 Unknown glycine-rich family

Protein tyrosine ACT-like protein tyrosine

OGFLEX_6272 PF07714 kinase kinase family protein

OGFLEX_7103 Unknown

OGFLEX_7221 Unknown

OGFLEX_7232 Unknown

Ubiquitin fold Putative ubiquitin fold

OGFLEX_7343 PF03671 modifier 1 protein modifier 1 protein

DEAD/DEAH box

RNA helicase family

OGFLEX_7355 PF00270 protein RECQ helicase SIM

OGFLEX_7359 PF13639 Ring finger domain RPM1 interacting protein 2

Galactosyltransferas Galactosyltransferase family

OGFLEX_7429 PF01762 e protein

Colon cancer- Unknown putative

associated protein microtubule

OGFLEX_7511 PF07035 Mic1-like kinesin

OGFLEX_7538 PF02535 ZIP Zinc transporter Zinc/iron permease

OGFLEX_8628 PF00484 Carbonic anhydrase Unknown

RNA recognition

motif.

(a.k.a. RRM, RBD, Unknown putative nucleotide

OGFLEX_8632 PF00076 or RNP domain) binding

CAAD domains of

cyanobacterial

aminoacyl-tRNA

OGFLEX_8798 PF14159 synthetase Unknown

Glycosyl hydrolase

OGFLEX_8819 PF00331 family Glycosyl hydrolase family Glycosyl transferase

OGFLEX_8913 PF00535 family Glycosyl transferase family

OGFLEX_8916 Unknown

Chlorophyll A-B Chlorophyll A-B binding

OGFLEX_8987 PF00504 binding protein family

Protein phosphatase 2C

OGFLEX_9949 family

Formin Homology 2

OGFLEX_9953 PF02181 Domain Putative formin like family

OGFLEX_9954 MAPKKK-related family

OGFLEX_9968 Unknown

Domain of unknown

OGFLEX_9969 PF11822 function (DUF3342) Unknown

Fasciclin domain containing

OGFLEX_10455 PF02469 Fasciclin domain family

OGFLEX_10485 PF03016 Exostosin family Exostosin family protein

OGFLEX_10494 Unknown

Domain of unknown

OGFLEX_10512 PF14494 function (DUF4436) Unknown

PPPDE putative

OGFLEX_10556 PF05903 peptidase domain Unknown

OGFLEX_10589 Unknown

SM0018

OGFLEX_10729 4 Ring finger Breast cancer susceptibility1 SPFH/Band 7/PHB domain-

SPFH domain / Band containing membrane-

OGFLEX_12088 PF01145 7 family associated protein family

OGFLEX_12096 Unknown

OGFLEX_12100 Recovery protein 3 family

OGFLEX_12102 Unknown

Cyclic nucleotide-

OGFLEX_12103 PF00027 binding domain Unknown

MutL C terminal DNA mismatch repair protein,

OGFLEX_12106 PF08676 dimerisation domain putative

OGFLEX_12109 Unknown

NOL1/NOP2/sun NOL1/NOP2/sun family

OGFLEX_12898 PF01189 family protein

Polynucleotidyl transferase,

CAF1 family ribonuclease H-like

OGFLEX_13054 PF04857 ribonuclease superfamily protein

Protein kinase Unknown putative protein

OGFLEX_13080 PF00069 domain kinase family

Uncharacterized

conserved protein

OGFLEX_13153 PF10184 (DUF2358) Unknown

Ribonuclease III family

OGFLEX_13170 protein

Cyclophilin type Peptidyl-prolyl cis-trans

OGFLEX_13210 PF00160 peptidyl-prolyl isomerases cis-trans

isomerase/CLD

OGFLEX_13554 Unknown

Protein of unknown

OGFLEX_13619 PF07059 function (DUF1336) Unknown

Adenylyl- / guanylyl

SM0004 cyclase, catalytic Unknown putative signalling

OGFLEX_13648 4 domain functions

Leucine rich repeat,

SM0036 ribonuclease

OGFLEX_14038 8 inhibitor type Unknown

OGFLEX_14319 PF00258 Flavodoxin Putative P450 family

DNA-binding domain

in plant proteins

SM0038 such as APETALA2

OGFLEX_14401 0 and EREBPs AINTEGUMENTA-like 5

OGFLEX_14433 Unknown

N -terminal region of

Chorein,

a TM vesicle-

OGFLEX_14435 PF12624 mediated sorter Autophagy 2

2OG-Fe(II)

oxygenase 2OG-Fe(II) oxygenase

OGFLEX_14464 PF13532 superfamily superfamily

OGFLEX_14477 PF04117 Mpv17 / PMP22 Peroxisomal membrane family 22 kDa (Mpv17/PMP22)

family

OGFLEX_14502 PF03265 Deoxyribonuclease II Unknown

Proliferating cell

nuclear antigen,

OGFLEX_14506 PF02747 C-terminal domain Unknown

Protein of unknown Protein of unknown function

OGFLEX_14507 PF04720 function (DUF506) (DUF506)

OGFLEX_14508 PF00023 Ankyrin repeat Unknown

OGFLEX_14509 Unknown

Formin Homology 2

OGFLEX_14510 PF02181 Domain Actin-binding FH2 protein

OGFLEX_14511 PF00458 WHEP-TRS domain Unknown

OGFLEX_14513 Unknown

OGFLEX_14514 Unknown

OGFLEX_14515 Unknown

CASC3/Barentsz CASC3/Barentsz

OGFLEX_14516 PF09405 eIF4AIII binding eIF4AIII binding

RecF/RecN/SMC N Structural maintenance of

OGFLEX_14517 PF02463 terminal domain chromosome 3

Right handed beta ATPase, AAA-type, CDC48

OGFLEX_14518 PF13229 helix region family

OGFLEX_14519 Unknown

OGFLEX_14521 PF00069 Protein kinase Unknown putative protein domain kinase family

OGFLEX_14522 PF12906 RING-variant domain Unknown

OGFLEX_14523 Unknown

OGFLEX_14525 Unknown

Protein of unknown

OGFLEX_14526 PF10049 function (DUF2283) Unknown

OGFLEX_14527 Unknown

OGFLEX_14528 Unknown

FKBP-type peptidyl-prolyl

cis-trans isomerase

OGFLEX_14529 PF13414 TPR repeat family protein

OGFLEX_14530 Unknown

Homeodomain-like/winged-

helix

OGFLEX_14532 DNA-binding family protein

OGFLEX_14533 Unknown

OGFLEX_14534 Unknown

OGFLEX_14535 Unknown

OGFLEX_14536 Unknown

OGFLEX_14537 PF08373 RAP domain RAP family

N-acylglucosamine

2-epimerase

(GlcNAc 2- Putative cellobiose

OGFLEX_14538 PF07221 epimerase) epimerase OGFLEX_14539 Unknown

OGFLEX_14541 Unknown

OGFLEX_14542 Unknown

OGFLEX_14545 Unknown

OGFLEX_14546 Unknown

OGFLEX_14547 Unknown

Up -frameshift Unknown RNA binding

OGFLEX_14548 PF04050 suppressor protein family

OGFLEX_14549 Unknown

Diacylglycerol Diacylglycerol acyltransferase

OGFLEX_14550 PF03982 acyltransferase family

WD40/YVTN

WD domain, G-beta repeat-like-containing

OGFLEX_14551 PF00400 repeat domain family

OGFLEX_14552 FMN binding family

PAS domain-containing

Protein kinase protein tyrosine kinase

OGFLEX_14553 PF00069 domain family

Hemerythrin HHE

cation binding

OGFLEX_14554 PF01814 domain Unknown

Protein of unknown

OGFLEX_14555 PF12051 function (DUF3533) Unknown

Protein of unknown

OGFLEX_14556 PF08574 function (DUF1762) Unknown OGFLEX_14557 PF03600 Citrate transporter Unknown

OGFLEX_14559 Unknown

OGFLEX_14560 PF09359 VTC domain Unknown

OGFLEX_14561 Unknown

Peptidase family

OGFLEX_14562 PF01434 M41 FTSH protease 10 family

CorA-like Mg2+

OGFLEX_14563 PF01544 transporter protein Magnesium transporter 9

WD domain, G-beta Transducin family protein

OGFLEX_14564 PF00400 repeat / WD-40 repeat family

OGFLEX_14566 PF03016 Exostosin family Exostosin family protein

OGFLEX_14567 PF01124 MAPEG family Unknown

OGFLEX_14568 Unknown

Supplementary Table 10: Gene families expanded in multicellular volvocines.

For column 3 only the most common domain within a family is listed. Column

4 includes Chlamydomonas reinhardtii v5.5 annotations.

Gene Family PFAM Description of Chlamydomonas reinhardtii

domain Domain derived annotation

OGFLEX_1000 PF12796 Ankyrin repeats Unknown

(3 copies)

OGFLEX_1010 PF03372 Endonuclease/Ex Unknown

onuclease/phosp

hatase family

OGFLEX_1022 PF00078 Reverse Unknown transcriptase

(RNA-dependent

DNA polymerase)

OGFLEX_1097 PF02889 Sec63 Brl domain U5 small nuclear

ribonucleoprotein helicase,

putative

OGFLEX_1274 PF09359 VTC domain Major Facilitator Superfamily

with SPX

(SYG1/Pho81/XPR1)

OGFLEX_1334 PF03054 tRNA methyl tRNA (5-

transferase methylaminomethyl-2-

thiouridylate)-

methyltransferases

OGFLEX_1436 PF02676 Methyltransferase Met-10+ like family protein /

TYW3 kelch repeat-containing

protein

OGFLEX_1480 PF00069 Protein kinase casein kinase 1-like protein

domain 2

OGFLEX_1509 PF08029 HisG, C-terminal ATP phosphoribosyl

domain transferase 2

OGFLEX_1575 PF00069 Protein kinase Protein kinase superfamily

domain protein/mixed-lineage

kinase and Raf protein

kinase-like

OGFLEX_1581 PF00580 UvrD/REP P-loop containing helicase N- nucleoside triphosphate

terminal domain hydrolases superfamily

protein

OGFLEX_1614 PF00504 Chlorophyll A-B photosystem I light

binding protein harvesting complex gene 3

OGFLEX_1895 PF02922 Carbohydrate- limit dextrinase

binding module

48 (Isoamylase

N-terminal

domain)

OGFLEX_1966 PF00462 Glutaredoxin Glutaredoxin family protein

OGFLEX_2166 PF00112 Papain family Cysteine proteinases

cysteine protease superfamily protein

OGFLEX_2171 PF00106 short chain NAD(P)-binding Rossmann-

dehydrogenase fold superfamily protein

OGFLEX_3446 Unknown

OGFLEX_3451 PF13589 Histidine kinase-, MUTL protein homolog 3

DNA gyrase B-,

and HSP90-like

ATPase

OGFLEX_4309 PF00232 Glycosyl beta glucosidase 29

hydrolase family

1

OGFLEX_4312 PF03992 Antibiotic Unknown

biosynthesis monooxygenase

OGFLEX_4363 PF12499 Pherophorin Unknown

OGFLEX_4885 PF01494 FAD binding zeaxanthin epoxidase (ZEP)

domain

OGFLEX_4887 PF01485 IBR domain RING/U-box superfamily

protein

OGFLEX_4915 PF06650 Protein of Unknown

unknown function

(DUF1162

OGFLEX_5737 PF03351 DOMON domain Auxin-responsive family

protein

OGFLEX_6377 PF12499 Pherophorin Pherophorin

OGFLEX_6403 Unknown

Supplementary Table 11: Genes identified using the Potion pipeline and

Codeml M7/8 model with significant (q=0.05) evidence of positive selection.

Descriptions are from Chlamydomonas reinhardtii v5.5 annotations. Blank cells indicate that the gene has not been characterised.

Chlamydomonas Description

Reference Gene

Cre01.g004500 isopropyl malate isomerase large subunit 1

Cre01.g016050 Pre-mRNA-processing-splicing factor

Cre01.g034400 NagB/RpiA/CoA transferase-like superfamily protein

Cre01.g043350 Pheophorbide a oxygenase family protein with Rieske [2Fe-2S] domain

Cre01.g055420 Protein phosphatase 2A, regulatory subunit PR55

Cre02.g080200 Transketolase

Cre02.g091050 Aldolase superfamily protein

Cre02.g098450 eukaryotic translation initiation factor 2 gamma subunit

Cre02.g113400

Cre02.g118900 Metal-dependent protein hydrolase

Cre02.g143200 Alanyl-tRNA synthetase

Cre03.g168450 TCP-1/cpn60 chaperonin family protein

Cre03.g178150 calmodulin 6

Cre03.g180750 methionine synthase 3

Cre03.g183200 adenosine monophosphate kinase

Cre03.g185550 sedoheptulose-bisphosphatase

Cre03.g193850 Succinyl-CoA ligase, alpha subunit

Cre04.g214500 isocitrate dehydrogenase

Cre04.g216600 regulatory particle triple-A ATPase 6A

Cre04.g229300 rubisco activase

Cre05.g233800 glycyl-tRNA synthetase / glycine--tRNA ligase

Cre05.g234639 TCP-1/cpn60 chaperonin family protein

Cre05.g234652 plastid transcriptionally active 17

Cre06.g267350

Cre06.g269950 ATPase, AAA-type, CDC48 protein

Cre06.g278142 Methylthiotransferase

Cre06.g281250 Cyclopropane-fatty-acyl-phospholipid synthase Cre06.g294950 NAD(P)-binding Rossmann-fold superfamily protein

Cre06.g308350 flower-specific, phytochrome-associated protein phosphatase 3

Cre06.g308500 carbamoyl phosphate synthetase A

Cre07.g325500 magnesium-chelatase subunit chlH, chloroplast, putative / Mg-

protoporphyrin IX chelatase, putative (CHLH)

Cre07.g329700 regulatory particle AAA-ATPase 2A

Cre07.g340350

Cre08.g359350 acetyl Co-enzyme a carboxylase biotin carboxylase subunit

Cre08.g375500 glutamine-fructose-6-phosphate transaminase (isomerizing)s;sugar

binding;transaminases

Cre08.g378850 Phosphoribosyltransferase family protein

Cre09.g394436 Inorganic H pyrophosphatase family protein

Cre09.g405850 NADH dehydrogenase subunit 7

Cre09.g410950 nitrate reductase 2

Cre09.g415550

Cre10.g430050 ubiquitin-conjugating enzyme 22

Cre10.g433000 glycine-tRNA ligases

Cre10.g434750 ketol-acid reductoisomerase

Cre10.g439150 regulatory particle triple-A ATPase 5A

Cre10.g449250

Cre10.g461050 vacuolar ATP synthase subunit A

Cre11.g476850

Cre11.g481450 ATPase, F0 complex, subunit B/B\', bacterial/chloroplast

Cre12.g483950 Lactate/malate dehydrogenase family protein Cre12.g484000 acetyl-CoA carboxylase carboxyl transferase subunit beta

Cre12.g490350 4-hydroxy-3-methylbut-2-enyl diphosphate synthase

Cre12.g494900 protein phosphatase X 2

Cre12.g508150 chromatin-remodeling protein 11

Cre12.g514200 Glucose-methanol-choline (GMC) oxidoreductase family protein

Cre12.g560900 FAD/NAD(P)-binding oxidoreductase family protein

Cre13.g564250 eukaryotic translation initiation factor 3A

Cre13.g568650 Ribosomal protein S3Ae

Cre13.g572700 FIZZY-related 3

Cre13.g573351 Ribosomal protein S5 domain 2-like superfamily protein

Cre13.g582201 nudix hydrolase homolog 24

Cre13.g584400

Cre13.g587050 eukaryotic release factor 1-3

Cre13.g606050 Tyrosyl-tRNA synthetase, class Ib, bacterial/mitochondrial

Cre14.g612700 Galactose oxidase/kelch repeat superfamily protein

Cre14.g625400 regulatory particle triple-A 1A

Cre16.g666150

Cre16.g672385 histidinol phosphate aminotransferase 1

Cre16.g676197 26S proteasome regulatory subunit S2 1B

Cre17.g697450 RNA polymerase I-associated factor PAF67

Cre17.g703900 Transducin/WD40 repeat-like superfamily protein

Cre17.g734200 Pyridoxal phosphate (PLP)-dependent transferases superfamily

protein

Cre17.g734450 Ribosomal protein L19 family protein

Supplementary Table 12: Gene ontology (GO) enrichment analysis of gene identified by Potion as positively selected (p<0.05). Gene ontology terms were derived from Chlamydomonas reinhardtii v5.5 annotations.

Enriched GO terms P-vale glycine-tRNA ligase activity 0,00089293 response to bacterium 0,0013308 defense response to bacterium 0,0029941 embryo sac development 0,0035238 carboxylic acid metabolic process 0,0035564 organic acid metabolic process 0,0035564 oxoacid metabolic process 0,0035564 ligase activity 0,0041871 cellular ketone metabolic process 0,0044649 response to biotic stimulus 0,0049168 glycoside biosynthetic process 0,0049527 proteasomal protein catabolic process 0,0049527 gametophyte development 0,0050684 acetyl-CoA carboxylase activity 0,0051529 response to cadmium ion 0,0078005 multicellular organismal development 0,010828 ligase activity, forming carbon-carbon bonds 0,012392

CoA carboxylase activity 0,012392 response to other organism 0,014293 amino acid activation 0,016176 tRNA aminoacylation 0,016176 tRNA aminoacylation for protein translation 0,016176 porphyrin biosynthetic process 0,016176 tetrapyrrole biosynthetic process 0,01774 fatty acid omega-oxidation 0,018821 glycyl-tRNA aminoacylation 0,018821 sucrose biosynthetic process 0,018821 meristem growth 0,018821 isocitrate metabolic process 0,018821 cellular response to redox state 0,018821 cell-cell signaling 0,018821 response to redox state 0,018821 response to metal ion 0,019039 multi-organism process 0,020574 ligase activity, forming aminoacyl-tRNA and related 0,023233 compounds ligase activity, forming carbon-oxygen bonds 0,023233 aminoacyl-tRNA ligase activity 0,023233 glycoside metabolic process 0,023685 defense response 0,024101

5-methyltetrahydropteroyltriglutamate-homocysteine 0,030189

S-methyltransferase activity chlorophyllide a oxygenase activity 0,030189 6-phosphogluconolactonase activity 0,030189 sedoheptulose-bisphosphatase activity 0,030189 transketolase activity 0,030189 oxidoreductase activity, acting on single donors with 0,030189 incorporation of molecular oxygen, incorporation of one atom of oxygen (internal monooxygenases or internal mixed function oxidases)

ADP binding 0,030189 sugar binding 0,030189 ribulose-1,5-bisphosphate carboxylase/oxygenase 0,030189 activator activity isocitrate dehydrogenase (NADP+) activity 0,030189 mandelonitrile lyase activity 0,030189

5-methyltetrahydrofolate-dependent 0,030189 methyltransferase activity

5-methyltetrahydropteroyltri-L-glutamate-dependent 0,030189 methyltransferase activity enoyl-[acyl-carrier-protein] reductase (NADH) activity 0,030189 enoyl-[acyl-carrier-protein] reductase activity 0,030189 nitrate reductase activity 0,030189 ketol-acid reductoisomerase activity 0,030189 carbamoyl-phosphate synthase (glutamine- 0,030189 hydrolyzing) activity methionine synthase activity 0,030189 porphobilinogen synthase activity 0,030189 glutamine-fructose-6-phosphate transaminase 0,030189

(isomerizing) activity intramolecular transferase activity, transferring 0,030189 hydroxy groups fatty acid metabolic process 0,030597 modification-dependent macromolecule catabolic 0,032164 process modification-dependent protein catabolic process 0,032164 ubiquitin-dependent protein catabolic process 0,032164 monocarboxylic acid metabolic process 0,032197 proteolysis involved in cellular protein catabolic 0,033778 process proteasome regulatory particle, base subcomplex 0,034436 porphyrin metabolic process 0,035517 alanyl-tRNA aminoacylation 0,037295 root cap development 0,037295 proteasome core complex assembly 0,037295 root morphogenesis 0,037295 maintenance of root meristem identity 0,037295 nitric oxide biosynthetic process 0,037295 nitric oxide metabolic process 0,037295 fatty acid biosynthetic process 0,037902 tetrapyrrole metabolic process 0,037902 response to inorganic substance 0,039546 carbohydrate biosynthetic process 0,040367 cellular nitrogen compound biosynthetic process 0,041151 chlorophyll biosynthetic process 0,04378

Supplementary Table 13: Highly conserved elements identified by

PhastCONS sorted by annotation feature. The features were derived from

Chlamydomonas reinhardtii v5.5 annotations.

Annotation Feature Percentage of Conserved Bases

(811 033bp)

Gene 99,27%

Intergenic 0,73%

3 prime 0,28%

5 prime 0,22%

Exon 70,89%

CDS 70,52%

Intron 28,37%

*Note: In alternative transcripts 3’ and 5’ regions may overlap with CDS regions from another alternative transcript and thus a degree of redundancy is included within these calculations. Highly conserved elements from transcribed but un-translated regions are estimated to amount to 0,37%.

Supplementary Table 14: Intergenic regions and nearby gene identified as highly conserved in the volvocine lineage. Conserved regions are referenced to the Chlamydomonas reinhardtii genome. In column 4 the distance of the conserved region to a coding locus is indicated (in base-pairs) such that a (-) symbol indicates that the locus is upstream of the coding locus and no symbol indicates that the region is downstream of the coding locus.

Chromosom Conserv Nearest Distance Description e ed Gene from

Region Gene

(bp)

Chr1 1148- Cre01.g000 -17596

1170 017

Chr1 1958- Cre01.g000 -16777

1989 017

Chr1 2004- Cre01.g000 -16616

2150 017

Chr1 2164- Cre01.g000 -16577

2189 017

Chr1 2395- Cre01.g000 -16325

2441 017

Chr1 2648- Cre01.g000 -15864

2902 017

Chr1 2933- Cre01.g000 -15620

3146 017

Chr1 3250- Cre01.g000 -15467

3299 017 Chr1 16629- Cre01.g000 -2040

16726 017

Chr1 17000- Cre01.g000 -1700

17066 017

Chr1 17222- Cre01.g000 -1485

17281 017

Chr1 17301- Cre01.g000 -1418

17348 017

Chr1 17425- Cre01.g000 -1269

17497 017

Chr1 17512- Cre01.g000 -1084

17682 017

Chr1 17699- Cre01.g000 -1004

17762 017

Chr1 17885- Cre01.g000 -679

18087 017

Chr1 18348- Cre01.g000 -385

18381 017

Chr1 18427- Cre01.g000 -272

18494 017

Chr1 18534- Cre01.g000 -127

18639 017

Chr1 3126208- Cre01.g019 298 sumo conjugation

3126237 450 enzyme 1 Chr1 3126243- Cre01.g019 333

3126274 450

Chr1 3266665- Cre01.g020 1498

3266702 650

Chr1 4858662- Cre01.g033 -6758 maternal effect embryo

4858762 700 arrest 18

Chr1 6221036- Cre01.g044 3082

6221054 450

Chr1 6711070- Cre01.g048 904 RNA helicase, putative

6711145 200

Chr10 1629927- Cre10.g429 -2906

1629966 750

Chr10 1629995- Cre10.g429 -2840

1630032 750

Chr10 1630207- Cre10.g429 -2645

1630227 750

Chr10 1631353- Cre10.g429 -1500

1631372 750

Chr11 479595- Cre11.g467 1159 Calcium-dependent

479627 595 lipid-binding (CaLB

domain) family protein

Chr11 1888175- Cre11.g468 345

1888203 100

Chr11 3521801- Cre11.g481 -952 3521854 375

Chr12 365823- Cre12.g494 -155

365849 252

Chr12 5562351- Cre12.g531 -1269 eukaryotic translation

5562410 550 initiation factor 2 beta

subunit

Chr12 5562763- Cre12.g531 -891

5562788 550

Chr12 6219399- Cre12.g536 211

6219432 683

Chr12 6219506- Cre12.g536 81

6219562 683

Chr12 9716381- Cre12.g557 1832

9716416 503

Chr13 2602283- Cre13.g581 2164

2602367 200

Chr13 3005242- Cre13.g584 -1120

3005263 250

Chr13 3005398- Cre13.g584 -1276

3005435 250

Chr13 3005560- Cre13.g584 -1438

3005628 250

Chr13 3005740- Cre13.g584 -1618

3005766 250 Chr13 3005806- Cre13.g584 -1684

3005867 250

Chr13 3005944- Cre13.g584 -1822

3005972 250

Chr14 1881883- Cre14.g620 -4860

1881971 702

Chr14 1882292- Cre14.g620 -5269

1882320 702

Chr14 1882933- Cre14.g620 -5910

1882952 702

Chr14 1883022- Cre14.g620 -5999

1883045 702

Chr14 4124838- Cre14.g634 -611

4124905 193

Chr15 1754425- Cre15.g643 1361

1754438 515

Chr16 1072640- Cre16.g649 -1196

1072668 500

Chr16 1074963- Cre16.g649 935

1075035 550

Chr16 2491796- Cre16.g660 -449

2491852 850

Chr16 2492003- Cre16.g660 -656

2492046 850 Chr16 6228369- Cre16.g675 -375

6228387 050

Chr16 6228409- Cre16.g675 -415

6228427 050

Chr16 6228581- Cre16.g675 -587

6228605 050

Chr17 4547861- Cre17.g732 2751

4547933 700

Chr17 4885028- Cre17.g734 -257 translocon outer

4885089 300 complex protein 120

Chr2 6161018- Cre02.g113 -1144 ataurora3

6161173 600

Chr2 6161454- Cre02.g113 -1580

6161502 600

Chr2 6162350- Cre02.g113 1752

6162368 626

Chr2 9199259- Cre02.g143 255

9199274 647

Chr4 1819109- Cre04.g211 1087 Chlorophyll A-B binding

1819129 850 family protein

Chr4 1880001- Cre04.g211 1252

1880082 850

Chr4 1881351- Cre04.g217 -1471

1881432 900 Chr4 2179600- Cre04.g219 1117

2179636 800

Chr4 2180593- Cre04.g219 1591

2180643 851

Chr4 2270007- Cre04.g220 -941 K+ efflux antiporter 2

2270043 200

Chr4 2824610- Cre04.g224 1114 Actin-like ATPase

2824693 150 superfamily protein

Chr5 685313- Cre05.g230 1440 SNF2 domain-

685348 800 containing protein /

helicase domain-

containing protein / F-

box family protein

Chr5 714533- Cre05.g230 264 Regulator of

714606 600 chromosome

condensation (RCC1)

family protein

Chr5 715213- Cre05.g230 944

715284 600

Chr5 3482251- Cre05.g243 848

3482286 358

Chr5 3482295- Cre05.g243 788

3482346 358

Chr6 3272450- Cre06.g276 19 histone H2A 10 3272470 950

Chr6 6607837- Cre06.g276 243

6607911 950

Chr7 2239598- Cre07.g327 186

2239705 317

Chr7 5851183- Cre07.g353 285

5851255 600

Chr7 6205920- Cre07.g356 508

6205979 450

Chr8 533541- Cre08.g359 -30

533581 166

Chr8 3473462- Cre08.g374 -3949

3473558 950

Chr8 3761889- Cre08.g377 606 chromatin remodeling

3761931 200 factor CHD3 (PICKLE)

Chr8 4810011- Cre08.g384 622 P-loop containing

4810056 355 nucleoside triphosphate

hydrolases superfamily

protein

Chr8 4810082- Cre08.g384 577

4810101 355

Chr8 5030323- Cre08.g386 -6689

5030388 100

Chr8 5030444- Cre08.g386 -6810 5030656 100

Chr8 5030729- Cre08.g386 -7095

5030752 100

Chr8 5030778- Cre08.g386 -7144

5030875 100

Chr8 5030934- Cre08.g386 -7300

5031004 100

Chr9 584478- Cre09.g403 944

584579 700

Chr9 2191673- Cre09.g393 3

2191696 100

Chr9 2344681- Cre09.g392 89

2344706 105

Chr9 2573667- Cre09.g390 867 Protein phosphatase 2C

2573687 678 family protein

Chr9 3997546- Cre09.g392 -2250

3997661 840

Chr9 3997748- Cre09.g392 -2112

3997799 840

Chr9 6120006- Cre09.g405 535 Pseudouridine synthase

6120075 150 family protein

Chr9 6429892- Cre09.g407 -1051 Ribosomal protein S5

6429918 100 family protein

Chr9 6475030- Cre09.g407 2135 alpha/beta-Hydrolases 6475058 300 superfamily protein

Chr9 6817306- Cre09.g409 -1469

6817396 728

Supplementary Table 15: Genes evolving more rapidly in the multicellular volvocines than in Chlamydomonas reinhardtii but not necessarily faster than the neutral rate. Genes are sorted by putative functions into related to signalling, ubiquitination (the ubiquitin proteasomal pathway), membrane trafficking, cell cycle regulators, chaperone activity, integral components of the cytoskeleton, microtubule interacting kinesins and genes coding for DNA interacting proteins

Gene Name/symbol/Description Chlamydomonas reinhardtii

identifier

Signalling

Calcium-Ion Dependant or related:

Calmodulin 4 (CAM4) Cre06.g292150

Centrin 2 (CEN2) Cre01.g055428

Kinesin-like calmodulin-binding Cre06.g278125

(zwickel)

Membrane Trafficking:

RAB GTPase homolog B1C member Cre09.g390763 of RAB gene family (RABB1C)

Endomembrane protein 70 protein Cre01.g024350 family Major facilitator superfamily protein Cre16.g688850

Alpha-adaptin (alpha-ADR) Cre12.g488850

Sorting nexin 2B (SNX2b) Cre07.g326450

Importin alpha isoform 1 (IMP) Cre04.g215850

Importin alpha isoform 9 (IMPA-9) Cre03.g161100

Transporter associated with antigen Cre02.g083354 processing protein 1 (TAP1)

Exostosin family protein Cre02.g116600

Serine/Threonine Kinases:

Leucine-rich repeat receptor-like Cre06.g269200 protein kinase family protein (FLS2)

Protein kinase family proteins: Cre06.g281100

Protein kinase superfamily protein Cre13.g568050

(OST1)

Protein kinase superfamily protein Cre16.g673821

PAS-domain containing protein Cre13.g563300 tyrosine kinase family protein

Serine/threonine protein kinase Rio1 Cre13.g580900

Cytoskeleton/Microtubule/Microtubule motor activity

Myosin 1 (MYO1) Cre09.g416250

Tubulin beta chain 2 (TUB2) Cre12.g542250

Tubulin beta chain 2 (TUB2) Cre13.g549550

Spc97 / Spc98 family of spindle pole Cre12.g525500 body (SBP) P-loop containing nucleoside Cre10.g427700 triphosphate hydrolases superfamily protein

P-loop containing nucleoside Cre13.g568450 triphosphate hydrolases superfamily protein

Cell Cycle

Retinoblastoma-related 1 (RB/MAT3) Cre06.g255450

Regulator of chromosome Cre16.g691050 condensation family with FYVE zinc finger domain (RCC1)

Cell differentiation Rcd1-like protein Cre12.g540400

Cyclin-dependent kinase B1 (CDKB1) Cre08.g372550

Ataurora1 (AUR1) Cre17.g719450

Transducin/WD40 repeat-like Cre06.g255400 superfamily protein

Transducin/WD40 repeat-like Cre12.g522450 superfamily protein (ubiquitin-related)

Transducin family protein / WD-40 Cre17.g703900 repeat family protein (CUL4 RING ubiquitin ligase complex)

Ubiquitination

Ubiquitin activating enzyme 2 (UBA2) Cre09.g386400

Ubiquitin-protein ligase 1 (UPL1) Cre10.g433900 Chaperonins

Heat shock protein 70A (HSP70A) Cre08.g372100

Heat shock protein 70 family member Cre02.g080700

(BIP2)

DNA/RNA interacting

Topoisomerase II (TOPII) Cre01.g009250

DNA-directed RNA polymerase family Cre02.g075150 protein (NRPB2)

RNA polymerase II large subunit Cre16.g680900

(NRPB1)

MEI2-like protein 5 RNA binding Cre07.g338150 protein (AML5)

DNA/RNA-binding protein Kin17 Cre16.g654350

Squamosa promoter binding protein- Cre09.g390023 like 1 (SPL1)

LA RNA-binding protein Cre10.g441200

Nuclear RNA polymerase C2 (NRPC2) Cre12.g542800

Decapping 2 (DCP2) Cre07.g344850

Myb domain protein 3r-5 (MYB3R-5) Cre01.g034350

Supplementary table 16: Significantly enriched annotations associated with accelerating genes. Included are gene ontology terms, MapMAn annotations and KEGG pathways.

Annotation Category P-value /Level

Gene Ontology terms deoxyribonucleotide metabolic process Biological 0.0035976

process

(BP) response to high light intensity BP 0.014487 nucleobase, nucleoside and nucleotide BP 0.01457 metabolic process nucleoside phosphate metabolic process BP 0.020628 nucleotide metabolic process BP 0.020628 acetate metabolic process BP 0.025094 cellular monovalent inorganic cation BP 0.025094 homeostasis deoxyribonucleoside triphosphate BP 0.025094 biosynthetic process deoxyribonucleoside triphosphate metabolic BP 0.025094 process

ER-associated protein catabolic process BP 0.025094 generative cell differentiation BP 0.025094

L-phenylalanine biosynthetic process BP 0.025094 maltose biosynthetic process BP 0.025094 photosynthetic electron transport in BP 0.025094 cytochrome b6/f regulation of cellular pH BP 0.025094 regulation of intracellular pH BP 0.025094 response to hydroperoxide BP 0.025094 response to lipid hydroperoxide BP 0.025094 response to non-ionic osmotic stress BP 0.025094

S-adenosylmethionine biosynthetic process BP 0.025094

S-adenosylmethionine metabolic process BP 0.025094 response to light intensity BP 0.034549 response to temperature stimulus BP 0.041941 regulation of cell cycle BP 0.048085 aromatic amino acid family biosynthetic BP 0.049569 process, prephenate pathway actin filament-based movement BP 0.049569 cytoskeleton-dependent intracellular BP 0.049569 transport deoxyribonucleoside diphosphate metabolic BP 0.049569 process deoxyribonucleotide biosynthetic process BP 0.049569

L-phenylalanine metabolic process BP 0.049569 monovalent inorganic cation homeostasis BP 0.049569 nucleoside diphosphate metabolic process BP 0.049569 nucleotide transport BP 0.049569 phosphoinositide-mediated signaling BP 0.049569 purine nucleotide transport BP 0.049569 regulation of pH BP 0.049569 ribonucleoside-diphosphate reductase Molecular 0.0016465 activity function

(MF) oxidoreductase activity, acting on CH or CH2 MF 0.0093643 groups, disulfide as acceptor acid-thiol ligase activity MF 0.0093643

CoA-ligase activity MF 0.0093643 ligase activity, forming carbon-sulfur bonds MF 0.015197 oxidoreductase activity, acting on CH or CH2 MF 0.015197 groups motor activity MF 0.020867 copper ion binding MF 0.028122

D-arabinono-1,4-lactone oxidase activity MF 0.040881 electron transporter, transferring electrons MF 0.040881 from cytochrome b6/f complex of photosystem II activity

GDP-mannose 3,5-epimerase activity MF 0.040881 oxidoreductase activity, acting on the CH- MF 0.040881

OH group of donors, oxygen as acceptor acetate-CoA ligase activity MF 0.040881 acireductone synthase activity MF 0.040881 aconitate hydratase activity MF 0.040881 acyl-[acyl-carrier-protein] hydrolase activity MF 0.040881 anion:cation symporter activity MF 0.040881 arogenate dehydratase activity MF 0.040881 beta-amylase activity MF 0.040881 cyclin binding MF 0.040881

GDP binding MF 0.040881 homoserine dehydrogenase activity MF 0.040881 methionine adenosyltransferase activity MF 0.040881 prephenate dehydratase activity MF 0.040881 sodium:dicarboxylate symporter activity MF 0.040881 ubiquitin activating enzyme activity MF 0.040881

MapMan Term Level cell 1 0.0034654 nucleotide metabolism 1 0.023417 transport 1 0.045198

RNA.transcription 2 0.01709 cell.organisation 2 0.022647 cell.division 2 0.026768 transport.metabolite transporters at the 2 0.033262 mitochondrial membrane nucleotide metabolism.deoxynucleotide 3 0.0002380 metabolism.ribonucleoside-diphosphate 8 reductase protein.targeting.nucleus 3 0.0024086 lipid metabolism.FA synthesis and FA 3 0.015743 elongation.ACP thioesterase amino acid metabolism.synthesis.aspartate 3 0.02801 family nucleotide metabolism.phosphotransfer and 3 0.031248 pyrophosphatases.guanylate kinase nucleotide metabolism.synthesis.PRS-PP 3 0.031248

PS.lightreaction.cytochrome b6/f 3 0.046518 amino acid metabolism.synthesis.aromatic 4 0.011686 aa.phenylalanine redox.ascorbate and 4 0.011686 glutathione.ascorbate.GME major CHO 4 0.023255 metabolism.degradation.starch.starch cleavage amino acid metabolism.synthesis.aspartate 4 0.046045 family.misc protein.degradation.ubiquitin.E1 4 0.046045

Supplementary table 17: CDS sequences accelerating in the multicellular volvocines relative to Chlamydomonas reinhardtii.

Gene Description Chlamydomonas ID

Endomembrane protein 70 protein family Cre01.g024350

ARM repeat superfamily protein Cre01.g034000

Cre01.g053450

DNA-directed RNA polymerase family protein Cre02.g075150 Cre02.g098300

Phosphatidylinositol 3- and 4-kinase family protein Cre02.g100300

Cre03.g155300

Cre03.g167550

Cre04.g217700

Cre04.g217911

Cre06.g250300

Cre06.g278212

Cre07.g324200 magnesium-chelatase subunit chlH, chloroplast, putative / Cre07.g325500

Mg-protoporphyrin IX chelatase, putative (CHLH)

Cre07.g330750

Cre08.g358150 nucleotide transporter 1 Cre08.g358526 ubiquitin activating enzyme 2 Cre09.g386400

Cre09.g391050 ribosomal protein S1 Cre09.g394750 vacuolar proton ATPase A2 Cre09.g402500 ubiquitin-protein ligase 1 Cre10.g433900

Root hair defective 3 GTP-binding protein (RHD3) Cre10.g439600

Cre11.g481750 ribonucleotide reductase 1 Cre12.g492950

Cre12.g512000

Ribosomal protein S5/Elongation factor G/III/V family Cre12.g516200 protein

Cre12.g520450

Spc97 / Spc98 family of spindle pole body (SBP) Cre12.g525500 component tubulin beta chain 1 Cre12.g542250 nuclear RNA polymerase C2 Cre12.g542800

SNF2 domain-containing protein / helicase domain- Cre14.g618860 containing protein transcription regulators Cre14.g632950 two-pore channel 1 Cre16.g665050

Cre16.g691753

Transducin/WD40 repeat-like superfamily protein Cre17.g703900

Histone superfamily protein Cre17.g711800

Cre17.g715300

Supplementary Table 18. Primers used for amplifications and sequencing of cyclin D genes of Tetrabaena socialis (TsCYCD1.3, TsCYCD2, TsCYCD3 and

TsCYCD4).

Primer Designation Sequences (5’-3’)

TsCYCD1.3_F1 CATGTCCATTGCGGTCAAGTA

TsCYCD1.3_R2 AGCTGCTGCAGGCTGAGG

TsCYCD1.3_3'RACE_F1 CGACGTCCATCCTAGACCGCTTCAC

TsCYCD1.3_5'RACE_R1 ACCTCCTCGTACTTGACCGCAATGG TsCYCD2_F1 GGATGGTGGAGGTGGTGACT

TsCYCD2R2 TGCTGTAGCCGTACGATAGGAA

TsCYCD2_3'RACE_F1 GTTGACTCGGACAACAACCCGCTCTACCTG

TsCYCD2_3'RACE_F2 AAGGGTCTCGTGTATGGTACAGCCCGTGAG

TsCYCD2_5'RACE_R1 GTACATGAGCGACAGCTCCGCCAGGAAGTT

TsCYCD2_5'RACE_R2 CAGTTCCAGGCTCGACTCATCCTCATCG

TsCYCD3_F1 CGTTCCTCGATCGCTTCCTA

TsCYCD3_R2 GAGTCCATGAGGGCAAGCTC

TsCYCD3_3'RACE_F1 GTCGCTTCTGCAGCTCCTGGCTCTAACCTG

TsCYCD3_3'RACE_F2 GACTCCGAGCTCACAGGCGTCAGCTACA

TsCYCD3_5'RACE_R1 TGCAGGAACATGGAGAGGAAGGTGTAGATG

G

TsCYCD3_IC_F2 CGTCGCTTCTGCAGCTCCTGGCTCT

TsCYCD3_IC_R3 CGATGCGGTGCAGGAACATGGAGAG

TsCYCD4_F1 AGACTCTCTTCGCCGCCACTT

TsCYCD4_R3 AAGCGCTGCAGGAAGGTGAG

Supplementary Table 19. Protein IDs, LXCXE motifs, and abbreviations of volvocine cyclin proteins used in this study.

Protein Species Name Protein IDs LXCXE Abbreviation

Name Motif

CYCD1.1 Chlamydomonas Cre11.g467772a LICTE CrCYCD1

reinhardtii

Tetrabaena scaffold_357b LQCLE TsCYCD1.1 socialis

Gonium GPECTOR_47g299a N. D GpCYCD1.1

pectorale

Volvox carteri Vocar20010063ma LTCTE VcCYCD1.1

CYCD1.2 T. socialis scaffold_357b LQCLE TsCYCD1.2

G. pectorale GPECTOR_47g300a LLCDE GpCYCD1.2

V. carteri Vocar20010067ma N. D VcCYCD1.2

CYCD1.3 T. socialis scaffold_221b N. D TsCYCD1.3

G. pectorale GPECTOR_47g301a N. D GpCYCD1.3

V. carteri Vocar20010127ma N. D VcCYCD1.3

CYCD1.4 G. pectorale GPECTOR_100g16a LLCTE GpCYCD1.4

V. carteri Vocar20013188ma LLCTE VcCYCD1.4

CYCD2 C. reinhardtii Cre06.g289750a LQCDE CrCYCD2

T. socialis scaffold_137b LACDE TsCYCD2

G. pectorale GPECTOR_41g668a LECEE GpCYCD2

V. carteri Vocar20013437ma LICEE VcCYCD2

CYCD3 C. reinhardtii Cre06.g298750a LFCGE CrCYCD3

T. socialis scaffold_137b LACDE TsCYCD3

G. pectorale GPECTOR_44g48a LECED GpCYCD3

V. carteri Vocar20007422ma LHCED VcCYCD3

CYCD4 C. reinhardtii Cre06.g259500a LDCTE CrCYCD4

T. socialis scaffold357b LGCTE TsCYCD4

V. carteri Vocar20007145ma LECSE VcCYCD4

CYCD5 G. pectorale GPECTOR_11g188a N. D GpCYCD5 CYCA1 C. reinhardtii Cre03.g207900a N. D CrCYCA1

V. carteri Vocar20005704ma N. D VcCYCA1

CYCB1 C. reinhardtii Cre08.g370401a N. D CrCYCB1

V. carteri Vocar20013188ma N. D VcCYCB1

CYCAB1 C. reinhardtii Cre10.g466200a N. D CrCYCAB1

V. carteri Vocar20015027ma N. D VcCYCAB1

N. D.: Not detected. aBased on Hanschen et al. (2016). bDetermined in this study.

Supplementary Table 20: The length of the linker region between the retinoblastoma A and B domain in Chlorophyte orthologues of MAT3/RB.

Species Length of Linker Region

Chlamydomonas reinhardtii 130

Yamagishiella unicocca 106

Tetrabaena socialis 76

Gonium pectorale 47

Eudorina spp 93

Pleodorina starrii 96

Volvox carteri f. nagariensis male 83

Volvox carteri f. nagariensis female 47

Supplementary Table 21: Genes identified in this study related to the ubiquitin proteasomal pathway

Chlamydomonas Description Role in the Comparative reinhardtii identifier ubiquitin Analysis

(where appropriate). proteasomal

pathway

Cre09.g386400 Ubiquitin Ubiquitin PhyloP

activating activation E1-type Whole Gene/

enzyme 2 Potion

Cre10.g430050 Ubiquitin- Ubiquitin Potion

conjugating conjugation E2-

enzyme 22 type

Cre10.g433900 Ubiquitin-protein Ubiquitin ligation PhyloP

ligase 1 to target protein Whole Gene

E3-type & CDS

Cre03.g150850 RING/U-box RING protein Family

superfamily similar to ubiquitin expanded at

protein ligase E3. the ORIMC

Related to RPM1- Ubiquitin ligation Family

Cre05.g231050 interacting to target protein gained at

protein 2 E3-type ORIMC

Cre12.g522450 WD40 repeat Similar to Cullin- PhyloP

transducin Ring ubiquitin Whole Gene

ligase complex transducin.

Cre06.g255400 WD40 repeat Similar to Cullin- PhyloP

transducin Ring ubiquitin Whole Gene

ligase complex

transducin.

Cre16.g672350* breast cancer Ubiquitin ligation Family

susceptibility1 E3-type gained at

ORIMC

Mib-herc2 Common in Domain

domain metazoan E3 expanded in

ligases. multicelled

taxa

Cre08.g372100 Heat shock Interplay with PhyloP

protein 70 (Hsp UPP through Whole Gene

70) family regulating protein

protein entry into the UPP

Cre02.g080700 heat shock Interplay with PhyloP

protein 70 UPP through Whole Gene

(Hsp70) family regulating protein & CDS

protein entry into the UPP

Similar to FTSH protease Triple-A ATPase Family

Cre01.g019850 10 type function gained at

ORIMC Cre10.g439150 Regulatory Triple-A ATPase Potion

particle triple-A of proteasome

ATPase 5A

Cre04.g216600 Regulatory Triple-A ATPase Potion

particle triple-A of proteasome

ATPase 6A

Cre07.g329700 Regulatory Triple-A ATPase Potion

particle AAA- of proteasome

ATPase 2A

Cre16.g676197 26S Proteasome Component of Potion

regulatory Proteasome

subunit S2 1B

*An orthologue in found in Chlamydomonas reinhardtii.

Supplementary Table 22: Optimising Gamma and Omega values based on

the phylogenetic information threshold (PIT) and highly conserved CDS

bases from chromosome 1 of Chlamydomonas reinhardtii.

Gamma 25 Gamma 20 Gamma 15 Gamma 10

Omega 12 Omega 15 Omega 30 Omega 30

PIT 11,11 12,14 13,74 14,64

CDS bases 72 224 62419 64 561 61 209

(bp)

Supplementary Table 23. Primers used for amplification and sequencing of

RB/MAT3 gene of Tetrabaena socialis (TsRB/MAT3).

Primer Designation Sequences (5’-3’)

TsMAT3_F1 GACTGCGATGAGACTGTTTCTG

TsMAT3_F3 CGAGCAAGACGAAGACGTCAAG

TsMAT3_F5 GTAATGGGGTTGCTGGCAAAAA

TsMAT3_F8 CAGTGCAGCAGCTCAGTCGT

TsMAT3_F17 AATTGGGCAGCAAGTTGGTCGG

TsMAT3_R2 GGCACAATAGACATACGCAAAA

TsMAT3_R6 GTCGAAGGCCTTGATGTGAAGC

TsMAT3_R7 TGCTCAATGGCCTCGTAAACCT

TsMAT3_3'RACE_F1 CAGCCGCAGTCACAGCAATCCATTT

TsMAT3_3'RACE_F2 AGCACAGGCATATCCCCCTTGAGAC

TsMAT3_3'RACE_F3 ACAGCCAGCAGAGTGCAGACAACGAAGAAG

TsMAT3_5'RACE_R1 TTTTGCCAGCAACCCCATTACCACCAC

TsMAT3_5'RACE_R2 TACGACGTTGGCCTCGCGAAAGAAGTC

Supplementary Table 24. Prochnik Table of Genes

Chla Chla Chlam Vol mydo mydo Goni ydom vox mona Chlamy mona um onas Goniu Volvox Volvox prot s v4 Volvox v1 JGI Tetrabaen domona s protei v5.5 m v2 Defline from ein JGI protein ID(s) a (ID)s s JGI protei n transc ID(s) ID(s) JGI nam protei defline n name ript e n name ID(s) ID(s) Protei n secre tion and mem brane traffic king Rab-related scaffol small Au9.Cr Vocar2 GTPase; RABA RABA Rab 19551 d0000 Rab- e03.g1 60379 000568 Rab-related 1 1 A 9 1.g861 related 89250 6m GTPase; .t1 GTPase TSOC_01 RabA/Rab11 2290-RA Ypt scaffol small small G- Au9.Cr Vocar2 RABB RABB V4/ 14883 d0000 TSOC_00 Rab- protein e02.g1 106534 001038 1 1 Rab 6 5.g307 6167-RA related (yptV4); 26100 4m B .t1 GTPase RabB/Rab2

scaffol small Au9.Cr Vocar2 Rab-related RABC RABC Rab 19551 d0000 Rab- e09.g3 66052 000372 GTPase; 1 1 C1 8 5.g394 related 86900 9m RabC/Rab18 .t1 TSOC_00 GTPase 6791-RA scaffol small Au9.Cr Vocar2 Rab-related RABC RABC Rab 19587 d0001 TSOC_01 Rab- e13.g5 78758 000246 GTPase; 2 2 C2 2 0.g942 1166-RA related 95450 6m RabC/Rab18 .t1 GTPase

Ypt scaffol small small G- Au9.Cr Vocar2 RABC RABC V3/ d0001 Rab- protein e11.g4 108490 001445 3 3 Rab 0.g942 related (yptV3); 82900 8m C3 .t1 TSOC_00 GTPase RabC/Rab18 4957-RA Ypt scaffol small small G- Au9.Cr Vocar2 RABD RABD V1/ d0007 Rab- protein 60490 e12.g5 104295 000850 1 1 Rab 5.g726 related (yptV1); 60150 7m D .t1 TSOC_00 GTPase RabD/Rab1 0131-RA Ypt scaffol small small G- Au9.Cr Vocar2 RABE RABE V2/ 19552 d0006 Rab- protein e15.g6 83299 000605 1 1 Rab 0 2.g924 related (yptV2); 41800 6m E .t1 TSOC_00 GTPase RabE/Rab8 9328-RA scaffol small Au9.Cr Vocar2 RABF RABF Rab d0000 Rab- 81259 e12.g5 108963 001435 RabF/Rab5 1 1 F 8.g301 TSOC_01 related 17400 0m .t1 0627-RA GTPase RABG1, small Rab- Ypt scaffol small G- Vocar2 related RABG RABG V5/ d0000 protein 79428 000791 GTPase; 1 1 Rab 4.g531 (yptV5); 3m small G .t1 RabG/Rab7 Rab- TSOC_00 related 6651-RA GTPase scaffol small Au9.Cr Vocar2 Rab-related RABH RABH Rab 19552 d0000 Rab- e01.g0 73845 000956 GTPase; 1 1 H 2 3.g233 TSOC_00 related 47550 3m RabH/Rab6 .t1 9040-RA GTPase scaffol small Rab-related RABL Au9.Cr Vocar2 RABL Rabl 19244 d0020 Rab- GTPase; 1/RA e17.g7 102687 000137 1 2A 1 9.g407 related Rab-related BL2A 22350 2m .t1 TSOC_01 GTPase GTPase 2595-RA RAB23, small Rab- scaffol Vocar2 related Rab-related Rab2 Rab2 Rab d0001 115722 000256 GTPase; GTPase; 3 3 23 0.g849 6m small Rab23 .t1 Rab- TSOC_00 related 9161-RA GTPase scaffol small Au9.Cr Vocar2 FAP1 FAP1 Fap 12919 d0000 Rab- Rab-related e01.g0 80291 000826 56 56 156 3 3.g206 TSOC_00 related GTPase 47950 2m .t1 2912-RA GTPase scaffol small Au9.Cr Vocar2 Rab-related RAB2 RAB2 Rab 19552 d0000 Rab- e17.g7 62839 001521 GTPase; 8 8 28 3 5.g394 TSOC_00 related 15050 7m Rab28-like .t1 8030-RA GTPase Qa- scaffol Au9.Cr Vocar2 SNARE Qa-SNARE, Syp 19596 d0005 SYP1 SYP1 e13.g5 89085 000227 Sso1/Sy Sso1/Syntaxi 1 9 9.g666 88550 0m ntaxin1, n1 (PM)-type .t1 Manual Build PM-type Qa- scaffol SNARE Au9.Cr Vocar2 Qa-SNARE, Syp 19541 d0002 protein, SYP3 SYP3 e16.g6 81752 000226 Sed5/Syntaxi 3 2 3.g53.t Sed5/Sy 92050 7m n5-family 1 TSOC_00 ntaxin5- 9357-RA family Qa- scaffol SNARE Au9.Cr Vocar2 Qa-SNARE, Syp 19540 d0000 protein, SYP4 SYP4 e12.g5 103767 000102 Tlg2/Syntaxi 4 1 2.g148 Tlg2/Syn 07450 3m n16-family 3.t1 TSOC_00 taxin16- 3932-RA family Qc- scaffol SNARE Au9.Cr Vocar2 Qc-SNARE, Syp 18901 d0022 protein, SYP5 SYP5 e17.g7 91993 000324 Syn8/Syntaxi 5 6 2.g484 Syn8/Sy 09350 9m n8-family .t1 TSOC_01 ntaxin8- 2559-RA family Qc- scaffol SNARE Au9.Cr Vocar2 Qc-SNARE, Syp 13055 d0002 protein, SYP6 SYP6 e06.g2 55748 000934 Tlg1/Syntaxi 6 9 1.g597 Tlg1/Syn 90100 1m n 6-family .t1 TSOC_01 taxin 6- 3462-RA family Qc- SNARE, SYP7- family; Au9.Cr Qc- SYP7 19540 e02.g0 scaffol SNARE Vocar2 1; SYP7 Syp 5; 99050; d0020 protein, Qc-SNARE, 91067 000606 SYP7 1 71 19540 Au9.Cr 0.g359 SYP7- SYP7-family 0m 2 6 e02.g0 .t1 family; 98950 TSOC_00 Qc- 5813- SNARE RA/TSOC protein, _010988- SYP7- RA family Qa- SNARE scaffol Au9.Cr Vocar2 protein, Syp 19541 d0002 SYP8 SYP8 e17.g7 96826 001171 Ufe1/Sy 8 1 4.g207 11450 8m ntaxin .t1 TSOC_00 18 7239-RA family Qc- scaffol Au9.Cr Vocar2 SNARE Qc-SNARE, SYP6 SYP6 Syp 16002 d0002 e06.g2 55748 000934 SYP6- Tlg1/Syntaxi L L 6L 5 1.g597 93100 1m TSOC_01 like n 6-family .t1 3462-RA protein Qa- scaffol SNARE Au9.Cr Vocar2 Syp 40624 d0000 protein, SYP2 SYP2 e01.g0 86761 000998 2 7 7.g130 Pep12/S 10700 6m 6.t1 TSOC_00 yntaxin 9496-RA 7-family R- scaffol VAM5 VAM5 Au9.Cr Vocar2 SNARE Vam 19540 d0022 /VAM /VAM e13.g5 54976 000299 protein, p75 9 0.g470 P75 P75 96900 8m TSOC_00 VAMP71 .t1 6464-RA -family R- scaffol Au9.Cr Vocar2 SNARE SEC2 SEC2 Sec d0002 R-SNARE, 57076 e12.g5 80612 000652 protein, 2 2 22 1.g733 Sec22-family 54700 0m TSOC_01 Sec22- .t1 2066-RA family R- scaffol Au9.Cr SNARE 19546 d0000 Vocar2 YKT6 YKT6 Ykt6 e17.g7 75062 protein, 5 6.g510 001313 28150 TSOC_00 YKT6- .t1 6m 6634-RA family R- scaffol VAM1 VAM1 Au9.Cr Vocar2 SNARE Vam 12877 d0001 /VAM /VAM e04.g2 70315 001182 protein, p71 7 6.g606 P71 P71 24750 3m TSOC_00 VAMP72 .t1 7676-RA -family R- scaffol VAM2 VAM2 Au9.Cr Vocar2 SNARE Vam 19540 d0001 /VAM /VAM e04.g2 70298 001180 protein, p72 3 6.g625 P72 P72 25900 2m VAMP72 .t1 -family R- scaffol VAM3 VAM3 Au9.Cr Vocar2 SNARE Vam 19540 d0001 /VAM /VAM e04.g2 78158 001183 protein, p73 4 6.g606 P73 P73 25850 6m TSOC_00 VAMP72 .t1 7676-RA -family R- VAM4 VAM4 Au9.Cr SNARE 13618 /VAM /VAM e04.g2 protein, 8 P74 P74 24800 VAMP72 -family Qc- scaffol SNARE Au9.Cr Vocar2 Qc-SNARE, 18390 d0000 protein, BET1 BET1 Bet1 e03.g1 120875 000540 Bet1/mBET1 4 1.g916 Bet1/mB 98100 0m family .t1 TSOC_00 ET1 9557-RA family Bet3 scaffol Au9.Cr Vocar2 .1; 18461 d0004 BET3 BET3 e01.g0 76384; 76381 000770 Bet3 3 0.g592 59600 2m .2 .t1 R- scaffol SNARE Au9.Cr Vocar2 R-SNARE, SNR7 SNR7 Tml 19546 d0007 protein, e12.g5 86534 000899 Tomsyn-like /TML1 /TML1 1 8 5.g772 Tomsyn- 58250 9m family .t1 TSOC_00 like 0104-RA family Qb+c- SNARE, SNAP25 scaffol Au9.Cr Vocar2 -family; SNAP SNAP Sna 19540 d0006 e02.g1 104786 000589 Qb+c- 34 34 p34 8 2.g897 00250 1m SNARE .t1 protein, TSOC_00 SNAP25 1539-RA -family Qb- scaffol SNARE Au9.Cr Vocar2 Qb-SNARE, MEM MEM Me 19546 d0002 protein, e12.g5 104271 000639 Bos1/Membri B1 B1 mb1 2 1.g740 Bos1/Me 54200 9m n family .t1 TSOC_00 mbrin 4534-RA family scaffol Au9.Cr Vocar2 SNAP SNAP Sna 17371 d0008 gamma e08.g3 95241 000473 G1 G1 pG1 0 7.g421 TSOC_00 SNAP 85700 1m .t1 8997-RA scaffol Au9.Cr Vocar2 SNAP SNAP Sna d0059 alpha- 23974 e02.g0 81196 000411 A1 A1 pA1 7.g669 TSOC_00 SNAP 99150 7m .t1 8196-RA Protein involved in ubiquitin scaffol Au9.Cr Vocar2 - CDC4 CDC4 Cdc 13417 d0002 e06.g2 78972 000291 depende 8 8 48 1 7.g630 69950 7m nt .t1 degradat ion of TSOC_00 ER- 7546-RA bound substrat es

Au9.Cr scaffol Vocar2 VPS4 VPS4 Vps 19547 e07.g3 d0007 77223 001394 TSOC_00 5 5 45 1 33950 7.g4.t1 8m 7955-RA scaffol Au9.Cr Vocar2 VPS3 VPS3 Vps 19547 d0009 e12.g5 103690 001427 3 3 33 0 0.g527 TSOC_00 15550 6m .t1 4500-RA Actin cytos kelet on scaffol d0000 MYO1 9.g540 myosin type XI Au9.Cr Vocar2 A; Myo 18510 .t1; TSOC_01 heavy myosin MYO1 e16.g6 82838 001052 MYO1 A 4 scaffol 0099- chain, heavy chain 58650 2m B d0000 RA/TSOC class XI MyoA 9.g542 _012467- .t1 RA scaffol myosin type XI Au9.Cr Vocar2 Myo 11931 d0001 heavy myosin MYO2 MYO2 e13.g5 57247 001474 B 7 3.g634 chain, heavy chain 63800 0m .t1 TSOC_00 class XI MyoB 2407-RA myosin scaffol type VIII Au9.Cr Vocar2 heavy Myo 11376 d0005 myosin MYO3 MYO3 e09.g4 107085 000551 chain, C 0 5.g312 heavy chain 16250 6m class .t1 TSOC_00 MyoC 8396-RA VIII scaffol class XI Vocar2 Myo d0000 myosin 103833 000100 D 9.g540 heavy chain 0m .t1 TSOC_01 MyoD 2467-RA scaffol myosin putative Au9.Cr Vocar2 Myo 37786 d0000 heavy class XI MYO5 MYO5 e03.g1 90836 000540 E 3 1.g921 chain, myosin 97800 9m .t1 TSOC_01 class XI MyoE 5383-RA fgenes h4_pg. Myo type XI 95387 C_scaf F myosin MyoF fold_43 000114 scaffol Au9.Cr Vocar2 Act d0009 IDA5 IDA5 24392 e13.g6 109972 001110 actin A 7.g773 TSOC_01 03700 1m .t1 2574-RA scaffol d0000 1.g510 NAP1 Nap .t1; Vocar2 novel actin- NAP1 ; 127374 1 scaffol 000810 like protein NAP2 d0009 4m 7.g773 TSOC_01 .t1 2574-RA scaffol Aug.Cr Vocar2 Actin- Arp d0000 actin-related ARP2 ARP2 24114 e16.g6 107669 000157 related 2 4.g946 TSOC_01 protein Arp2 87500 6m protein .t1 3115-RA scaffol Au9.Cr Vocar2 Actin- Arp 11874 d0009 actin-related ARP3 ARP3 e16.g6 101864 000022 related 3 5 7.g773 TSOC_01 protein Arp3 76050 6m protein .t1 2574-RA Au9.Cr scaffol Vocar2 Actin- Arp 11237 actin-related ARP5 ARP5 e06.g2 d0000 69220 000348 TSOC_00 related 4 8 protein Arp4 49300 5.g126 2m 8038-RA protein .t1

scaffol Vocar2 Actin- Arp d0000 actin-related 121135 001101 related 6 2.g104 protein Arp6 5m protein 3.t1 scaffol Au9.Cr Vocar2 Actin- Arp 10306 d0000 actin-related ARP7 ARP7 e12.g5 94505 000641 related 7 7 7.g119 TSOC_00 protein Arp7 45000 6m protein 0.t1 1376-RA ARP2/3 scaffol Au9.Cr Vocar2 (actin- actin-related ARPC ARPC Arp 17720 d0000 e16.g6 58274 000942 related protein 1 1 C1 3 3.g208 48100 6m TSOC_00 protein) ArpC1 .t1 8086-RA complex scaffol Vocar2 Actin- actin-related ARPC ARPC Arp 51833 d0000 119936 000397 related protein 2 2 C2 2 9.g713 TSOC_00 2m protein ArpC2 .t1 6552-RA scaffol Vocar2 Actin- actin-related ARPC ARPC Arp 39956 d0001 65632 000456 related protein 3 3 C3 3 8.g105 TSOC_00 3m protein ArpC3 .t1 3210-RA scaffol Au9.Cr Vocar2 Actin- actin-related ARPC ARPC Arp 40084 d0001 e06.g2 72447 000245 related protein 4 4 C4 8 0.g757 TSOC_01 60800 8m protein ArpC4 .t1 5482-RA

putative scaffol Au9.Cr Vocar2 protein For 14402 d0000 FOR1 FOR1 e03.g1 107471 000684 formin containing a A 7 1.g640 66700 4m FH2 domain; .t1 TSOC_00 formin 0371-RA scaffol actin- actin- Au9.Cr Vocar2 Adf 37887 d0000 depolym depolymerizi e07.g3 68164 000153 A 7 4.g845 erizing ng factor 39050 7m .t1 TSOC_00 factor AdfA 7706-RA scaffol Au9.Cr Vocar2 actin-binding gelsoli gelsoli gels 19092 d0005 e12.g5 108308 000867 gelsolin protein n n olin 5 6.g375 TSOC_01 55600 5m Gelsolin .t1 5472-RA scaffol Au9.Cr Vocar2 18188 d0000 PRO1 PRO1 PrfA e10.g4 106260 001158 profilin profilin 7 1.g67.t TSOC_00 27250 8m 1 4650-RA actin actin scaffol monome Au9.Cr Vocar2 monomer- Cap 40205 d0027 r-binding CAP1 CAP1 e06.g3 79641 000221 binding A 6 5.g721 protein 04100 0m protein .t1 Cap/Srv TSOC_00 Cap/Srv2p 1558-RA 2p scaffol TSOC_01 f-actin- Au9.Cr Vocar2 20621 d0001 3691- binding, villin-like VIL1 VIL1 VilA e12.g4 97468 001190 5 1.g292 RA/TSOC Viilin-like protein 93700 3m .t1 _009069- protein RA Micro tubul e cytos kelet on scaffol TSOC_00 Au9.Cr d0010 Vocar2 3975- Tub 12852 e03.g1 9.g204 000340 RA/TSOC alpha TUA1; TUA1; A1; 3; 90950; .t1; 1m; tubulin alpha tubulin 77526; 109786 _003980- TUA2 TUA2 Tub 18602 Au9.Cr scaffol Vocar2 RA/TSOC 1; alpha (tubA2) A2 3 e04.g2 d0017 000737 _003982- tubulin 2 16850 5.g216 7m RA/TSOC .t1 _003984- RA scaffol Au9.Cr d0025 Vocar2 Tub 12987 e12.g5 4.g636 000658 beta TUB1; TUB1; B1; 6; 42250; .t1; 0m; TSOC_01 tubulin beta tubulin 75910; 77081 TUB2 TUB2 Tub 12986 Au9.Cr scaffol Vocar2 0371- 1; beta (tubB2) B2 8 e12.g5 d0000 000654 RA/TSOC tubulin 2 49550 7.g111 6m _011286- 6.t1 RA scaffol Gamma Au9.Cr Vocar2 Tub 18893 d0002 tubulin; Gamma TUG1 TUG1 e06.g2 80553 000922 G 3 2.g770 TSOC_01 was tubulin 99300 7m .t1 0899-RA TUG scaffol Au9.Cr Vocar2 Tub 13608 d0000 delta UNI3 UNI3 e03.g1 60395 000566 Delta tubulin D 2 1.g844 TSOC_00 tubulin 87350 2m .t1 8734-RA scaffol Eta Au9.Cr Vocar2 Tub 15437 d0003 tubulin; TUH1 TUH1 e12.g5 116596 001430 Eta tubulin H 6 5.g836 TSOC_00 was 13450 2m .t1 8569-RA TUH scaffol Epsilon Au9.Cr Vocar2 Tub 18819 d0000 tubulin; Epsilon TUE1 TUE1 e03.g1 56250 000789 E 5 1.g530 TSOC_00 was tubulin 72650 4m .t1 7487-RA TUE scafffol d0030 Vocar2 cent 5.g832 000616 CEN1 Au9.Cr VFL2/ rin; 15955 .t1; 6m; putative ; e11.g4 109845; 66997 CEN1 Cnr 4 scaffol Vocar2 centrin CEN2 68450 A d0030 001186 5.g832 8m TSOC_00 .t1 7964-RA scaffol katanin katanin Au9.Cr Vocar2 Kat d0000 catalytic catalytic KAT1 KAT1 53314 e10.g4 82654 001129 A 1.g65.t subunit, subunit, 60 27600 0m 1 TSOC_00 60 kDa kDa 4648-RA katanin scaffol p60 Au9.Cr Vocar2 katanin p60 Kat d0003 catalytic VPS4 VPS4 98650 e10.g4 76999 001381 catalytic B 4.g691 subunit; 46400 4m subunit .t1 TSOC_00 was 4336-RA KAT2

microtubule scaffol Au9.Cr Vocar2 severing Kat d0000 PF15 PF15 80954 e03.g1 94919 001134 protein C 1.g457 60450 3m katanin, p80 .t1 TSOC_00 subunit 4336-RA scaffol microtubule Vocar2 Map d0074 associated 99065 000391 1 9.g916 protein 9m .t1 TSOC_00 (MAP) 1a 3008-RA Microtub scaffol Au9.Cr Vocar2 ule microtubule Mor 17514 d0004 TOG1 TOG1 e03.g1 121331 000757 associat organizing A 3 0.g529 49800 5m TSOC_00 ed protein MorA .t1 0455-RA protein scaffol microtubule- Au9.Cr Vocar2 MAP6 MAP6 map 39446 d0008 associated e14.g6 120144 001443 5 5 65 2 20.g49 protein 14050 8m .t1 TSOC_00 MAP65 1728-RA microtub scaffol microtubule Au9.Cr Vocar2 ule plus- EBP1/ EBP1/ 19420 d0005 plus-end EB1 e17.g7 107043 001003 end EB1 EB1 9 3.g116 binding 41200 6m binding .t1 TSOC_01 protein EB1 1100-RA protein scaffol CLIP- Au9.Cr Vocar2 CLIP- CLAS CLAS Clas 20620 d0001 associati e09.g4 120833 000067 associating P P p 6 2.g410 TSOC_00 ng 15700 2m protein .t1 1489-RA protein

putative scaffol cortical Au9.Cr Vocar2 Spr 41618 d0006 microtubule SPR1 SPR1 e02.g1 103536 000509 1 8 2.g875 associated 30150 4m .t1 protein TSOC_00 SPIRAL1 2968-RA Gamma scaffol gamma Au9.Cr Vocar2 tubulin Gcp 15071 d0000 tubulin GCP2 GCP2 e12.g5 105911 000386 interacti 2 2 8.g158 interacting 25500 6m ng .t1 TSOC_00 protein 9859-RA protein Gamma scaffol gamma Au9.Cr Vocar2 tubulin Gcp 15258 d0001 tubulin GCP3 GCP3 e22.g7 97867 001469 interacti 3 5 8.g57.t interacting 64400 5m ng 1 TSOC_00 protein 7776-RA protein Gamma scaffol gamma Au9.Cr Vocar2 tubulin Gcp 14632 d0002 tubulin GCP4 GCP4 e01.g0 118030 001017 interacti 4 3 5.g308 interacting 19150 5m ng .t1 TSOC_01 protein 4290-RA protein

putative putative MT scaffol MT associated Au9.Cr Vocar2 Pld 19040 d0001 associat signaling PLD1 PLD1 e13.g5 99461 001346 A 3 7.g787 ed protein 91900 9m .t1 signaling phospholipas TSOC_00 protein e D 2024-RA Cytoplas mic Cytoplasmic scaffol dynein dynein 1b Au9.Cr Vocar2 D1BLI D1b 13039 d0002 1b light light e02.g1 92778 001239 C LIC 4 7.g622 intermed intermediate 35900 6m .t1 iate chain, TSOC_00 chain, D1bLIC 6086-RA D1bLIC

TSOC_00 3946- scaffol DHC1 Au9.Cr Vocar2 RA/TSOC Cytoplasmic DHC1 DH d0004 2/DH 24009 e06.g2 64869 000353 _005462- dynein 1b B C1b 9.g493 C1B 50300 4m RA/TSOC heavy chain .t1 _009812- RA/TSOC _010820- RA scaffol Au9.Cr Vocar2 abnorma microtubule- 30682 d0006 ASP ASP Asp e06.g2 121726 000250 l spindle associated 8 1.g829 TSOC_00 81150 4m protein protein Asp .t1 3050-RA scaffol Au9.Cr Vocar2 40205 d0027 e06.g3 79641 000221 6 5.g721 TSOC_00 04100 0m .t1 1558-RA scaffol Au9.Cr Tubulin tubulin 40179 d0002 e_gw1. TTL1 TtlA e06.g3 59059 tyrosine tyrosine 0 2.g769 13.31.1 Manual 00250 ligase ligase .t1 Build scaffol TTL2/ TTL2/ Au9.Cr Vocar2 Tubulin tubulin 10076 d0001 FAP2 FAP2 TtlB e17.g6 30444 001082 tyrosine tyrosine 0 4.g277 TSOC_01 67 67 99500 0m ligase ligase .t1 5101-RA scaffol Au9.Cr Vocar2 Tubulin 11925 d0004 TTL3 TTL3 TtlC e01.g0 65051 000743 tyrosine 0 0.g581 TSOC_00 59200 9m ligase .t1 1227-RA scaffol Vocar2 d0003 TTL4 TtlD 105441 001231 1.g393 TSOC_00 1m .t1 0326-RA scaffol Au9.Cr Vocar2 Tubulin 51385 d0000 TTL5 TTL5 TtlE e12.g5 107180 000634 tyrosine 2 7.g114 TSOC_00 47700 4m ligase 1.t1 0073-RA scaffol Au9.Cr Tubulin 14689 d0000 e_gw1. TTL6 TtlF e01.g0 69077 tyrosine 3 3.g176 86.15.1 TSOC_01 50451 ligase .t1 1047-RA scaffol CYG4 CYG4 Au9.Cr Vocar2 Tubulin 19039 d0001 0/TTL 0/TTL TtlG e13.g5 99465 001348 tyrosine 8 7.g783 TSOC_00 7 7 91700 6m ligase .t1 2020-RA scaffol Au9.Cr Vocar2 Tubulin 41726 d0005 TTL8 TTL8 TtlH e02.g1 108470 001438 tyrosine 7 8.g590 TSOC_00 20750 2m ligase .t1 2685-RA Basal Body

Protei ns Bardet- scaffol Au9.Cr Vocar2 Biedl bbs 18229 d0003 Bardet-Biedl BBS5 BBS5 e06.g2 78967 000291 syndrom 5 9 0.g279 syndrome 5 67550 0m TSOC_00 e 5 .t1 7731-RA protein scaffol Au9.Cr Vocar2 SF- 12799 d0005 SF- SFA SFA SFA e07.g3 84701 000499 assembli 5 4.g252 TSOC_00 assemblin 32950 8m n .t1 6732-RA TRP scaffol putative TRP Au9.Cr Vocar2 protein bbs 14011 d0013 protein for BBS8 BBS8 e16.g6 78311 000678 for 8 3 0.g573 flaggelar 66500 3m ciliary .t1 TSOC_00 function 5276-RA function Bardet- scaffol Au9.Cr Biedl 19005 d0000 BBS7 e01.g0 syndrom 4 3.g298 43750 TSOC_00 e 7 .t1 4733-RA protein Bardet scaffol Au9.Cr Vocar2 Biedl Bardet-Biedl Bbs 12994 d0000 BBS4 BBS4 e12.g5 66874 000673 syndrom syndrome 4 8 7.g112 48650 3m TSOC_01 e 4 protein 4 6.t1 5012-RA protein Bardet- scaffol Au9.Cr Vocar2 Biedl small Arf- BBS3 BBS3 d0000 arf8 24475 e16.g6 70270 001022 syndrom related B B 9.g460 64500 4m TSOC_00 e protein GTPase .t1 1462-RA 3B Bardet- Au9.Cr Biedl Bardet-Biedl Bbs 12675 BBS2 e06.g2 121664 syndrom syndrome 2 8 57250 TSOC_01 e protein protein 2 2520-RA 2 scaffol Bardet- Au9.Cr Vocar2 Bardet-Biedl Bbs 13253 d0005 Biedl BBS1 BBS1 e17.g7 84205 001255 syndrome 1 7 3.g86.t TSOC_00 syndrom 41950 4m protein 1 1 8472-RA e protein 1

Au9.Cr Vocar2 basal Ofd basal body OFD1 31640 e17.g7 90133 001063 TSOC_00 body 1 protein 03600 4m 9165-RA protein protein required protein scaffol for conserved Au9.Cr Vocar2 13054 d0006 template only in VFL3 VFL3 Vfl3 e06.g2 84510 000252 2 1.g821 d organisms 79900 2m .t1 centriole with motile TSOC_00 assembl cilia 3059-RA y scaffol Au9.Cr Vocar2 basal BLD1 BLD1 16606 d0003 basal body bldJ e10.g4 57967 000103 body 0 0 2 6.g112 Manual protein 18250 4m protein .t1 Build Bardet- scaffol Biedl d0030 syndrom 0.g813 BBS9; Vocar2 e 9; bbs 10113 .t1; Bardet-Biedl BBS9 BBS1 119731 000189 Bardet- 9 7 scaffol syndrome 9 0 2m Biedl d0030 syndrom 0.g814 TSOC_01 e 9 .t1 0849-RA protein Kines in Motor Protei ns kinesin-like 93532 protein scaffol Au9.Cr Vocar2 Kinesin-II FLA1 FLA1 18575 d0005 flaJ e17.g7 107307 000500 Motor 0 0 0 5.g287 TSOC_00 30950 7m Protein .t1 7372-RA scaffol Au9.Cr Vocar2 Kinesin- flaH/ 15076 d0000 kinesin-like FLA8 FLA8 e12.g5 103736 001418 II motor klpA 6 2.g130 protein 22550 6m subunit 6.t1 scaffol Au9.Cr Vocar2 34535 d0001 putative e09.g4 97023 000565 4 5.g280 TSOC_01 Kif3C kinesin 15450 3m .t1 0691-RA

Kif9 type scaffol Au9.Cr Vocar2 TSOC_00 Kinesin- kinesin, 18641 d0000 KLP1 KLP1 e02.g0 58470 001367 3895- like similar to C. 4 3.g471 73750 7m RA/TSOC protein reinhardtii .t1 _003893- KLP1 RA 94697 scaffol Au9.Cr Vocar2 Kif6 type d0000 e01.g0 64266 000833 kinesin-like 5.g430 TSOC_01 36800 3m protein .t1 4361-RA estExt_ Genew putative Kif9 76107 ise1.C kinesin _40003 9 HAL_e scaffol stExt_f Au9.Cr kinesin 12608 d0003 genesh IAR1 IAR1 invA e10.g4 127192 family kinesin invA 1 6.g129 5_synt. 18950 protein .t1 C_901 Manual 08 Build kinesin-like 127420 protein kinesin Au9.Cr 14971 motor kinesin-like e09.g3 127421 3 TSOC_00 family protein 86700 4599-RA protein Kinesin Au9.Cr family kinesin-like 13743 e03.g2 127422 member protein 02000 TSOC_00 heavy 1315-RA chain kinesin-like 127423 protein kinesin-like 105173 protein subfamily 99650 14A kinesin kinesin-like 95520 protein scaffol Vocar2 d0007 kinesin-like 127417 001094 6.g809 TSOC_00 protein 1m .t1 4316-RA scaffol kinesin Vocar2 d0004 related to 95391 001081 0.g569 Arabidopsis 3m .t1 TSOC_01 ATK5 0103-RA scaffol Kar3 Vocar2 d0000 member 127355 000243 9.g588 kinesin-like 4m .t1 TSOC_01 protein 1162-RA kinesin-like 127424 protein kinesin-like 93886 protein kinesin-like 127418 protein scaffol Vocar2 d0002 kinesin-like 106268 001169 8.g750 TSOC_00 protein 5m .t1 8189-RA scaffol Au9.Cr Vocar2 kinesin- similar to FAP1 FAP1 19081 d0003 e12.g5 94482 000675 like kinesin 25 25 6 4.g710 TSOC_01 46100 5m protein FAP125 .t1 1506-RA

scaffol appears to Vocar2 d0012 contain 93168 001411 2.g462 kinesin motor 5m .t1 TSOC_00 domain 3022-RA kinesin-like 96903 protein fgenes h4_pg. putative 98132 C_scaf Kif21-type fold_68 TSOC_00 kinesin 000053 0006-RA STM_f genesh 4_pg.C kinesin-like 127414 _scaffo protein ld_110 TSOC_00 00145 3249-RA scaffol TSOC_00 Au9.Cr Vocar2 40875 d0000 3800- e17.g7 31481 000049 9 8.g90.t RA/TSOC 35200 8m 1 _014900- RA 127427 scaffol Vocar2 d0000 Kif4A type 44127 001298 1.g269 kinesin 9m .t1 STM_f genesh 4_pg.C kinesin-like 127415 _scaffo protein ld_160 00212 Cell cycle Putative rhodane scaffol se RDP1 RPD1 rdp1 d0000 98123 Au9.Cr domain 4.g624 e05.g2 TSOC_00 phospha 47400 5421-RA tase Putative rhodane scaffol Vocar2 se RDP2 RDP2 rpd2 d0002 000671 Cre01. domain 1.g722 8m g0248 TSOC_00 phospha 50 4527-RA tase Putative rhodane scaffol se RDP3 RPD3 rdp3 d0000 120993 Cre07. domain 6.g640 g3525 TSOC_00 phospha 50 0732-RA tase scaffol Au9.Cr Vocar2 cyclin CDKA CDKA cdk 12728 d0004 e10.g4 127504 001508 dependent 1 1 a1 5 7.g387 TSOC_00 65900 5m kinase .t1 7286-RA

scaffol plant specific Au9.Cr Vocar2 CDKB CDKB cdk d0007 cyclin 59842 e08.g3 103386 001354 1 1 b1 9.g124 dependent 72550 5m .t1 TSOC_00 kinase 6010-RA scaffol Au9.Cr Vocar2 cyclin CDKC CDKC cdkc 14839 d0008 e08.g3 82776 000448 dependent 1 1 1 5 7.g425 TSOC_00 85850 8m kinase .t1 9000-RA scaffol Au9.Cr Vocar2 CDKD CDKD cdk 13745 d0000 e09.g3 65162 000357 1 1 d1 7 5.g340 TSOC_00 88000 5m .t1 9378-RA scaffol Au9.Cr Vocar2 cyclin CDKE CDKE cdk 12088 d0005 e04.g2 68336 000207 dependent 1 1 e1 1 4.g202 13850 4m kinase .t1 scaffol cyclin Cyclin Au9.Cr Vocar2 CDK CDK cdk 12677 d0001 depende dependent e06.g2 127266 000275 G1 G1 g1 6 0.g105 nt kinase; G1 71100 4m 8.t1 TSOC_00 kinase subfamily 4551-RA cyclin Au9.Cr cyclin CDK CDK cdk 13990 depende e17.g7 127318 dependent G2 G2 g2 8 nt 42250 kinase. kinase scaffol cyclin Au9.Cr Vocar2 cyclin CDKH CDKH cdk 15397 d0000 depende e07.g3 83876 000684 dependent 1 1 h1 0 6.g738 nt 55400 8m kinase .t1 kinase scaffol cyclin Au9.Cr Vocar2 cyclin CDKI CDKI cdki 19578 d0001 depende e12.g4 119542 001324 dependent 1 1 1/lf2 1 9.g245 TSOC_00 nt 94500 3m kinase .t1 7933-RA kinase scaffol d0000 CYCA 1.g929 Au9.Cr Vocar2 CYCA 1; cyca 14745 .t1; A-type e03.g2 127267 000570 A type cyclin. 1 CYCA 1 3 scaffol cyclin 07900 4m 2 d0090 3.g158 TSOC_00 .t1 9562-RA scaffol Au9.Cr Vocar2 CYCB CYCB cycb 20611 d0000 B-type B type e06.g2 127276 000376 1 1 1 5 8.g51.t TSOC_01 cyclin mitotic cyclin 84350 8m 1 2370-RA scaffol d0004 7.g388 CYCA .t1; B1; scaffol Vocar2 Related to A CYCA CYCA cyca 20611 d0004 127293 001502 and B type B1 B2; b1 2 7.g389 7m cyclins. CYCA .t1; B3 scaffol d0004 7.g390 TSOC_01 .t1 3800-RA scaffol Vocar2 CYCC CYCC cycc d0000 127505 000627 C type cyclin 1 1 1 1.g491 TSOC_01 9m .t1 0052-RA scaffol Au9.Cr Vocar2 CYCD CYCD cycd 19578 d0004 D-type e11.g4 127281 001006 D type cyclin. 1 1.1 1.1 0 7.g299 TSOC_00 cyclin 67772 3m .t1 8914-RA scaffol Vocar2 CYCD cycd d0004 D-type 127284 001006 D type cyclin 1.2 1.2 7.g300 TSOC_00 cyclin 7m .t1 8919-RA scaffol Vocar2 CYCD cycd d0004 D-type 127282 001012 D type cyclin 1.3 1.3 7.g301 TSOC_00 cyclin 7m .t1 6929-RA scaffol Vocar2 CYCD cycd d0010 D-type 127283 001318 D type cyclin 1.4 1.4 0.g16.t cyclin 8m 1 scaffol Au9.Cr Vocar2 CYCD CYCD cycd 19176 d0004 D-type e06.g2 127277 001343 D type cyclin 2 2 2 2 1.g668 TSOC_00 cyclin 89750 7m .t1 5199-RA scaffol Au9.Cr Vocar2 CYCD CYCD cycd 20611 d0004 D-type e06.g2 127287 000742 D type cyclin 3 3 3 0 4.g48.t TSOC_00 cyclin 84350 2m 1 5193-RA scaffol Au9.Cr Vocar2 CYCD CYCD cycd 20616 d0004 D-type e09.g4 127321 000714 D type cyclin 4 4 4 6 1.g668 TSOC_00 cyclin 14416 5m .t1 8922-RA scaffol CYCD d0001 D-type D-type cyclin 5 1.g188 cyclin .t1 Aug9. scaffol Vocar2 CYCL CYCL cycl Cre07. d0000 120598 000317 L type cyclin 1 1 1 g3216 1.g226 TSOC_01 8m 50 .t1 0026-RA scaffol Au9.Cr Vocar2 CYC CYC cyc d0066 e10.g4 107638 001048 cyclin M1 M1 m1 9.g798 TSOC_00 53050 0m .t1 9098-RA conserv ed express scaffol Au9.Cr Vocar2 ed CYCU CYCU cycu d0001 e02.g1 127509 000515 protein cyclin 1 1 1 5.g335 18050 7m of .t1 unknow n function scaffol d0045 3.g353 Au9.Cr Vocar2 CYCT CYCT cyct 19346 .t1; TSOC_00 e14.g6 120142 001439 1 1 1 1 scaffol 1716- 13900 9m d0045 RA/TSOC 3.g354 _001717- .t1 RA scaffol Au9.Cr Vocar2 CDK wee 19458 d0000 wee1 kinase WEE1 WEE1 e07.g3 127274 000684 inhibitor 1 9 6.g736 TSOC_00 ortholog 55250 5m y kinase .t1 3531-RA scaffol d0001 7.g809 CKS1 Au9.Cr Vocar2 18277 .t1; CKS1 CKS1 ; cks1 e03.g1 127315 000802 9 scaffol homolog CKS2 80350 6m d0000 6.g542 TSOC_00 .t1 7834-RA Minus Au9.Cr Vocar2 retinobla mat 18724 _MT.g MAT3 MAT3 e06.g2 127376 000352 stoma 3 8 1278.t TSOC_01 55450 6m protein 1 2621-RA scaffol E2F Au9.Cr Vocar2 d0000 transcription E2F1 E2F1 e2f1 e01.g0 127253 000930 3.g108 factor family 52300 8m .t1 TSOC_00 homolog. 8865-RA scaffol Vocar2 putative DP Au9.Cr d0000 DP1 DP1 dp1 121369 001274 transcription e07.g3 1.g256 TSOC_00 2m factor 23000 .t1 4326-RA related to E2F and DP transcrip tion factors, scaffol related to Au9.Cr Vocar2 Chlamyd E2FR E2FR e2fr 16856 d0002 E2F and DP e13.g5 127270 000983 omonas 1 1 1 3 2.g868 transcription 73000 9m specific; .t1 factors transcrip tion factor, E2F and DP- related Cell wall protei ns Hydroxy proline- rich Cre09. glycopro GP1 g3989 tein 00 compon ent of the outer cell wall Hydroxy proline- rich Cre06. scaffol glycopro gb_EFJ52729. GP2 GP2 gp2 g2588 d0001 tein 1 00 0.g752 compon ent of the outer cell wall Hydroxy proline- rich gb_CA scaffol glycopro GP3 GP3 gp3 J9872 d0002 105161 tein 1.1 6.g494 compon ent of TSOC_00 the outer 4978-RA cell wall