Evaluating patterns of selection in reproductive and digestive protein genes of seed beetles.

A comparative approach.

Konstantinos Papachristos

Degree project in biology, Master of science (2 years), 2021 Examensarbete i biologi 60 hp till masterexamen, 2021 Biology Education Centre and Department of Ecology and Genetics, Uppsala University Supervisor: Göran Arnqvist External opponent: Philipp Kaufmann, Sanne Cornelia Everling

Abstract

Seminal fluid proteins (SFPs) have been shown to affect the physiology, behaviour and immune responses of mated females in some species. This opportunity to manipulate female fitness allows for complex evolutionary dynamics between the SFPs and the female proteins that would counter their effects, the female reproductive proteins (FRPs). In addition, the bean beetles of the subfamily Bruchinae are pests of their preferred species of host plants. The hosts vary greatly in their secondary defensive metabolites, and to detoxify those compounds each species is expected to have an arsenal of digestive proteins that is well adapted to a specific host. I carried out a comparative study of four species of bean beetles with the aim of identifying patterns of selection in these proteins. Expression data for one of those species, Callosobruchus maculatus, made it possible to identify its SFPs, FRPs and digestive proteins, and with orthology inference I identified their orthologues in the other three species. I then estimated the ratio of non-synonymous to synonymous substitution rates (ω) for each protein using codeML of the PAML package and used it as a proxy for selection. FRPs had about the same ω values as conserved genes found across the phylum, while the SFPs and digestive proteins had higher ω values, indicating more relaxed purifying selection. I also performed tests of positive selection and identified 92 digestive proteins, 9 FRPs and 26 SFPs as potential targets for future functional work. Finally, I examined the scenario of co-evolution between SFPs and FRPs due to direct interaction. By correlating branch-specific ω values for each possible pair of proteins, I found that SFPs are on average more strongly associated with FRPs than with digestive or conserved genes, as expected. The same was true for the FRPs. I also examined factors that could contribute to the association, such as expression levels, sex-biased expression and protein function. Using linear regression models, I found that expression levels and protein function do predict the ω estimates to some degree and could thus also affect the correlations examined. High gene expression levels reduce the overall ω values of genes, a pattern known as the E-R anticorrelation. Sex-biased expression does not affect the overall ω values, but it does affect the intensity of the E-R anticorrelation, which is less prominent in male-biased genes and more prominent in female-biased genes.


To Katerina and my family. Thank you for all your love and support.

Contents

1 Introduction
  1.1 The species
  1.2 The digestive proteins
  1.3 The reproductive proteins
  1.4 Identifying orthologues
    1.4.1 Homology
    1.4.2 Orthology and Paralogy
    1.4.3 Orthogroups
  1.5 Identifying selection
    1.5.1 dN and dS ratios
    1.5.2 McDonald–Kreitman test
    1.5.3 Tajima’s D
  1.6 Identifying co-evolution
2 Methods
3 Results
  3.1 Orthofinder
  3.2 phylopypruner
  3.3 Omega ratios
  3.4 Tajima’s D estimates and McDonald–Kreitman tests
  3.5 Linear models on omega values
4 Discussion
  4.1 Proteins under positive selection
  4.2 Distributions of omegas for sites
  4.3 Linear model of ω values
  4.4 Correlations of branch omegas and protein coevolution
  4.5 Limitations of the study
5 Conclusions
References

1. Introduction

“ Τὰ πάντα ῥεῖ καὶ οὐδὲν μένει.” Everything flows and nothing stays the same. Herakleitos of Ephesus

In his verse, Herakleitos sums up a true property of life: its only constant is change. Tectonic plates move constantly, ocean and air currents change direction, mountains rise, glaciers form and melt, all perpetually shaping the conditions in which organisms have to survive. Over the thousands to millions of years during which these changes take place, hardships to deal with and opportunities to seize are created for all life. So life must evolve in order to keep up with a constantly changing world.

In this work I am mainly interested in one of the mechanisms of evolution, selection, and the patterns that it leaves behind when acting on proteins. Specifically, the proteins examined belong to four species of beetles that infest beans, which will be introduced in the next sections. The proteins can be split into three sets: proteins that are expressed in the male reproductive tract of the beetles and are characterised as seminal fluid proteins (SFPs); proteins whose expression in females changes after mating, the female reproductive proteins (FRPs); and proteins that are expressed in the gastrointestinal tract of the beetles, the digestive enzymes. In a later section I will explain why these sets of genes were chosen. I will also examine the possibility of co-evolution between the SFPs and the FRPs; why this would be expected and how it will be tested is discussed when introducing the SFPs and FRPs. Finally, I will introduce the topics of homology and orthology, as they are vital when doing comparative work between species, and I will describe the theory behind the methods used to identify and quantify selection.

1.1 The species

Three of the beetles for this work belong to the genus Callosobruchus: C. maculatus, C. analis and C. chinensis. The fourth species is Acanthoscelides obtectus. They all belong to the subfamily Bruchinae, which comprises the seed beetles. This subfamily was traditionally considered to be a separate family related to Chrysomelidae, the Bruchidae, but this view changed when the Bruchidae were found to have the Sagrinae, a subfamily of Chrysomelidae, as a sister group [1,2]. All species belong to the family Chrysomelidae, which includes seed and leaf beetles and is rich in diversity, as it is estimated to have more than 50,000 species [3]. The phylogenetic relationships among the seed beetles, according to molecular data [4,5], are shown in figure 1.1.

[Figure 1.1: species tree with tips Callosobruchus maculatus, Callosobruchus analis, Callosobruchus chinensis and Acanthoscelides obtectus.]

Figure 1.1. The species tree as understood with molecular data.

The four species of interest are all seed beetles of legumes with a widespread distribution around the globe [6] (see figure 1.2). They usually infest stores of crop products that are vital for the economy of developing countries and the nutrition of their people [7]. In Ethiopia, A. obtectus and another pest, Zabrotes subfasciatus, were responsible for damage to up to 38% of stored beans [8,9]. Other sources estimate the reduction of yield by A. obtectus to be 50-60% [10]. The destructive potential of these pests is therefore great, and there is an incentive to study the evolutionary mechanisms that made them suited to this lifestyle.

The native and preferred plant hosts for each beetle species are [Arnqvist G., pers. communication, January 25, 2021]:
• Black-eyed beans (Vigna unguiculata) for C. maculatus
• Mung beans (Vigna radiata) for C. analis
• Adzuki beans (Vigna angularis) for C. chinensis
• Common beans (Phaseolus vulgaris) for A. obtectus
They may infest other types of beans, but at some cost to their fitness under certain conditions [11].

The life cycles of these species are quite similar, so I will only describe that of C. maculatus. After mating, adult females lay their eggs on the beans. The eggs are oval, shiny and glued to the bean. The larva later hatches, penetrates the seed coat and burrows into the endosperm. The egg, still on the surface of the bean, then turns opaque white as it accumulates the faeces of the larva.
The larva feeds on the endosperm, moults several times and moves to a position underneath the seed coat. The next stage is pupation, and a "window" forms on the bean at the spot where it takes place. After pupation is complete, the larva has transformed into a winged adult, which emerges through the window and is sexually mature after 24-36 hours. The males seek to mate with the females immediately; once mated, the females store the sperm in their spermatheca and look for beans on which to oviposit. From the time of emergence from the bean, the adults do not need to consume food or water. The larval period lasts about 3-4 weeks under ideal humidity and temperature conditions. The adult lifetime is about 10-14 days.

[Figure 1.2: four panels labelled Acanthoscelides obtectus, Callosobruchus chinensis, Callosobruchus analis and Callosobruchus maculatus.]

Figure 1.2. Geographic distribution of the four beetle species studied, according to the Invasive Species Compendium.

It is clear that these species are of great importance as model organisms. Studying them is motivated by their negative impact on the agricultural economy, as can be seen from the numerous studies investigating pest control methods as alternatives to environmentally damaging insecticides [12, 13, 14, 15, 16]. Their short life cycle also allows for experimental evolution in the lab. This means that the evolution of traits, which might take thousands to millions of years in nature, can be observed and manipulated in the lab. With this approach, we can study how traits evolve and under which conditions. Traits being investigated in this way include lifespan [17, 18, 19] and host shifts [20, 21, 22] in Acanthoscelides obtectus, and thermal tolerance and related adaptations [23, 24], reproductive costs [25] and sexual conflict [26, 27, 28] in Callosobruchus maculatus.

1.2 The digestive proteins

As briefly mentioned, bean crops are very important for humans. They are cheap to grow and require little land area [7]. They can also fix atmospheric nitrogen, as they form symbiotic relationships with nitrogen-fixing rhizobia bacteria in their root nodules, and can thus be used in intercrops, or in rotation with other crops, to enrich the soil with nitrogen and enhance its overall fertility [7]. Legumes are a richer source of protein than other plant foods [29] and are associated with many health benefits, with a positive impact on various types of cancer, diabetes and cardiovascular diseases [30].

Given the high nutritional value of their seeds, it is not surprising that the Fabaceae defend them from predators by storing a plethora of chemical compounds in them [31]. Also, because of their nitrogen-fixing ability, more nitrogen-rich metabolites can be found in beans than in the seeds of other plants, serving a dual role: defence from predators and nitrogen storage. Such secondary metabolites are alkaloids, non-protein amino acids, cyanogenic glucosides and peptides; while some are abundant in all taxa of Fabaceae, others are present in only a few taxa [31]. For humans it is trivial to utilise this resource, as cooking thermally inactivates these defensive compounds [32], but for herbivores in nature it means that, in order to consume beans, they need a more sophisticated arsenal of digestive enzymes than when eating other plant foods.

Depending on the range of resources that they utilise, herbivores can be divided into specialists and generalists, with the former consuming only one source of food and the latter consuming many different types. Yet between the specialist and the generalist state there is a gradient of intermediate states, along which different taxa can be found.
The Bruchinae fall closer to the specialist category, as an estimated 80% of the species in the group are associated with one to three species of plants [5]. This could be explained by the difficulty of digesting beans, which requires a specialised set of digestive proteins for each bean type. The species studied here are not complete specialists either: they prefer their native host, but can infest other species of legumes too.

When looking at the Bruchini tribe and their association with Fabaceae, it is apparent that host shifts at the subfamily and tribe level of the host have occurred [5]. It is reasonable to assume that these host shift events would be accompanied by radical changes in the digestive weaponry of these beetles, as new compounds have to be detoxified with each contact with a new legume host. This makes the digestive enzymes of these beetles potential targets for positive selection, as changes in their amino acid sequences could modify them in a way that permits the detoxification of new secondary metabolites and increases the fitness of larvae on new hosts.

The molecular data for the digestive enzymes come from the work of Sayadi et al., 2019 [28]. The digestive enzymes are defined as proteins that are expressed in the gastrointestinal tract of the beetle species.

1.3 The reproductive proteins

In this work, I examine two types of reproductive proteins. The first comprises the seminal fluid proteins (SFPs), defined as proteins expressed in the male reproductive tract that are also transferred to the female at mating. The second is what I refer to as female reproductive proteins (FRPs), i.e. proteins expressed in the female reproductive tract whose expression level changes after mating. Molecular data for these come from Sayadi et al., 2019 [28] and Bayram et al., 2019 [33], and expression data from Immonen et al., 2017 [26].

These proteins are important, as they participate in mating between the sexes. Mating can be seen as a form of cooperation between males and females in order to increase their fitness. Under monogamy, where every individual mates with only one other individual, the reproductive success of both sexes is equal. This is not true under polygamy, where males and females mate multiply [34], because of differences that exist between males and females in attributes such as gamete size, parental care, female and male choice and sperm competition [35]. These differences dictate a difference in the fitness optima of the sexes, resulting in sexual conflict, which can manifest before and/or after mating [34, 35].

The SFPs and FRPs may mediate post-mating sexual conflict through their function. As reviewed in Sirot et al., 2015 [36], SFPs are known to affect remating, sperm utilisation, egg production, food intake, activity, immune response and the life span of females after mating. The influence of SFPs on female physiology can benefit the reproductive success of males at a cost to female fitness. Under this scenario, one can assume that the reproductive proteins of females have evolved to counter the effect of the SFPs, resulting in an arms race known as sexually antagonistic coevolution [34].
This means that a higher rate of evolution (rate of amino acid substitutions) is expected in SFPs and FRPs than in the average protein encoded by the genome.

Given the above, elevated diversity levels in SFPs can be considered evidence of post-mating selection, but there can be a simpler explanation. Reproductive genes tend to have sex-specific expression, and this alone is enough for any beneficial mutation to be fixed, or any deleterious mutation purged, with a lower probability than a mutation in a universally expressed gene. This can result in a two-fold increase in gene diversity [37]. Furthermore, in the case of multiply inseminated females, sperm competition is expected to increase the diversity of the SFPs even more [37]. The intensity of sperm competition and the strength of selection acting on the sperm genes are affected by the harmonic mean number of mates per female (H). The more intense the sperm competition, the higher the diversity of genes affected by it. Diversity levels of genes under sperm competition approach the diversity of genes under sex-specific expression only when H is really high, about 8-10 [37], meaning that diversity is usually expected to be elevated compared to the case of sex-specific expression alone.

Unlike FRPs, SFPs have been studied extensively. SFPs are products of accessory sexual glands, such as the prostate glands, epididymides and seminal vesicles of mammals [38] and the male accessory glands and ejaculatory ducts of insects [39]. These proteins can be proteases and protease inhibitors, sugar-binding lectins, cysteine-rich secretory proteins, antimicrobial and antioxidant proteins, coagulation proteins such as transglutaminases, small peptides and larger prohormone-like molecules [40]. For their counterparts in females, the FRPs, data are scarce. Since the SFPs can have an effect on females, they should interact with some proteins in the female reproductive tract, and this interaction should mediate some signal transduction process. This is why SFPs and FRPs are predicted to have a ligand-receptor or enzyme-substrate relationship [36]. To my knowledge, the only case where a protein has been characterised as an FRP, or a female effector of an SFP, is the sex peptide receptor of Drosophila melanogaster [41].

The beetle species studied here exhibit sexual dimorphism, which can be a result of sexual selection [35], and they are polygamous. Thus, the reproductive proteins of these species are good candidates for testing the evolutionary scenarios discussed here and for identifying interesting patterns of these processes.

1.4 Identifying orthologues

1.4.1 Homology

Before talking about orthologous proteins, we must first discuss homology. Homology refers to the "sameness" of "things", which may be anything from arbitrary nucleotide sequences, to proteins, to the limbs of tetrapods. What connects these different "things", and thus identifies them as homologous, is a shared evolutionary history, or in other words, shared ancestry. Returning to DNA or protein sequences, new subtypes of homology emerge depending on the way that pairs of homologues are related to each other. Two sequences can be orthologues, after a speciation event; paralogues, after a duplication event in the genome; ohnologues, after homopolyploidization or autopolyploidization events; homoeologues, after allopolyploidization events; and xenologues, after lateral gene transfer events [42].

1.4.2 Orthology and Paralogy

Because of the way orthology is defined, i.e. sequences resulting from a speciation event, it is implied that subsequent duplication events do not affect the relationship of the sequences. As a consequence, one protein may have more than one orthologue in another species, and orthology can therefore describe not only a one–one relationship, but also one–many, many–one and many–many relationships. Since one protein can have many orthologues in another species, it is also implied that orthology is independent of the position of the genes in the genome.

Consider an example where two species, A and B, have inherited from their common ancestor a gene that was subsequently duplicated in species B. Species A will have one copy of the gene, G, and species B will have two copies, G1 and G2. G1 and G2 are paralogues of each other and both of them are homologous to G. Assuming that G1 has remained in the ancestral locus, G2 resides in a new locus and is still an orthologue of G.

Another thing to note is that paralogy is still relevant for genes found in different species. Humans and mice both have two types of hemoglobin, α and β. The human hemoglobin α is a paralogue of the mouse hemoglobin β, as those two genes are the result of a gene duplication in a common ancestor of humans and mice. The same holds for the human hemoglobin β and the mouse hemoglobin α.

All of the above leads us to the conclusion that orthology and paralogy are non-transitive relationships. This means that we cannot infer the relationship of two genes from their relationships to a third gene. More specifically, if genes A and B are orthologues and so are genes B and C, it does not follow that genes A and C are orthologues. We can use the first example with the G genes: G of species A is orthologous to both G1 and G2 of species B, but G1 and G2 are paralogues. The same applies to paralogy. Human hemoglobin α is a paralogue of both human hemoglobin β and mouse hemoglobin β, but the latter two are orthologues of each other.
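The hemoglobin example can be checked mechanically. The following is a minimal sketch, not part of the analysis pipeline of this work: it assumes a hand-written gene tree in which the α/β duplication predates the human-mouse split, and it classifies each pair of genes by the event at their last common ancestor. The tree encoding and the leaf names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical gene tree for the hemoglobin example, written as nested tuples
# (event, left, right); leaves are "species|gene" strings.
# Assumption: the alpha/beta duplication predates the human-mouse speciation.
GENE_TREE = (
    "duplication",
    ("speciation", "human|alpha", "mouse|alpha"),
    ("speciation", "human|beta", "mouse|beta"),
)

def leaves(node):
    """Return the list of leaf labels below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def pair_relations(node, out=None):
    """Label each pair of leaves by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "orthologues" if event == "speciation" else "paralogues"
    # Pairs separated by this node have it as their last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    pair_relations(left, out)
    pair_relations(right, out)
    return out

relations = pair_relations(GENE_TREE)
for a, b in combinations(leaves(GENE_TREE), 2):
    print(f"{a:12s} {b:12s} {relations[frozenset((a, b))]}")
```

In the printed table, human α is a paralogue of both β genes, while the two β genes are orthologues of each other, which is the non-transitivity described above.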

1.4.3 Orthogroups

Gene duplication and gene loss events occur quite often in the evolution of an ancestral gene, so it may be difficult to identify one-one orthologues for comparative analyses of many taxa. It is therefore more convenient to identify orthogroups instead of orthologues. An orthogroup is a set of genes that have descended from a single ancestral gene, and so it includes both orthologous and paralogous genes. By identifying orthogroups we can keep a complete dataset, as we do not have to discard paralogous genes to keep a one-one orthologue relationship, and in comparative analyses we can easily define our unit of comparison, meaning the taxon or taxa we are interested in, and compare it to the rest of the taxa. For example, if one is interested in a set of proteins in C. maculatus, by identifying their orthogroups in a set of related species one can find orthologues of the C. maculatus proteins in all species and obtain information on duplication events.

[Figure 1.3: gene tree with nodes labelled "Duplication event" and "Speciation event"; tips are C in species A, Cα in species B and Cβ in species B.]

Figure 1.3. A hypothetical gene tree. Gene C is found in species A, while genes Cα and Cβ are found in species B.

Two types of orthogroups are recognised [42, 43]. The first is complete orthogroups, or hierarchical orthogroups, which are discussed above and contain both orthologous and paralogous genes, all descended from a single gene in some common ancestor of the species examined. The second is strict orthogroups, which are defined as sets of genes in which the relationship between any two genes is always that of orthology. More simply, they are one-one orthologous gene sets. As mentioned, these sets do not necessarily contain the full data, because one-one orthology breaks down after duplication events and paralogous genes are then excluded.

The methods for inferring orthogroups can be divided into tree-based and graph-based approaches. In tree-based approaches, orthogroup inference can be done via gene tree–species tree reconciliation using parsimony or maximum likelihood methods, so the species tree must be known in this case. Another way is the species overlap approach, where internal nodes of gene trees are labelled either as speciation or as duplication events, depending on whether a species appears in more than one of the subtrees below an internal node (a duplication event in this case). With this approach the species tree can be fully unresolved. Graph-based approaches rely on the fact that, between two species, the pairs of genes that have diverged the least tend to be orthologues, and the rest of the pairs would be paralogues. This is because the paralogues result from an earlier duplication event and have thus had more time to diverge than the pairs that arose from speciation.
For example, species A has a gene C and two homologues have been identified in species B, genes Cα and Cβ. If we build the gene tree (figure 1.3), we see that gene C is most closely related to Cα, and this clade is in turn related to Cβ. Given that topology, there must have been a duplication event prior to the speciation event, and the orthologue of C should be Cα, while Cβ should be a paralogue of C. Graph-based methods avoid inferring trees; instead they make pairwise comparisons of the sequences and build graphs with genes as vertices and some measure of sequence similarity between the genes as edge weights. Clustering algorithms then produce the orthogroups from such graphs.
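The species overlap approach described above can be sketched on the gene tree of figure 1.3. This is an illustrative toy implementation, not the algorithm of any particular program: the nested-tuple encoding of the tree and the function names are assumptions made for the example. A node is labelled as a duplication when the species sets of its two subtrees share at least one species, and as a speciation otherwise.

```python
# Gene tree of figure 1.3 as nested tuples; leaves are (species, gene) pairs.
# The encoding and the gene names are hypothetical, for illustration only.
GENE_TREE = ((("A", "C"), ("B", "C_alpha")), ("B", "C_beta"))

def is_leaf(node):
    """A leaf is a (species, gene) pair of strings."""
    return isinstance(node[0], str)

def species_below(node):
    """Set of species found among the leaves below a node."""
    if is_leaf(node):
        return {node[0]}
    return species_below(node[0]) | species_below(node[1])

def genes_below(node):
    """List of gene names found among the leaves below a node."""
    if is_leaf(node):
        return [node[1]]
    return genes_below(node[0]) + genes_below(node[1])

def label_events(node):
    """Return (left genes, right genes, event) for each internal node, root first."""
    if is_leaf(node):
        return []
    left, right = node
    # Species overlap rule: shared species between the two subtrees => duplication.
    overlap = species_below(left) & species_below(right)
    event = "duplication" if overlap else "speciation"
    here = [(genes_below(left), genes_below(right), event)]
    return here + label_events(left) + label_events(right)

events = label_events(GENE_TREE)
for left_genes, right_genes, event in events:
    print(left_genes, "|", right_genes, "->", event)
```

The root is labelled as a duplication because species B occurs on both sides, and the inner node joining C and Cα is labelled as a speciation, in agreement with the reading of figure 1.3 given in the text. No species tree is needed at any point.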

In this work I used the program Orthofinder 2.4 [44] to identify hierarchical orthogroups. The program uses a graph-based approach to perform the task.

1.5 Identifying selection

1.5.1 dN and dS ratios

One way to detect signals of positive selection in proteins is to estimate the ratio between the non-synonymous (dN) and synonymous (dS) substitution rates in the nucleotide sequences of the genes encoding the proteins. This ratio is referred to as ω. If a protein is of no importance for the fitness of an organism, then selection should not act on its amino acid sequence, meaning that no amino acid residue is favoured over any other at any site. The rate of non-synonymous nucleotide substitutions should then not differ from that of synonymous ones, so that ω = dN/dS = 1. If a protein has an important effect on the fitness of an organism, then amino acid substitutions can have a negative effect, either by disrupting its function completely or by reducing the efficiency with which the function is performed. Alternatively, amino acid substitutions can have a positive effect, by creating new variants that are beneficial because they perform their function more efficiently or gain a new beneficial function. The former is a case of purifying or negative selection, for which dN < dS and ω < 1. The latter is a case of positive selection, for which dN > dS and ω > 1 [45].

If we view a protein molecule as evolving under one of the above scenarios, we assume that this protein has a single ω ratio for all its amino acid sites. But this is not a reasonable assumption to make. Substitutions affect the function of a protein molecule by affecting its structure and its stability. Amino acid substitutions at sites with no apparent direct functional role, far away from the active sites or binding regions of proteins, have been shown to result in a less stable molecule [46, 47, 48, 49, 50, 51].
Also, certain sites of a protein may be important for forming hydrogen bonds and hydrophobic or ionic interactions through their side chains. Should substitutions happen at such sites in molecules important for fitness, structure and stability are compromised and negative selection will most likely remove them from the gene pool. These sites will be conserved and end up with an ω ratio close to 0. Then there are sites that evolve neutrally, at which non-synonymous substitutions affect the organism neither negatively nor positively; such sites will have ω equal or close to 1. Finally, there are sites under positive selection, at which amino acid substitutions increase fitness, so that ω>1 for them [52].

Figure 1.4. Pathways of codon substitutions. Two cases of two different ancestral codons, TTC and CTA, leading to two different substitutions in two different species, TTA in species A and CTC in species B. When applying a codon substitution model, the two scenarios have different likelihoods to have occurred. Picture from: Ziheng Yang and Rasmus Nielsen, Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models. Mol. Biol. Evol. 17(1):32–43. 2000.

Averaging ω over all sites would dilute signals of selection, as it is improbable that most sites of a protein are positively selected and have ω>1. This was the case for Hughes and Nei (1988) and Hughes et al. (1990), who identified positive selection in the human major histocompatibility complex (MHC) loci. When ω was averaged over all sites it was below 1 [53, 54]. To overcome this, they relied on data on the tertiary structure of the molecules to infer which sites might be positively selected. In this way they found an ω significantly higher than 1 for the antigen recognition site (ARS) of the MHC molecules.
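The dilution effect can be illustrated with a small numerical sketch. The proportions and ω values below are hypothetical and are not taken from any of the cited studies; a proportion-weighted mean is also only a rough stand-in for what a one-ratio model would estimate from real data.

```python
# Hypothetical site classes for one protein: (proportion of sites, omega).
site_classes = [
    (0.80, 0.05),  # strongly conserved sites
    (0.15, 1.00),  # neutrally evolving sites
    (0.05, 4.00),  # positively selected sites
]

def mean_omega(classes):
    """Proportion-weighted average of omega over the site classes."""
    total = sum(p for p, _ in classes)
    return sum(p * w for p, w in classes) / total

average = mean_omega(site_classes)
selected = [(p, w) for p, w in site_classes if w > 1]

print(f"average omega over all sites: {average:.2f}")
print(f"site classes with omega > 1: {selected}")
```

Even though 5% of the sites in this toy protein have ω = 4, the average over all sites stays well below 1, so a single gene-wide ω would not reveal the positively selected class.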

Counting synonymous and non-synonymous differences

Given the nucleotide sequences that encode homologous proteins in two species, we want to determine whether positive selection has acted by estimating the ω ratio. When a codon differs by one base, it is easy to infer that one substitution happened (assuming that there were no intermediate substitutions; for example, if ATC->ATG->ATT happened to a triplet, we would infer only ATC->ATT) and to determine whether it is a transition or a transversion, and synonymous or non-synonymous. Problems arise when codons differ by more than one base, as shown in figure 1.4. Species A has the TTA triplet and species B has the CTC triplet. For this to have happened, the common ancestor of the two species had either a TTC triplet, with a non-synonymous substitution in each extant species, or a CTA triplet, with a synonymous substitution in each species. It is therefore necessary to infer the ancestral sequence by weighting each route of substitutions. This can be done with parsimony reconstruction methods, where routes with fewer substitutions are weighted as more probable than those with more. The problem is that such reconstructed ancestral sequences are not real observed data, and the method prefers the scenarios where evolution took place in the smallest number of steps, which is not necessarily a correct assumption to make. Parsimony also does not consider some properties of DNA, such as unequal rates of transitions and transversions and biased base frequencies.

On the other hand, maximum likelihood methods account for these characteristics of DNA and make no assumption about how evolution has progressed, as they average over all possible ancestral states using a codon substitution model. This substitution model is inferred from the data, and other parameters such as branch lengths are estimated from all sites and are then fixed when estimating the ω ratio for each site. Because this would be computationally heavy for long sequences, with many parameters to be estimated, a third approach, the Bayesian or empirical Bayes approach, is used: a prior distribution of ω is assigned and a posterior distribution of ω is calculated given the data. While these approaches give estimates of selection for each site, to detect positive selection on a protein we should apply a correction for multiple testing, such as the Bonferroni correction [55].
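The two pathways of figure 1.4 can be enumerated with a short sketch. It assumes the standard genetic code and only lists the shortest paths of single-base changes between two codons, labelling each step as synonymous or non-synonymous; it does not weight the paths, which is what the parsimony and likelihood methods discussed above do. The function name is hypothetical.

```python
from itertools import permutations

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons in TCAG order mapped to one-letter amino acids
# ("*" marks a stop codon).
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def pathways(codon_a, codon_b):
    """List every shortest path of single-base changes from codon_a to codon_b.

    Each path is a list of (from_codon, to_codon, kind) steps, where kind is
    'synonymous' or 'non-synonymous'. Paths through stop codons are not
    filtered out in this sketch.
    """
    differing = [i for i in range(3) if codon_a[i] != codon_b[i]]
    paths = []
    for order in permutations(differing):
        current, steps = codon_a, []
        for pos in order:
            nxt = current[:pos] + codon_b[pos] + current[pos + 1:]
            same = CODON_TABLE[current] == CODON_TABLE[nxt]
            steps.append((current, nxt, "synonymous" if same else "non-synonymous"))
            current = nxt
        paths.append(steps)
    return paths

paths = pathways("TTA", "CTC")
for path in paths:
    print(" ; ".join(f"{a}->{b} ({kind})" for a, b, kind in path))
```

For TTA and CTC the sketch finds one path through CTA, with two synonymous steps, and one through TTC, with two non-synonymous steps. The intermediate codons correspond to the two candidate ancestral codons of figure 1.4.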

Codon substitution models in codeML

The program codeML of the PAML package [56], which is used in this work, uses a maximum likelihood method to estimate dN/dS. Four types of codon substitution models that describe the ω ratios can be implemented: the basic model, branch models, site models and branch-site models. The basic model assumes one ω ratio for all sites and lineages. It is specified by setting model = 0 and NSsites = 0 in the control file codeml.ctl. It is not a realistic model, but it is useful as a null model against which to compare the branch and site models. The branch models allow the ω ratio to vary among the branches of a phylogeny and are thus useful for detecting selection in particular lineages. They can be used by setting model = 1 or model = 2 and NSsites = 0 in the control file codeml.ctl. When model is set to 1, every branch is assumed to have its own ω ratio, but with many branches this can be computationally heavy. Alternatively, the user can set model = 2 and specify which branches should have different ratios by using branch labels in the tree file. The site models assume that protein sites have different ω ratios and are thus more useful for detecting selection in proteins than the basic model. These models can be used by setting model = 0 and NSsites to 1, 2, 7 or 8 in the control file codeml.ctl (NSsites = 0 gives the basic model). Several of these models can be run in one go by typing NSsites = 0 1 2 7 8. Model comparison can be done via likelihood ratio tests [57].
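As a rough illustration of the settings described above, the sketch below writes a minimal control file and computes a likelihood ratio test from two log-likelihoods. It is not the configuration used in this work: the file names and log-likelihood values are hypothetical, a real codeml run usually sets further options (for example CodonFreq or fix_omega) that are omitted here, and the test only handles 1 or 2 degrees of freedom, using the closed forms of the chi-square tail probability.

```python
import math

def make_ctl(seqfile, treefile, outfile, model=0, nssites=(0,)):
    """Return the text of a minimal codeml control file (illustrative subset)."""
    settings = {
        "seqfile": seqfile,
        "treefile": treefile,
        "outfile": outfile,
        "seqtype": 1,      # 1 = codon sequences
        "model": model,    # 0 = one ratio, 1 = free ratios, 2 = labelled branches
        "NSsites": " ".join(str(n) for n in nssites),
    }
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) from a chi-square with 1 or 2 d.f."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0:
        return statistic, 1.0
    if df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("only df = 1 or df = 2 are handled in this sketch")
    return statistic, p_value

# Site models 1 and 2 in one run; file names are hypothetical.
ctl_text = make_ctl("alignment.phy", "tree.nwk", "results.txt", model=0, nssites=(1, 2))
print(ctl_text)

# Hypothetical log-likelihoods for a null and an alternative site model,
# compared here with 2 degrees of freedom (an assumption of this example).
statistic, p_value = likelihood_ratio_test(-2450.3, -2441.8, df=2)
print(f"2*delta lnL = {statistic:.2f}, p = {p_value:.2e}")
```

The degrees of freedom should equal the difference in the number of free parameters between the two nested models being compared.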

1.5.2 McDonald-Kreitman test

McDonald and Kreitman [58] devised a neutrality test (MK test) that compares the ratio of non-synonymous to synonymous substitutions found between species with the ratio of non-synonymous to synonymous polymorphisms found within populations. Deviations from the null hypothesis of neutrality are interpreted as positive, purifying or balancing selection.

To establish their null hypothesis under neutrality, McDonald and Kreitman considered a phylogeny with alleles from more than one species in the absence of recombination. The relationships of the alleles can be represented in one phylogenetic tree with branches categorised as between-species branches and within-species branches. Mutations that have occurred on the between-species branches are fixed differences between the species, and mutations on the within-species branches are polymorphisms within the species. Furthermore, they considered substitutions in coding sequences as being either synonymous or replacement (non-synonymous). Let Mr be the number of possible neutral replacement mutations, Ms the number of possible neutral synonymous mutations, Tb the time on the between-species branches, Tw the time on the within-species branches and µ the nucleotide mutation rate per site, with equal chance of one nucleotide turning into any of the three others. From the neutral theory they then defined the expected numbers of fixed replacement (dN), fixed synonymous (dS), polymorphic replacement (pN) and polymorphic synonymous (pS) substitutions as dN = Tb(µ/3)Mr, dS = Tb(µ/3)Ms, pN = Tw(µ/3)Mr and pS = Tw(µ/3)Ms. Given that the number of synonymous substitutions is independent of the number of replacement substitutions, we have:

$$\frac{dN}{dS} = \frac{T_b(\mu/3)M_r}{T_b(\mu/3)M_s} = \frac{M_r}{M_s}$$

$$\frac{pN}{pS} = \frac{T_w(\mu/3)M_r}{T_w(\mu/3)M_s} = \frac{M_r}{M_s}$$

So under neutrality it should be dN/dS = pN/pS. This condition can be tested for statistical significance by using an exact test, as recommended by McDonald [59].

A neutrality index (NI), as coined by Rand and Kann [60], can be calculated in order to interpret the relationship of the two ratios:

$$NI = \frac{pN/pS}{dN/dS}$$

When NI = 1 we have a case of neutrality. When NI < 1, non-synonymous substitutions are fixed at a higher rate than expected under neutrality, which is the signature of positive selection. When NI > 1, there is an excess of non-synonymous polymorphism relative to divergence: replacements segregate within species but are purged before they become fixed. This pattern is expected under purifying selection against slightly deleterious variants and also under balancing selection, which keeps polymorphism high.

NI can also fall below 1 for reasons other than selection, under specific assumptions about the demography of the species considered. If the populations of all species were smaller in the past and have expanded until today, slightly deleterious replacements will remain fixed between species (dN), because they were effectively neutral, while slightly deleterious replacements within species (pN) will be purged more effectively as population sizes grow [58]. In the reverse situation, with large ancestral populations decreasing in size until today, slightly deleterious replacements between species were purged more effectively in the past, while slightly deleterious replacements within species remain because selection is less effective [61]. The result is NI > 1, which resembles the signal of balancing selection.
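The calculation of the neutrality index and an exact test on the underlying 2×2 table can be sketched in Python as follows. This is an illustration only: the function names and the counts are hypothetical, and Fisher's exact test is assumed here as one common choice of exact test, since the text does not say which exact test is meant.

```python
from math import comb


def neutrality_index(pn, ps, dn, ds):
    """NI = (pN/pS) / (dN/dS); undefined when ps, dn or ds is zero."""
    return (pn / ps) / (dn / ds)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric weight of the table with top-left cell k.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum the weights of all tables that are at most as likely as the observed one.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)


# Hypothetical counts: fixed (dN, dS) and polymorphic (pN, pS) substitutions.
dn, ds, pn, ps = 12, 20, 3, 30
print("NI =", round(neutrality_index(pn, ps, dn, ds), 3))
print("p  =", round(fisher_exact_two_sided(dn, ds, pn, ps), 4))
```

The table is laid out with fixed and polymorphic substitutions as rows and replacement and synonymous substitutions as columns, so a small p-value indicates that dN/dS and pN/pS differ.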

1.5.3 Tajima’s D

The amount of genetic variation at a neutral locus under random mating is equal to θ = 4Neµ, with Ne the effective population size and µ the mutation rate per generation [45]. In 1989, Tajima developed a statistical method to test for neutrality by comparing two estimates of θ [62]. The first estimate is the one known as Watterson’s theta, θs [63], which can be estimated as

$$\theta_s = \frac{S_n}{L\,\alpha_n}, \qquad \alpha_n = \sum_{i=1}^{n-1} \frac{1}{i},$$

with Sn the number of polymorphic sites found in n sequences. The second estimate is the average number of pairwise nucleotide differences among the n sequences, θπ [64]. Tajima defined the statistic D as:

$$D = \frac{\theta_\pi - \theta_s}{\sqrt{\mathrm{var}(\theta_\pi - \theta_s)}}$$

Under neutrality it should be that θs = θπ = θ = 4Neµ and thus D = 0. Under purifying selection, however, deleterious polymorphisms will tend to inflate θs, while their effect on θπ will be minor. This is because θs does not take the frequency of a polymorphism into account, so neutral and deleterious polymorphisms are weighted equally, whereas θπ does take frequency into account. The same pattern occurs under positive selection, with selected sites being near fixation and the few non-selected polymorphisms inflating θs. So when purifying or positive selection operates, D < 0. When D > 0, θπ > θs; this happens when there is a lack of low-frequency polymorphisms and an excess of intermediate-frequency polymorphisms, a pattern consistent with balancing selection.

It should be noted that changes in population size also affect the D statistic. After a population bottleneck, it is likely that low-frequency variants will be lost, pushing the statistic towards positive values. Population expansion, in contrast, results in many new polymorphisms as singletons (a single occurrence and thus low frequency) and pushes D towards negative values.
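The behaviour of D described above can be illustrated with a small Python sketch. It is a hedged example, not the procedure used to produce the estimates in this work: the function name and the toy alignments are hypothetical, the variance constants are those of Tajima's published estimator, and both θ estimates are expressed per sequence rather than per site, so the sequence length cancels.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned, equal-length sequences without gaps."""
    n = len(seqs)
    # S: number of segregating (polymorphic) sites in the sample.
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if n < 4 or s == 0:
        raise ValueError("need at least 4 sequences and 1 polymorphic site")
    # theta_pi: mean number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    theta_pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # theta_s: Watterson's estimate, S divided by the harmonic number a1.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_s = s / a1
    # Constants of the variance of (theta_pi - theta_s).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (theta_pi - theta_s) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy data: every variant is a singleton (excess of rare variants).
singletons = ["TAAAAA", "ATAAAA", "AATAAA", "AAATAA", "AAAATA", "AAAAAT"]
# Toy data: two haplotypes at equal frequency (excess of intermediate variants).
balanced = ["AAAA"] * 3 + ["TTTT"] * 3
print(tajimas_d(singletons), tajimas_d(balanced))
```

The first toy alignment gives a negative D and the second a positive D, matching the interpretation given in the text.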

1.6 Identifying co-evolution

As mentioned in Section 1.3, co-evolution between male- and female-derived proteins is expected under sexually antagonistic selection. Co-evolution is defined by Thompson [65] as “reciprocal evolutionary change in interacting species”. This definition can be extended to molecules too: changes in one locus affect the selective pressure on another, so reciprocal changes could be happening perpetually in both, assuming some protein-to-protein interaction.

Because interactions between proteins are mediated via their binding surfaces, amino acid substitutions in such areas can alter the specificity and strength of the interaction. This can lead to patterns of correlated evolution, reviewed by Lovell and Robertson, 2010 [66]. These patterns include correlated presence/absence of orthologous proteins in various taxa, correlated evolution across whole sequences and correlated change at specific sites. In this work I examine correlated evolution across whole sequences by correlating omega values per branch of the phylogeny of the beetle species. Each branch-omega value is the ratio of non-synonymous to synonymous substitution rates, averaged across the whole sequence, for one species. Thus, I test for correlation of branch-omega values between pairs of strict orthogroups, with the idea that significant correlations arise from protein-to-protein interactions that are conserved across the phylogeny of the species.

It is important to note that co-evolution implies direct interaction between two or more proteins, but correlated evolution does not. As reviewed in Lovell and Robertson, 2010 [66], correlated evolutionary rates, and thus patterns of correlated evolution, can arise from external factors such as protein dispensability (how much fitness is reduced when the protein loses its function), the overall protein structure, the developmental stage in which the protein is expressed, the breadth of expression and the expression level. Also, any selective constraint imposed by environmental and ecological factors on a number of proteins or traits at a certain branch of the phylogeny could increase the correlation between the evolutionary rates of those genes, again giving a pattern of correlated evolution. Patterns of correlated evolution can therefore be detected between non-interacting proteins, so it is important to quantify the effect of these confounding factors in order to identify correlations that are due only to protein-to-protein interactions. To account for this I use regression models with omega values as the response variable and expression levels and sex-biased expression in C. maculatus as the predictors.

2. Methods

Coding sequences of the proteomes of C. maculatus, C. analis, C. chinensis and A. obtectus were obtained (unpublished data), resulting in four FASTA files (CDS FASTA). Based on the expression data from Immonen et al., 2017 [26] and Sayadi et al., 2019 [28], I gathered sets of C. maculatus genes for digestive proteins, female reproductive proteins (FRPs) and seminal fluid proteins (SFPs). As a fourth set, single-copy genes that are conserved across the Arthropod phylum (Busco genes) were identified in the proteome of C. maculatus. The digestive protein set contained 741 genes, the FRP set 126, the SFP set 185 and the Busco set 1137.

Hierarchical orthogroups were identified among the four proteomes using Orthofinder version 2.4 [44]. Nucleotide coding sequences of the proteomes were translated into amino acid sequences using the program transeq of EMBOSS v6.6.0.0. More information about Orthofinder can be found on the author’s GitHub page. The hierarchical orthogroups that contained gene IDs of C. maculatus from the gene sets described above were extracted using a custom python script (orthogroup_fetcher.py). Because the orthogroups created by Orthofinder contain protein sequences, a python script was written to retrieve the nucleotide sequence of each protein sequence from the CDS FASTA files (protein_2_nucleotide.py). The script change_fasta_title.py was run to ensure proper formatting for phylopypruner. Next, a script was used to filter out any C. maculatus sequence that was not in the lists of the desired gene sets (cleaning_c_mac.py). This was done to ensure that orthologues would be found only for the desired C. maculatus genes. Then, the genes in each orthogroup were aligned with MAFFT v7.453 [67] using the L-INS-i option. Gene trees were created using a script that executed Fasttree version 2.1.11 [68] for each orthogroup (run_fasttree.py).
After creating the alignments and their trees, the hierarchical orthogroups were converted to strict orthogroups using phylopypruner 0.9.7. Prefiltering removed: any sequence shorter than 100 base pairs; any sequence with a branch length longer than five times the standard deviation of all branches within a hierarchical orthogroup, in order to avoid long-branch attraction; sequences that belong to different taxa but have very short branch lengths, in order to avoid cases of cross-contamination; and any sequence with a support value below 70%. Trees were rooted with A. obtectus as the outgroup. Paralogy pruning was done using the maximum inclusion method. Postfiltering required a minimum of four taxa, forced the inclusion of C. maculatus sequences in the orthogroups and removed any sequence with gaps occupying more than 20% of the alignment length. More information about phylopypruner can be found on the author’s GitLab page.

For each strict orthogroup the FASTA file was renamed appropriately and moved to its own folder using a bash script (rename_move_files.sh). For each FASTA file, a tree with the topology of the unrooted species tree, according to the PAML manual [57], was created using another python script (treemaker.py).

To model dN and dS ratios, codeML 4.8a of the PAML package [56] was used via the ETE3 python framework [69]. For each strict orthogroup, the FASTA alignment of the CDS sequences and the species tree created by treemaker.py were given as inputs. The models run were M0, M1a, M2a, M3, M7 and M8 of the site models, the free-ratios model (fb) of the branch models, and the relaxation (bsA1) and positive-selection (bsA) models of the branch-site models. The branch models were run 4 times for each orthogroup, each time with one species as the foreground branch and the remaining three as background branches. Likelihood ratio tests (LRTs) were performed between the model pairs M0-M1a, M0-M7, M1a-M2a, M7-M8, M0-M3, M0-fb, M1-bsA1 and bsA1-bsA. The parameter cleandata was set to 1 in order to remove gaps and ambiguity characters from the alignments.

Python scripts were written to extract the omega values from the codeML analyses and the preferred models from the LRTs. Omega values for the M0 model were extracted for all orthogroups. Then omega values of the M0, M1a and M2a models were extracted, with priority given to the most parameter-rich model that passed the LRT. The same was done for the M0, M7 and M8 models. This was done in order to pick the best model for the omega values of each orthogroup.

The data were imported into the R software environment for statistical computing and graphics [70] and manipulated with the tidyverse package [71]. Correlations between omegas of the free-ratios branch model of different orthogroups were calculated using the corrr package [72]. Proteins passing the LRT for a selection model were annotated with InterProScan v.5.51-85.0 [73].
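The priority rule for choosing between nested site models can be sketched in Python as below. This is a hypothetical illustration rather than the script used in this work: the function name is invented, and the significance level of 0.05 is an assumption, as the threshold is not stated above.

```python
def pick_site_model(p_simple_vs_mid, p_mid_vs_rich,
                    names=("M0", "M1a", "M2a"), alpha=0.05):
    """Return the most parameter-rich model that passes its LRT.

    p_simple_vs_mid: p-value of the LRT between the first two models.
    p_mid_vs_rich:   p-value of the LRT between the last two models.
    alpha:           significance level (assumed, not taken from the text).
    """
    simple, mid, rich = names
    if p_mid_vs_rich < alpha:
        return rich    # e.g. M1a-M2a passed, so use M2a
    if p_simple_vs_mid < alpha:
        return mid     # e.g. only M0-M1a passed, so use M1a
    return simple      # neither LRT passed, so fall back to M0


print(pick_site_model(0.20, 0.01))                             # M2a
print(pick_site_model(0.01, 0.30))                             # M1a
print(pick_site_model(0.50, 0.50, names=("M0", "M7", "M8")))   # M0
```

The same function covers the M0, M7 and M8 series by passing a different tuple of model names.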

3. Results

3.1 Orthofinder

The four proteomes of C. maculatus, C. analis, C. chinensis and A. obtectus were given as input to Orthofinder. Starting from a total of 218,157 genes, Orthofinder assigned 88.7% of them (193,446) to orthogroups, leaving 24,711 genes unassigned. A total of 28,637 orthogroups were recognised, 7,396 of them species-specific and containing about 20.7% (45,207) of the total number of genes. There were 8,891 orthogroups present in all four species and 1,300 single-copy orthogroups. Tables 3.1, 3.2, 3.3 and 3.4 sum up the results in detail (OG stands for orthogroup).

Of the 28,637 orthogroups identified, 1,129 contained sequences annotated as BUSCO genes, 709 contained genes annotated as digestive enzymes, 123 contained genes annotated as FRPs and 173 contained genes annotated as SFPs.

3.2 phylopypruner

Phylopypruner was used to convert the hierarchical orthogroups identified as containing BUSCO, digestive, FRP and SFP sequences into strict orthogroups. Additionally, filtering ensured that the distance between aligned sequences was smaller than five standard deviations and that data completeness (no gaps) was more than 80% of the sites of each alignment. This is enough to produce reasonable alignments and thus avoids manual curation of thousands of them. In particular, the BUSCO and digestive sets of alignments were not manually curated, but the SFP and FRP ones were checked manually.

Of the 1,129 BUSCO orthogroups, 792 passed the filtering criteria; of the 709 digestive orthogroups, 337 passed; of the 123 FRP orthogroups, 55 passed; and of the 173 SFP orthogroups, 68 passed. Missing data ranged from 11.6% to 29.9% among the sets of orthogroups before filtering and from 4.0% to 5.9% after filtering. More information about pre- and post-filtering and their strength can be found in Tables 3.5, 3.6, 3.7 and 3.8.

Table 3.1: Overall statistics for Orthofinder results. Species-specific orthogroup: An orthogroup that consists entirely of genes from one species. G50: The number of genes in the orthogroup such that 50% of genes are in orthogroups of that size or larger. O50: The smallest number of orthogroups such that 50% of genes are in orthogroups of that size or larger. Single-copy orthogroup: An orthogroup with exactly one gene (and no more) from each species. These orthogroups are ideal for inferring a species tree and many other analyses. Unassigned gene: A gene that has not been put into an orthogroup with any other genes.

Overall statistics for orthogroup assignment

| Statistic | Value |
|---|---|
| Number of species | 4 |
| Number of genes | 218,157 |
| Number of genes in orthogroups | 193,446 |
| Number of unassigned genes | 24,711 |
| Percentage of genes in orthogroups | 88.7 |
| Percentage of unassigned genes | 11.3 |
| Number of orthogroups | 28,637 |
| Number of species-specific orthogroups | 7,396 |
| Number of genes in species-specific orthogroups | 45,207 |
| Percentage of genes in species-specific orthogroups | 20.7 |
| Mean orthogroup size | 6.8 |
| Median orthogroup size | 4.0 |
| G50 (assigned genes) | 9 |
| G50 (all genes) | 8 |
| O50 (assigned genes) | 5,226 |
| O50 (all genes) | 6,683 |

Table 3.2: Classes of orthogroups (OGs) depending on the average number of genes per species. The majority of orthogroups contain either one gene for only some of the species (class <1) or on average one gene per species (class 1).

| Average number of genes per species in OG | Number of OGs | Percentage of OGs | Number of genes | Percentage of genes |
|---|---|---|---|---|
| <1 | 11145 | 38.9 | 26681 | 13.8 |
| 1 | 10254 | 35.8 | 53243 | 27.5 |
| 2 | 3796 | 13.3 | 34935 | 18.1 |
| 3 | 1581 | 5.5 | 20876 | 10.8 |
| 4 | 697 | 2.4 | 12040 | 6.2 |
| 5 | 385 | 1.3 | 8217 | 4.2 |
| 6 | 219 | 0.8 | 5537 | 2.9 |
| 7 | 166 | 0.6 | 4873 | 2.5 |
| 8 | 99 | 0.3 | 3293 | 1.7 |
| 9 | 63 | 0.2 | 2354 | 1.2 |
| 10 | 44 | 0.2 | 1831 | 0.9 |
| 11-15 | 107 | 0.4 | 5495 | 2.8 |
| 16-20 | 27 | 0.1 | 1923 | 1.0 |
| 21-50 | 39 | 0.1 | 4502 | 2.3 |
| 51-100 | 6 | 0.0 | 1706 | 0.9 |
| 101-150 | 5 | 0.0 | 2336 | 1.2 |
| 151-200 | 3 | 0.0 | 2052 | 1.1 |
| 201-500 | 1 | 0.0 | 1552 | 0.8 |
| 501-1000 | 0 | 0.0 | 0 | 0.0 |
| 1001+ | 0 | 0.0 | 0 | 0.0 |

Table 3.3: Number of genes assigned and unassigned to orthogroups (OGs) and number of OGs containing genes from one, two, three or all of the species studied.

| Number of species in OG | Number of OGs |
|---|---|
| 1 | 7,396 |
| 2 | 6,882 |
| 3 | 5,468 |
| 4 | 8,891 |

Figure 3.1. Counts and percentage of Likelihood ratio tests (LRTs) per type of protein and per model that passed or failed.

Table 3.4: Statistics per species for orthogroup assignment from Orthofinder. Check Table 3.1 for definitions of statistics.

| Statistics | C. maculatus | C. analis | C. chinensis | A. obtectus |
|---|---|---|---|---|
| Number of genes | 68,811 | 54,848 | 59,338 | 35,160 |
| Number of genes in OGs | 61,718 | 47,463 | 52,045 | 32,220 |
| Number of unassigned genes | 7,093 | 7,385 | 7,293 | 2,940 |
| Percentage of genes in OGs | 89.7 | 86.5 | 87.7 | 91.6 |
| Percentage of unassigned genes | 10.3 | 13.5 | 12.3 | 8.4 |
| Number of OGs containing species | 18,936 | 19,701 | 19,230 | 15,261 |
| Percentage of OGs containing species | 66.1 | 68.8 | 67.2 | 53.3 |
| Number of species specific OGs | 2,773 | 1,865 | 1,790 | 968 |
| Number of genes in species specific OGs | 17,226 | 10,504 | 12,557 | 4,920 |
| Percentage of genes in species specific OGs | 25.0 | 19.2 | 21.2 | 14.0 |

Table 3.5: Statistics for BUSCO set of genes as input and the output from phylopypruner.

BUSCO genes

Alignment statistics:

| Description | Input | Output |
|---|---|---|
| No. of alignments | 1,129 | 792 |
| No. of sequences | 6,935 | 3,168 |
| No. of OTUs | 4 | 4 |
| Avg no. of sequences / alignment | 6 | 4 |
| Avg no. of OTUs / alignment | 3 | 4 |
| Avg sequence length (ungapped) | 1,454 | 1,329 |
| Shortest sequence (ungapped) | 138 | 207 |
| Longest sequence (ungapped) | 9,633 | 6,360 |
| % missing data | 11.60 | 4.00 |
| Concatenated alignment length | 1,789,267 | 1,102,568 |

Methods summary:

| Description | No. removed | % of input |
|---|---|---|
| Short sequences | 0 | 0.00 |
| Long branches | 8 | 0.12 |
| Ultrashort distance pairs | 35 | 0.50 |
| Divergent sequences | 0 | 0.00 |
| Collapsed nodes | 490 | 7.07 |
| OTUs < occupancy threshold | 0 | 0.00 |
| Genes < occupancy threshold | 108 | 1.56 |

Table 3.6: Statistics for digestive set of genes as input and the output from phylopypruner.

Digestive genes

Alignment statistics:

| Description | Input | Output |
|---|---|---|
| No. of alignments | 709 | 337 |
| No. of sequences | 5,687 | 1,348 |
| No. of OTUs | 4 | 4 |
| Avg no. of sequences / alignment | 8 | 4 |
| Avg no. of OTUs / alignment | 3 | 4 |
| Avg sequence length (ungapped) | 1,491 | 1,391 |
| Shortest sequence (ungapped) | 69 | 138 |
| Longest sequence (ungapped) | 15,003 | 7,140 |
| % missing data | 22.60 | 5.90 |
| Concatenated alignment length | 1,296,224 | 502,787 |

Methods summary:

| Description | No. removed | % of input |
|---|---|---|
| Short sequences | 3 | 0.05 |
| Long branches | 8 | 0.14 |
| Ultrashort distance pairs | 40 | 0.70 |
| Divergent sequences | 0 | 0.00 |
| Collapsed nodes | 407 | 7.16 |
| OTUs < occupancy threshold | 0 | 0.00 |
| Genes < occupancy threshold | 89 | 1.56 |

Table 3.7: Statistics for female reproductive set of genes as input and the output from phylopypruner.

Female reproductive genes

Alignment statistics:

| Description | Input | Output |
|---|---|---|
| No. of alignments | 123 | 55 |
| No. of sequences | 983 | 220 |
| No. of OTUs | 4 | 4 |
| Avg no. of sequences / alignment | 7 | 4 |
| Avg no. of OTUs / alignment | 3 | 4 |
| Avg sequence length (ungapped) | 1,892 | 1,175 |
| Shortest sequence (ungapped) | 177 | 315 |
| Longest sequence (ungapped) | 13,908 | 3,762 |
| % missing data | 20.40 | 4.00 |
| Concatenated alignment length | 232,031 | 67,491 |

Methods summary:

| Description | No. removed | % of input |
|---|---|---|
| Short sequences | 0 | 0.00 |
| Long branches | 0 | 0.00 |
| Ultrashort distance pairs | 8 | 0.81 |
| Divergent sequences | 0 | 0.00 |
| Collapsed nodes | 68 | 6.92 |
| OTUs < occupancy threshold | 0 | 0.00 |
| Genes < occupancy threshold | 15 | 1.53 |

Table 3.8: Statistics for seminal fluid set of genes as input and the output from phylopypruner.

Seminal fluid genes

Alignment statistics:

| Description | Input | Output |
|---|---|---|
| No. of alignments | 173 | 68 |
| No. of sequences | 1,243 | 272 |
| No. of OTUs | 4 | 4 |
| Avg no. of sequences / alignment | 7 | 4 |
| Avg no. of OTUs / alignment | 3 | 4 |
| Avg sequence length (ungapped) | 1,286 | 1,073 |
| Shortest sequence (ungapped) | 66 | 267 |
| Longest sequence (ungapped) | 13,029 | 4,371 |
| % missing data | 29.90 | 5.50 |
| Concatenated alignment length | 289,672 | 77,620 |

Methods summary:

| Description | No. removed | % of input |
|---|---|---|
| Short sequences | 1 | 0.08 |
| Long branches | 0 | 0.00 |
| Ultrashort distance pairs | 11 | 0.88 |
| Divergent sequences | 0 | 0.00 |
| Collapsed nodes | 76 | 6.11 |
| OTUs < occupancy threshold | 0 | 0.00 |
| Genes < occupancy threshold | 21 | 1.69 |

3.3 Omega ratios

Likelihood ratio tests (LRTs) between a null model without selection and an alternative model with selection were performed for the site and branch-site models in order to reject the hypothesis that positive selection is absent. As expected, fewer LRTs were passed than failed. It is noteworthy that the conserved BUSCO set of genes passed the M1a vs M2a LRT and the branch-site LRT for A. obtectus more frequently (18.9% and 20.78% respectively) than the other sets of genes (1.5-2.94% and 5.99-8.82%). The numbers of site and branch-site LRTs that were passed can be found in Figure 3.1.

The site models are well suited to identifying positive and negative selection on particular sites of the amino acid sequence of the proteins. The simplest model is M0, which assumes that all sites of the proteins of all species evolve under one omega value. The distribution of omega values for this model is shown in Figure 3.2, panel A. The M7 model splits the sites into 8 classes of the same size and infers the omega value (ω < 1) of each class using a beta distribution. The M8 model is an M7 model that also assumes one extra class of sites with omega greater than one. The distribution of omega values is shown in Figure 3.2, panel B. The M1a model assumes two classes of sites, one with sites under purifying selection (ω < 1) and one with sites evolving under neutrality (ω = 1). The M2a model is an M1a model with an extra class of sites that evolve under positive selection (ω > 1). See panel C in Figure 3.2.

Not all models fit equally well: for a given alignment, one model may fit the data better and have a larger likelihood. The data for each protein in Figure 3.2 therefore come from the most parameter-rich model whose likelihood is high enough to pass the LRT against a less complex model. For example, when considering the M0, M1a and M2a models, if the LRT between M1a and M2a is passed, the M2a model is selected to provide the data for the sites of the protein. If the LRT rejects the M2a model but the M0-M1a LRT keeps the M1a model, then M1a is selected to provide the data. If the M0-M1a LRT also rejects the M1a model, then the M0 model provides the data for the alignment.

From the distributions of omega values among the different protein types in Figure 3.2 we can see that the median omega of the SFPs and the digestive enzymes is about 1.4-3 times greater than the median of the conserved BUSCO genes. The FRPs seem to be as conserved as the BUSCO genes when considering the M0, M7 and M8 models, or even more conserved when considering the M0 model alone or the M0, M1a and M2a models. Kolmogorov-Smirnov (KS) tests were performed to check whether the omega values of different protein types come from the same distribution.
For the M0, M1a and M2a set of estimates, KS tests did not detect a significant difference between the FRP and the Busco distribution (D = 0.18, p-value = 0.07) or between the SFP and the digestive distribution (D = 0.07, p-value = 0.81). All other pairwise comparisons were significant. Table 3.9 summarises the results of the KS tests.

Proteins passing some LRT for a site or branch-site model with positive selection were 92 digestive proteins, 9 FRPs and 26 SFPs. The majority of them were annotated by InterProScan v.5.51-85.0 [73] with information from the SUPERFAMILY (1.75), ProSiteProfiles (2019_11), ProSitePatterns (2019_11), Pfam (33.1) and GeneOntology databases. Figures 3.3 and 3.4 summarise the annotations extracted from the databases.

Figure 3.2. The distribution of omega values of classes of sites found in different site models. The position of the median for each distribution is indicated with a dotted line and the numbers over each distribution indicate the value of the median. A) All proteins are assumed to have one class of sites. B) Proteins are assumed to have nine classes of sites if they passed the M7-M8 LRT, eight classes if they passed the M0-M7 LRT or just one class of sites if neither LRT is passed. C) Proteins are assumed to have three classes of sites if they pass the M1a-M2a LRT, two classes of sites if they pass the M0-M1a LRT or just one class of sites if they pass neither LRT. This way, the most parameter-rich models of site evolution that are significantly more likely to describe the data are chosen for each protein. Omega values for all classes of sites for each protein are plotted.

Table 3.9: Summary table of two-sample Kolmogorov-Smirnov tests for comparing density distributions. Three different ways of estimating distributions of ω values for sites were used, based on model M0, on models M0, M1a and M2a, and on models M0, M7 and M8. For each way of modelling, the distributions of ω values for each protein type were compared to each other. The null hypothesis states that the two sample distributions compared come from the same distribution. The statistic D of the test is shown, i.e. the supremum of the absolute differences between the empirical cumulative distribution functions compared. More simply, it is the largest absolute difference between the two distribution functions across all x values.

K-S tests, by the models that estimated the ω values

| Comparison | D (M0) | P-value (M0) | D (M0, M1a, M2a) | P-value (M0, M1a, M2a) | D (M0, M7, M8) | P-value (M0, M7, M8) |
|---|---|---|---|---|---|---|
| Busco-Dig. | 0.17 | <0.01 | 0.13 | <0.01 | 0.09 | <0.01 |
| Busco-FRPs | 0.25 | <0.01 | 0.18 | 0.07 | 0.03 | 1 |
| Busco-SFPs | 0.27 | <0.01 | 0.17 | 0.02 | 0.12 | <0.01 |
| Dig.-FRPs | 0.39 | <0.01 | 0.31 | <0.01 | 0.11 | 0.04 |
| Dig.-SFPs | 0.12 | 0.41 | 0.07 | 0.81 | 0.04 | 0.67 |
| FRPs-SFPs | 0.43 | <0.01 | 0.29 | 0.01 | 0.14 | 0.03 |

Figure 3.3. Summary of number of proteins per type that were found to show signs of selection, were annotated and were annotated with Gene Ontology terms.

Figure 3.4. Summary of the number of times that a Gene Ontology term was found. Data for proteins found to be selected are presented per protein type. Only GO terms that were found more than once are shown, in order to present the most frequent ones and avoid a long list.

To detect correlated evolution between different types of proteins, omega values from the free-ratios model were correlated with each other. The free-ratios model estimates an omega value for each branch of the phylogeny. In co-evolving proteins that interact with each other, these omega values are expected to be positively correlated, so that their evolutionary rates are "coordinated" enough to maintain the interaction across the phylogeny.

A total of 530 proteins had the logarithm of their omega values per branch correlated with each other. The logarithm of omega was used in order to normalise the distributions and to allow for estimates of Pearson’s r that were not biased towards negative values, as proteins that do not interact are expected to produce correlations centred around r = 0. Any interaction would then move the distribution of Pearson’s correlations towards positive values, producing distributions with shapes similar to those in Figure 3.5A.

Figure 3.5 summarises the distributions of Pearson’s r per type of protein comparison. In A, density plots show the shapes of the distributions, with dotted lines indicating the position of the medians and the numbers in the plots showing their values. In B, the results of Kruskal-Wallis tests are summarised. First, Pearson’s r differs among the different types of comparisons ($p = 1.1 \times 10^{-9}$). Second, the significance of pairwise comparisons is shown with brackets. Finally, the significance of the difference between the mean of each distribution and the total mean of Pearson’s r (dashed line) is shown above each boxplot.

To see which of the 8464 pairwise SFP-FRP correlations are of importance, the false discovery rate correction for multiple testing was applied to the p-values of the Pearson’s r correlations. Significant correlations are presented in Table 3.10. One SFP has branch omega values that correlate strongly with those of 3 different FRPs, so 3 SFPs and 5 FRPs were identified as showing a pattern of correlated evolution.
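The correlation step for a single pair of proteins can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code of this work (which used the corrr package in R): the function names and the omega values are hypothetical, and all omegas are assumed to be strictly positive so that their logarithm is defined.

```python
from math import log, sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def branch_omega_correlation(omegas_a, omegas_b):
    """Pearson's r between the log-transformed branch omegas of two proteins."""
    return pearson_r([log(w) for w in omegas_a], [log(w) for w in omegas_b])


# Hypothetical branch omegas, one value per branch, in the same branch order.
sfp = [0.10, 0.40, 0.25, 0.80, 0.15]
frp = [0.05, 0.22, 0.12, 0.45, 0.09]
print(round(branch_omega_correlation(sfp, frp), 3))
```

With only a handful of branches per tree, each coefficient rests on very few data points, which is one reason why a correction for multiple testing across all pairs matters.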

Figure 3.5. A. The distribution of Pearson’s r coefficients between estimates of branch omegas from the free-ratios model. The correlations produced when comparing the branch omegas of SFPs with those of FRPs are on average greater than those of any other comparison between protein types. The value of the median of each distribution is indicated with a number and its position with a dotted line. B. Summary of Kruskal-Wallis test results. Pearson’s r differs between types of comparisons. The p-values of pairwise comparisons are shown with brackets and, below them, the p-values of each distribution compared with the total.

Table 3.10: Proteins producing high and significant correlations of branch omega values after a false discovery rate correction for multiple testing. The codes for the transcripts of each protein are given.

Correlating pairs of SFPs and FRPs

| SFP | FRP | Pearson’s r |
|---|---|---|
| CALMACT00000035026 | CALMACT00000004474 | 0.998 |
| CALMACT00000016355 | CALMACT00000031120 | 0.998 |
| CALMACT00000024208 | CALMACT00000010734 | 0.999 |
| CALMACT00000024208 | CALMACT00000014632 | 0.999 |
| CALMACT00000024208 | CALMACT00000022367 | 1.00 |

3.4 Tajima’s D estimates and McDonald-Kreitman tests

Population data for C. maculatus from three areas, Brazil, California and Yemen, were provided by Sayadi et al., 2019 [28]. Estimates of Tajima’s D for non-synonymous polymorphisms for each coding sequence are presented by population and by protein type in Figure 3.6. Kruskal-Wallis rank sum tests did not detect differences in median Tajima’s D among protein types (Kruskal-Wallis chi-squared = 6.3898, df = 3, p-value = 0.09411) or among populations (Kruskal-Wallis chi-squared = 3.9396, df = 2, p-value = 0.1395).

Additionally, pN and pS estimates for each coding sequence were provided. In combination with estimates of dN and dS from the M0 model of the codeML analysis, McDonald-Kreitman tests were performed by calculating the neutrality index NI (defined in Section 1.5.2) for each coding sequence. Estimates of NI by population and by protein type are presented in Figure 3.7. Kruskal-Wallis rank sum tests did not detect differences in median NI among populations (Kruskal-Wallis chi-squared = 0.29853, df = 2, p-value = 0.8613), but found differences in medians between types of proteins (Kruskal-Wallis chi-squared = 12.425, df = 3, p-value = 0.006061). To identify which pairs of protein types differed significantly in NI, I performed a Dunn’s test with a false discovery rate correction. There is no significant difference between the Busco proteins, the FRPs and the SFPs, but the digestive enzymes differ significantly from all other protein types and have greater NI values on average. The Dunn’s test output is summarised in Table 3.11.

Figure 3.6. Estimates of Tajima’s D for non-synonymous polymorphisms by population and protein type. Violin plots show the shape of the distributions and boxplots within them indicate the median and the quartiles. Vertical dashed lines indicate the Tajima’s D value for neutrality.

Figure 3.7. Estimates of neutrality indices (NIs) by population and protein type. Violin plots show the shape of the distributions and boxplots within them indicate the median and the quartiles. Vertical dashed lines indicate the NI value for neutrality.

Table 3.11: Dunn’s test for group similarity. Shown are the groups being compared, the number of observations for each group (n1 and n2), the test statistic, the p-value and the q-value, i.e. the p-value corrected for multiple testing with the false discovery rate method. Busco proteins, FRPs and SFPs do not have different NIs. Digestive proteins have significantly larger NIs than all other protein types.

Dunn’s test for pairwise comparison of group NIs

Group 1    Group 2    n1    n2   Statistic  p-value  q-value
Busco      Digestive  1644  801  2.72       0.00659  0.0357
Busco      FRP        1644  116  -1.14      0.255    0.306
Busco      SFP        1644  162  -1.21      0.226    0.306
Digestive  FRP        801   116  -2.28      0.0226   0.0453
Digestive  SFP        801   162  -2.52      0.0119   0.0357
FRP        SFP        116   162  0.0806     0.936    0.936
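The relation between the p-values and q-values of Table 3.11 can be checked with a short sketch. It assumes that the false discovery rate correction is the Benjamini-Hochberg procedure; the function name is illustrative. Python is used for illustration and is not the software used in this work.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from largest to smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# p-values from Table 3.11, in table order.
p = [0.00659, 0.255, 0.226, 0.0226, 0.0119, 0.936]
q = benjamini_hochberg(p)
print([round(v, 4) for v in q])
```

Applied to the rounded p-values of the table, this reproduces the listed q-values up to rounding, so the three comparisons involving digestive proteins are the only ones below 0.05.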

3.5 Linear models on omega values

As mentioned in Section 1.6, many factors may influence the evolutionary rates of proteins. To determine whether such factors affect the omega values of the proteins examined, linear models with the omega values of the M0 model as the response variable were created, and models with different predictor variables were compared. The factors examined in the linear models were the type of the proteins, the expression levels of the proteins in C. maculatus (in fragments per kilobase of transcript per million mapped reads, FPKM) and the sex bias in expression when there was one (LOGFC), again in C. maculatus. To avoid covariance of the FPKM:LOGFC interaction with FPKM and LOGFC, these terms were centered by subtracting their mean and then scaled by dividing by their standard deviation. The distributions of FPKM values per protein type and of sex-biased expression per protein type, before and after centering and scaling, are presented in Figure 3.8 and Figure 3.9 respectively. Notice in Figure 3.9 that the FRPs are not as sex-biased in expression as the SFPs. This is because FRPs are proteins whose expression levels change in females after mating, while SFPs are proteins that are expressed in the seminal fluid and the male reproductive tract.

A linear model was created that included the protein type, the scaled logarithm of 1+FPKM, the scaled LOGFC and all the two-way interactions. To improve the fit and satisfy the assumptions of linear regression, the Box-Cox procedure was used to identify the best power transformation for the omega values, using the boxcox function of the MASS package in R. Lambda was found to be λ = 0.3030. Figures 3.10 and 3.11 show the diagnostic plots

Figure 3.8. Boxplot summarising the distribution of expression levels of proteins per type of protein, before and after centering and scaling. FRPs are on average more highly expressed compared to the rest of the types, followed next by the SFPs. Conserved Busco genes are on average expressed the least.

Figure 3.9. Boxplot summarising the expression bias in sex per gene type, before and after centering and scaling. Negative values indicate male-biased expression, positive values indicate female-biased expression and values around zero indicate no bias. As expected, the SFPs are male-biased. FRPs are not strongly female-biased.
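The reason for centering and scaling the predictors before forming the FPKM:LOGFC interaction can be illustrated with a minimal sketch. The values below are hypothetical stand-ins for log(1 + FPKM) and LOGFC, and the helper names are illustrative. Python is used for illustration and is not the software used in this work.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def standardise(values):
    """Centre on the mean and scale by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical predictor values (stand-ins for log(1 + FPKM) and LOGFC).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

# Correlation between a main effect and the interaction term, before and after.
r_raw = pearson(x, [a * b for a, b in zip(x, y)])
zx, zy = standardise(x), standardise(y)
r_scaled = pearson(zx, [a * b for a, b in zip(zx, zy)])
print(round(r_raw, 2), round(r_scaled, 2))
```

In this toy example the product of the raw predictors is almost collinear with the main effect, while the product of the standardised predictors is only weakly correlated with it.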

Table 3.12: Analysis of variance for the linear model of omega. All predictors of the model are significant, except for LOGFC.

Analysis of Variance Table for Linear Model

Term        Sum Sq. (Type III)  Df   F-value  Pr(>F)
FPKM        0.078               1    8.110    <0.001
LOGFC       0.013               1    1.343    0.247
Type        0.341               3    11.785   <0.001
FPKM:LOGFC  0.218               1    22.615   <0.001
FPKM:Type   0.103               3    3.558    0.001
LOGFC:Type  0.189               3    6.526    <0.001
Residuals   6.977               723

for the model before and after the transformation. Tables 3.12 and 3.13 give a summary of the model. The model is described by the equation:

\[
\begin{aligned}
\omega^{\lambda} = {} & \beta_0 + \beta_1 x_{\mathrm{FPKM}} + \beta_2 x_{\mathrm{LOGFC}} + \beta_3 t_{\mathrm{Digestive}} + \beta_4 t_{\mathrm{FRP}} + \beta_5 t_{\mathrm{SFP}} + \beta_6 x_{\mathrm{FPKM}} x_{\mathrm{LOGFC}} \\
& + \beta_7 x_{\mathrm{FPKM}} t_{\mathrm{Digestive}} + \beta_8 x_{\mathrm{FPKM}} t_{\mathrm{FRP}} + \beta_9 x_{\mathrm{FPKM}} t_{\mathrm{SFP}} \\
& + \beta_{10} x_{\mathrm{LOGFC}} t_{\mathrm{Digestive}} + \beta_{11} x_{\mathrm{LOGFC}} t_{\mathrm{FRP}} + \beta_{12} x_{\mathrm{LOGFC}} t_{\mathrm{SFP}}
\end{aligned}
\]

where tDigestive, tFRP and tSFP are dummy variables, i.e. they are equal to 0 or 1.

The model is statistically significant and explains 18% of the variance of omega values. Generally, high expression levels are linked with low omega values. Sex-biased expression does not seem to affect omegas. Digestive enzymes are linked with increased omega values, while the FRP and SFP types do not seem to be, though the SFP estimate is as large as the digestive one and has a fairly small p-value (0.106). Expression levels and sex-biased expression have a significant interaction, which is explored in Figure 3.13. Protein type also interacts with LOGFC, meaning that the effect of sex-biased expression on omegas differs among protein types. The same is true for the effect of expression levels on omegas. These interactions are explored in Figures 3.14, 3.15 and 3.16.
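How the fitted equation translates into omega values can be sketched with the rounded estimates reported in Table 3.13. The sketch makes three assumptions. The response is ω raised to the power λ, as in the equation above. Busco is the reference protein type, because it has no dummy variable. Only the terms that do not involve protein type are used, so the numbers apply to the reference type only. The function names are illustrative, and Python is not the software used in this work.

```python
LAMBDA = 0.3030  # power found by the Box-Cox procedure

# Rounded estimates from Table 3.13 (terms not involving protein type).
B0, B_FPKM, B_LOGFC, B_INTERACTION = 0.392, -0.017, -0.009, -0.02


def fpkm_slope(logfc_z):
    """Slope of omega**lambda on scaled FPKM at a given scaled LOGFC."""
    return B_FPKM + B_INTERACTION * logfc_z


def predict_omega(fpkm_z, logfc_z):
    """Predicted omega for the reference type, back-transformed from omega**lambda."""
    transformed = (B0 + B_FPKM * fpkm_z + B_LOGFC * logfc_z
                   + B_INTERACTION * fpkm_z * logfc_z)
    return transformed ** (1.0 / LAMBDA)


for z in (-2.0, 0.0, 2.0):  # male-biased, unbiased, female-biased
    print(z, round(fpkm_slope(z), 3))
print(round(predict_omega(0.0, 0.0), 3))
```

With these rounded coefficients the slope of expression level is positive for strongly male-biased genes and increasingly negative for female-biased genes, which is the direction of the FPKM:LOGFC interaction described in the text.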

Figure 3.10. Diagnostic plots before the Box-Cox transformation.

Figure 3.11. Diagnostic plots after the Box-Cox transformation. The linear relationship between the response and the predictor variables is improved, residuals are more normally distributed, heteroscedasticity is reduced and observations that previously influenced the regression no longer do so.

Table 3.13: Summary of covariates of the linear model in Table 3.12. The coefficient estimates for all continuous predictors and their significance are summarised. About 18% of the variance of omega values can be explained by the model.

Summary Table of Covariates of Linear Model

Term        Estimate  Std Error  t-statistic  P-value
Intercept   0.392     0.005      76.451       <0.001
FPKM        -0.017    0.006      -2.848       0.005
LOGFC       -0.009    0.008      -1.159       0.247
FPKM:LOGFC  -0.02     0.004      -4.755       <0.001

Statistics of model

Residual σ  Residual df  R²    R² adj
0.0982      723          0.20  0.18

F-statistic  df  P-value
14.82        12  4.95 · 10^-28

Figure 3.12. Estimating the most likely value of lambda for power transformation of omega values.

Figure 3.13. Sex-biased expression influences whether expression levels will affect the omega values. For slightly male-biased genes (LOGFC = -0.88), the effect of expression levels on omega values will not be modified. For more male-biased genes the effect of expression levels on omegas will become more positive, increasing omega values, while for female-biased genes the effect of expression levels on omegas will become more negative, resulting in smaller omegas.

Figure 3.14. Interaction between protein type and expression levels. The effect of expression levels on omega values changes depending on the protein type. The greatest negative effect of expression level is for the FRPs. For the SFPs, the effect of expression levels on omegas seems to be reversed, as the slope of the regression line is slightly positive.

Figure 3.15. Interaction between protein type and sex-biased expression. It seems that more male-biased SFPs and digestive enzymes will have greater omegas than proteins of the same type which are less male-biased. The opposite seems to be true for the FRPs and the Busco genes, as with greater female-biased expression, omega values seem to increase.

Figure 3.16. Regression model with all three predictor variables: protein type, expression levels and sex-biased expression. The slopes and intercepts of the regression lines of omega values on expression levels change depending on the expression bias of the genes and the protein type.

4. Discussion

The protein sets studied here show interesting patterns of selection, and for good reason. The digestive proteins are the main agents for detoxifying the great variety of defensive secondary metabolites that the host plants produce. Without properly functioning digestive proteins, beetle larvae will be unable to survive the chemical stress and digest the contents of the beans, resulting in a great reduction of survival at early life stages. This implies that patterns of strong purifying selection should be found. At the same time, many host shift events can be inferred from the phylogenetic tree of the Bruchinae subfamily [5]. For these events to happen, changes in the digestive proteins would have to occur for two reasons: first, to detoxify novel defensive compounds of new hosts and, second, to detoxify compounds that were not abundant in the ancestral host but are abundant in the new host and thus have a greater effect at stopping the growth of the beetles. Because of this, patterns of episodic positive selection are expected in the branches where host shifts took place. Such changes due to positive selection in certain branches would move the distribution of ω values for sites towards greater values than under purifying selection alone. This is what one can see in Figure 3.2, where three different assumptions of how sites evolve are taken into account and, in all scenarios, the digestive enzymes have a significantly greater median than the Busco proteins, which should be mostly under purifying selection.

The SFPs and the FRPs are expected to be involved in sexual conflict that results in sexually antagonistic co-evolution. SFPs have been shown to manipulate many aspects of female fitness in order to increase male fitness [36], so these effects are expected to be countered by the FRPs.
The data show some patterns that are in agreement with the sexually antagonistic co-evolution scenario, where the changes in SFPs "try to keep up" with the changes in FRPs. First, the SFPs have significantly greater ω values than the FRPs. If the FRPs can negate the effects of the SFPs by avoiding interaction with them, then all it takes is a few mutations in key regions of the FRPs to change their tertiary structure and stop the interactions. For the SFPs to maintain the interactions, many more mutations would be required to adjust their tertiary structure accordingly. This means that SFPs should be evolving faster than the FRPs, which is the case in Figure 3.2, where in all scenarios of site evolution SFPs have significantly greater ω values than the FRPs. Second, the correlation of ω values for branches was largest between the SFPs and FRPs, as shown in Figure 3.5. This suggests that the evolutionary rates of the proteins are correlated

because of interactions that are maintained across the phylogeny of the species studied, which is in agreement with the sexually antagonistic co-evolution scenario. An alternative explanation is that the SFPs evolve fast because they exhibit greater sex-specific expression than the FRPs, as shown in Figure 3.9. Sex-specific expression is predicted to increase gene diversity by reducing the effectiveness of positive and negative selection [37] and could explain the increased ω values compared to the FRPs, but it cannot account for the high correlation of branch omega values between the SFPs and FRPs.

4.1 Proteins under positive selection

To identify which proteins are under positive selection, I tried to combine the information from the codeML modeling of omega values, the McDonald-Kreitman tests and the Tajima’s D values for non-synonymous polymorphisms. I first searched for proteins that pass a likelihood ratio test (LRT) for some codeML model of positive selection (site and branch-site models) and then filtered for proteins that have a neutrality index significantly lower than 1 according to Fisher’s exact test and a Tajima’s D value lower than -2 for at least one population of C. maculatus.

Only one digestive protein (gene code: CALMACG00000013581, transcript code: CALMACT00000022427) met all those criteria; it is annotated by similarity as an E3 ubiquitin-protein ligase, UBR3. It passed the branch-site LRT with Acanthoscelides obtectus as the foreground branch, meaning that sites with ω > 1 have been identified in that species, while in the other species ω ≤ 1. UBR3 has a ligase activity that allows ubiquitylation of lysine residues of a target protein. Depending on the ubiquitylation, this target will be marked for degradation by the proteasome or will interact with proteins involved in membrane and endocytic trafficking, inflammatory response, protein translation and DNA repair [74]. According to Gene Ontology, other biological processes associated with UBR3 are embryo development ending in birth or egg hatching, sensory perception of smell and suckling behaviour of mammals. As it has a wide variety of functions and effects, modifications of critical residues of this protein in A. obtectus could allow for changes in the trafficking of digestive proteins in the cell membrane or the digestive tract, which in turn could allow for detoxification of chemical compounds that were rarely found in a previous host, thus allowing for the digestion of a new type of bean.
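The likelihood ratio test used for the first filtering step can be sketched as follows. The sketch assumes 2 degrees of freedom, which is the usual convention for the site-model comparisons M1a versus M2a and M7 versus M8; the branch-site test uses a different null distribution and is not covered. The log-likelihood values and the function name are hypothetical. Python is used for illustration and is not the software used in this work.

```python
from math import exp


def lrt_df2(lnl_null, lnl_alt):
    """Return (statistic, p-value) of an LRT with 2 degrees of freedom.

    For a chi-square distribution with df = 2 the survival function has
    the closed form exp(-x / 2), so no statistics library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, exp(-stat / 2.0)


# Hypothetical log-likelihoods for one orthogroup (not from the data).
stat, p = lrt_df2(lnl_null=-2314.7, lnl_alt=-2308.2)
print(round(stat, 2), round(p, 4))
```

A protein would pass this step when the p-value falls below the chosen significance level, after whatever multiple-testing correction is applied.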
As CALMACG00000013581 is a promising but single candidate, I decided to relax the filtering criteria for proteins under selection. This meant that I would take into account only one or two of the three selection tests. Also, the tests seem to largely disagree on which proteins are under selection. For instance, the codeML modeling of omega values found 92 digestive

proteins, 9 FRPs and 26 SFPs to pass an LRT for site and branch-site models. The MK tests found 65 digestive proteins, 6 FRPs and 18 SFPs under selection. Proteins identified by both tests as being under selection were 26 digestive proteins and 8 SFPs, while no FRPs were found to be so. When doing the same for Tajima’s D values below -2 and the codeML LRTs, only one additional digestive protein fulfilled both criteria, the first being CALMACG00000013581. Also, because population data were available only for C. maculatus, non-synonymous and synonymous substitutions within species are assumed to be the same as in C. maculatus for all species, which could lead to biases in the MK tests. For these reasons, I consider the codeML modeling of omega values the most robust of the three tests of selection, and it is the one that provides the most promising candidate genes under selection: 92 digestive proteins, 9 FRPs and 26 SFPs.

These proteins were annotated by InterProScan v.5.51-85.0 [73] with information from the SUPERFAMILY (1.75), ProSiteProfiles (2019_11), ProSite Patterns (2019_11), Pfam (33.1) and Gene Ontology databases. Of the 92 digestive proteins, 9 FRPs and 26 SFPs, 29 digestive proteins, 1 FRP and 6 SFPs had no match, suggesting that they are potentially novel proteins not encountered before. Gene Ontology terms were recovered for some of the annotated proteins (see Figure 3.3 for the number of proteins annotated). Among the different functions identified to be related to these proteins (summarised in Figure 3.4), the common ones were Proteolysis, Protein binding, Oxidoreductase activity, Hydrolase activity and Carbohydrate metabolic processes. Specifically, four FRPs were related to proteolysis and oxidoreductase activity.
These could be involved in activating signal transduction pathways by modifying and activating proteins at the beginning of such pathways, which in turn control aspects of female fitness such as remating, sperm utilisation, egg production, food intake and activity. The SFPs were also related to proteolysis and oxidoreductase activity and could thus disrupt the functionality of the FRPs by modifying them. Finally, many digestive proteins were related to protein binding, carbohydrate metabolic processes and oxidoreductase activity, with the last function possibly contributing to the detoxification of the secondary metabolites that plant hosts produce for defence.
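The way candidate lists shrink or grow with the number of criteria required can be sketched with simple set logic. The protein identifiers and test outcomes below are made up for illustration, and the function name is hypothetical. Python is used for illustration and is not the software used in this work.

```python
def candidates(records, require):
    """Return IDs of proteins that satisfy every criterion named in `require`."""
    return {pid for pid, flags in records.items()
            if all(flags[name] for name in require)}


# Hypothetical per-protein outcomes of the three tests (IDs are made up).
records = {
    "protA": {"lrt": True,  "mk": True,  "tajima": True},
    "protB": {"lrt": True,  "mk": True,  "tajima": False},
    "protC": {"lrt": True,  "mk": False, "tajima": False},
    "protD": {"lrt": False, "mk": True,  "tajima": False},
}

strict = candidates(records, ["lrt", "mk", "tajima"])
lrt_and_mk = candidates(records, ["lrt", "mk"])
lrt_only = candidates(records, ["lrt"])
print(sorted(strict), sorted(lrt_and_mk), sorted(lrt_only))
```

Requiring all three criteria leaves the smallest set, while relaxing to one or two criteria recovers more candidates at the cost of weaker support for each.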

4.2 Distributions of omegas for sites

When considering the omega values of the M0 model alone and looking at Figure 3.2 and Table 3.9 for the medians of the distributions and the p-values of the K-S tests, some inferences can be made. On average, the FRPs have the lowest omega values, the conserved Busco proteins have greater omega values than the FRPs, the SFPs and digestive proteins have higher omega values than the Busco proteins, and the SFPs and digestive proteins have about the same omega values. These patterns for the SFPs and digestive proteins are in agreement with the expectation of positive selection acting on them, thus producing higher overall ω values.

When I allow for some proteins to also have some neutral and positively selected sites, i.e. when omegas are taken from the M0, M1a and M2a models where they fit better, the pattern described in the previous paragraph changes for the FRPs and Busco proteins, as they now have about the same omega values. If, instead of the M0 model alone, I allow for certain proteins to have a finer classification of sites by including more omega classes, i.e. eight classes of omega values without positive selection and eight plus one with positive selection, the FRPs again have about the same distribution of omegas as the Busco proteins, with the medians of the distributions even closer this time.

By comparing the three ways to describe the evolution of sites, i.e. with the M0 model only, with the M0, M1a and M2a models or with the M0, M7 and M8 models, I can tell that the digestive proteins and the SFPs have about the same distribution of omegas for sites and that this distribution is shifted towards higher values than that of the Busco proteins, meaning that negative selection acts less and positive selection more on digestive proteins and SFPs than on Busco proteins. Note that the Busco proteins provide a good estimate of strong purifying selection, as they come from single-copy genes of C. maculatus that are conserved across the phylum Arthropoda. For the FRPs, I can infer that they have about the same distribution of omega values as the Busco proteins, and thus experience the same strength of negative selection, as the medians of the two distributions become less different with more complex modeling of omega values, for example when eight and nine classes of sites are included. This means that the FRPs are actually conserved genes, and for them to have an effect on mating as well as evolutionary rates comparable to those of the Busco proteins implies that they should be pleiotropic.
This follows, as a gene with many functions should be more constrained and resistant to change than a gene with only one function, because a change that has a positive effect on one function is likely to have negative effects on the rest. This is also in agreement with the sex-specific expression of the FRPs summarised in Figure 3.9, which is not biased towards positive LOGFC values (female-biased expression) but is instead centred around zero, meaning equal expression in females and males.

It is important to note that when transitioning from the M0 model alone to the M0, M1a and M2a models, or from the M0 model alone to the M0, M7 and M8 models, the positively selected sites are so few in number that removing them would not make any noticeable difference to the shape of any distribution. This suggests that the observed differences in the shapes of the distributions across the three sets of models are mainly due to differences in the strength of purifying selection. This of course does not mean that positive selection is not important for driving the evolution of SFPs, FRPs and digestive proteins. It rather means that the overall distribution of sites is driven by purifying selection and that

differences in shape can provide information about the strength of purifying selection. The importance of positive selection is more difficult to highlight, as for each protein found here to exhibit positive selection, experimental data would have to be available for the sites found to be selected. Such data could come, for example, from mutating a protein and evaluating its function, or from comparing function between the inferred ancestral molecules and the one suspected of being positively selected.

4.3 Linear model of ω values

The linear model in this work predicts the ω values of the M0 model, a product of the analysis of the sequences of all four beetle species, based only on expression data from C. maculatus. This bias implies that the linear model may not be accurate enough and that expression data from the rest of the species should be included when available, but the relationships identified here between ω values and expression levels, sex-biased expression and protein type should hold. The model explains about 18% of the variation of ω values and I believe that if expression data from the rest of the species are taken into account, the explanatory power of the model will increase.

From Tables 3.12 and 3.13 we see that as expression levels (FPKM) increase, ω values become lower. This negative correlation is well known as the E-R anticorrelation (E for expression and R for rate of evolution) and has been identified in both prokaryotes and eukaryotes [75]. It can be explained by the protein misfolding avoidance hypothesis: misfolded proteins are cytotoxic and reduce fitness, so genes with higher expression are under stronger selective pressure to avoid protein misfolding, which in our case can be caused by amino acid substitutions. Another scenario is the protein misinteraction avoidance hypothesis, where proteins that are highly expressed have larger concentrations in the cells and are more likely to have deleterious protein-protein interactions with no physiological function. Thus sites of such proteins will tend to evolve slower to avoid misinteractions. The mRNA folding requirement hypothesis explains lower substitution rates by assuming that mRNAs of highly expressed genes are selected to have stronger RNA folds, so mutations disrupting this fold will be deleterious and thus the proteins will have reduced substitution rates.
The last hypothesis explaining the E-R anticorrelation is the expression cost hypothesis, where the costs and benefits of producing a protein increase with the abundance of the protein. Thus, when amino acids are replaced in proteins with high expression and high abundance, the resulting costs will be greater than in a less abundant protein, resulting in the E-R anticorrelation [75].

From the summary tables of the linear model one can see that sex-biased expression (LOGFC) does not affect the ω values, but interestingly the FPKM:LOGFC interaction is significant. This means that sex-biased expression seems

to affect the intensity of the E-R anticorrelation, which is clear in Figures 3.13 and 3.16. The more male-biased a protein is, the less intense the E-R anticorrelation, while the more female-biased it is, the more pronounced the anticorrelation. My guess is that this holds because most of the male-biased genes are SFPs and most of the female-biased genes are FRPs and Busco genes, as shown in Figure 3.9. The SFPs are also expected to be secreted outside of the cells, into the seminal fluid, while the majority of FRPs and Busco proteins are probably found inside the cells. This means that the SFPs should have less deleterious effects than the Busco proteins and FRPs according to the protein misinteraction avoidance hypothesis. This is also corroborated by Figure 3.14, where the E-R anticorrelation is found in the FRPs, digestive and Busco proteins, while it is not found in the SFPs. This suggests that SFPs with higher expression have fewer evolutionary constraints than other protein types at the same expression level. Additionally, negative effects according to the protein misfolding avoidance hypothesis could be less pronounced in the SFPs, as misfolding could allow for new interactions with female proteins and thus opportunities for males to increase their fitness by manipulating female physiology. This also means that the costs of producing SFPs at high levels are smaller compared to producing other proteins at high levels, as they potentially provide some advantage in fitness.

4.4 Correlations of branch omegas and protein coevolution

The SFP-FRP pairs of Table 3.10 that show a significant correlation between their branch omega values are the best candidates that I have for coevolution. When looking at the annotations of these proteins, I see that CALMACT00000024208 encodes an SFP that is a Phosphoglycerate mutase-like protein of the Histidine phosphatase superfamily, so it is similar to an enzyme that catalyses a step of glycolysis, and that it putatively interacts with CALMACT00000010734, which has an Ig-like, a Fibronectin type-III and a Neurofascin/L1/NrCAM C-terminal domain.

Ig-like domains are involved in binding functions. Ligands range from small molecules (antigens, chromophores) and hormones (growth hormone, interferons, prolactin) up to giant molecules (muscle proteins). Proteins that have Ig-like domains can be anything from immunoglobulins and interleukins to myelins and galactose oxidases.

Fibronectin type-III domains can be found in extracellular-matrix molecules, cell-surface receptors, enzymes and muscle proteins [76]. They have a short stretch of amino acids with the sequence Arg-Gly-Asp (RGD) that interacts with integrins and can modulate cell adhesion associated with thrombosis, inflammation and tumor metastasis [77].

The Neurofascin/L1/NrCAM C-terminal domain (IPR026966) has a highly conserved FIGEY motif. The domain’s function is not known; it is essentially the C-terminal region of several nervous system cell adhesion molecules, such as neuroglian, neurofascin, neural adhesion molecule L1, NrCAM and NgCAM [78]. This means that this specific FRP could be one of these molecules.

At this point I can only guess what the nature of the possible interaction between something like a glycolysis enzyme and a nervous system cell adhesion molecule would be. The Phosphoglycerate mutase-like SFP might not participate in glycolysis, but in some other process that involves interacting with the FRP. The neuron-related FRP could promote signal transduction in pathways that affect female physiology. The male would then gain a reproductive advantage by altering its function, for example by deactivating the FRP through the removal of a phosphate group.

A point to question about the correlations between these proteins is how meaningful they are. Gene annotations are not perfect and provide only a rough idea of the real function of these proteins, which have not been experimentally characterized. They are also incomplete, and without experimental data one can only suspect the proteins’ functions. Additionally, even existing annotations are accurate only up to a certain level. This means that it is necessary to identify protein function and verify any suspected interaction. Moreover, the presence of confounding factors could affect the correlations of branch ω values, producing false positives and/or false negatives. From the linear model created in this work, it is known that the expression level of a protein tends to influence the overall omega value of a gene across the species. It is reasonable to assume that the same is true for branch-specific omegas.
Thus the correlations produced could be partly driven by expression levels rather than reflecting only the coevolution of sites due to an interaction that needs to be maintained throughout the species tree.
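The kind of branch-wise correlation discussed in this section can be sketched as follows. The sketch assumes that each protein contributes one ω value per branch and that an unrooted tree of four species has five branches. The ω values are hypothetical and the helper name is illustrative. Python is used for illustration and is not the software used in this work.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical branch-specific omega values for one SFP and one FRP,
# listed for the same five branches in the same order.
sfp = [0.10, 0.35, 0.20, 0.80, 0.15]
frp = [0.05, 0.18, 0.11, 0.42, 0.07]

r = pearson(sfp, frp)
print(round(r, 3))
```

With so few branches, a single branch on which both proteins evolve fast is enough to produce a coefficient close to 1, which is one more reason to treat the highly correlated pairs as candidates rather than as confirmed interactions.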

4.5 Limitations of the study

There are certain factors limiting the power of this study, the first being its comparative nature. A comparative study requires that the object of study is found in all of the species being compared. So, the power of a comparative study depends on how effectively one can identify orthologues and orthogroups. This also relies on the nature of the target of interest. In this case, the SFPs and any coevolving FRPs are rapidly evolving genes [36], which means that there is a great chance that some orthologues have diverged so much between species that they are no longer grouped together into orthogroups by the algorithms. This seems to hold true for the SFPs and FRPs, as starting from 185 SFPs, 126 FRPs and 741 digestive proteins of C. maculatus, I managed to identify orthogroups involving all four species for 68, 55 and 337 of them, respectively. When considering orthogroups involving

three out of the four species, I found only a couple of new orthogroups of SFPs and some ten new orthogroups of digestive proteins. This would mean that the data produced here are representative of half or less of the proteins of interest.

All genes examined here are assumed to be passed independently from parents to offspring, which might not be the case for some of them. A favourable mutation that goes to fixation will also lead to the fixation of the variants that were originally associated with it [79], a phenomenon called genetic hitchhiking. The same is true for neutral alleles that are linked to alleles under negative selection, only this time they are purged and not fixed. These scenarios are cases of linked selection. This means that proteins found here to be positively selected could instead be associated with some other locus not examined here and thus have been fixed due to genetic hitchhiking. This association can only be broken by recombination, so data on the recombination rate of regions of the genomes could potentially expose such cases.

Another issue is the number of species included in the analysis. Here I included four species because I had genome assemblies and proteomic data of good quality for them. If this kind of data were available for more closely related species, then I suspect that the power of the codeML modeling of omega values would be greater than it is now. This should be especially true for the branch-site models, as the more branches are added, the more chances there are to find those in which positive selection has acted. Also, more branches in the analysis would mean more branch omega values with which to make correlations. So, this approach for identifying correlated evolution would also become more powerful.

Also, there is always the possibility of data incompleteness. The proteomes used here are of great quality, as they have been validated by expression and proteomic data [26, 28].
But it could be that there are proteins that were not identified by this process. For example, SFPs could be small peptides [36], which could be missed in large-scale analyses. Finally, a limitation is data availability. Little is known about the molecular mechanisms of how SFPs work and even less is known for the FRPs. This means that annotations are incomplete and that some are false. So, it is difficult to explain patterns of selection based on cellular and biological processes. Also, population and expression data for this work were available only for one of the four species, C. maculatus, so results based on those should be interpreted carefully.

5. Conclusions

To summarise, proteomes from four species of seed beetles and expression data of proteins in C. maculatus allowed for the identification of seminal fluid proteins (SFPs), female reproductive proteins (FRPs), digestive proteins and proteins conserved across the phylum Arthropoda (Busco) in all four species of seed beetles studied. These groups of orthologous genes, called orthogroups, were provided as input to the codeML program of the PAML package. This allowed for maximum likelihood estimation of the ω values under various models, namely site models, branch models and branch-site models. The site models, M0, M1a, M2a, M7 and M8, showed that FRPs evolve as slowly as, or even slower than, the conserved Busco proteins. This could mean that FRPs are conserved and are probably pleiotropic. Also, the site models showed that SFPs and digestive proteins experience reduced purifying selection compared with FRPs, making them evolve faster than FRPs. The site and branch-site models also detected cases of positive selection in 92 digestive proteins, 9 FRPs and 26 SFPs, of which 63, 8 and 20, respectively, have been annotated. Moreover, I tried to detect patterns of correlated evolution, which could point to cases of protein co-evolution due to interactions. By doing so, I found that SFPs and FRPs produced the highest correlations among protein types for branch ω values, which agrees with the expectation of SFPs and FRPs co-evolving. Eight pairs of SFPs and FRPs show significantly correlated ω values. To account for factors other than protein interaction that affect ω values, I created a linear model with expression levels, sex-biased expression and protein type as predictors. High expression levels were found to reduce the overall ω values of genes, a phenomenon called E-R anticorrelation. Sex-biased expression does not affect the ω values, but it does affect the E-R anticorrelation, with male-biased genes showing it less and female-biased genes showing it more.

References

[1] Brian D. Farrell and Andrea S. Sequeira. EVOLUTIONARY RATES IN THE ADAPTIVE RADIATION OF BEETLES ON PLANTS. Evolution, 58(9):1984–2001, September 2004.
[2] Jesús Gómez-Zurita, Toby Hunt, and Alfried P. Vogler. Multilocus ribosomal RNA phylogeny of the leaf beetles (Chrysomelidae). Cladistics, 24(1):34–50, February 2008.
[3] T. Hunt, J. Bergsten, Z. Levkanicova, A. Papadopoulou, O. St. John, R. Wild, P. M. Hammond, D. Ahrens, M. Balke, M. S. Caterino, J. Gomez-Zurita, I. Ribera, T. G. Barraclough, M. Bocakova, L. Bocak, and A. P. Vogler. A Comprehensive Phylogeny of Beetles Reveals the Evolutionary Origins of a Superradiation. Science, 318(5858):1913–1916, December 2007.
[4] M. Tuda, J. Rönn, S. Buranapanichpan, N. Wasano, and G. Arnqvist. Evolutionary diversification of the bean beetle genus Callosobruchus (Coleoptera: Bruchidae): traits associated with stored-product pest status: EVOLUTION OF CALLOSOBRUCHUS STORED-BEAN PESTS. Molecular Ecology, 15(12):3541–3551, October 2006.
[5] Gaël J Kergoat and Alex Delobel. Seed-beetles in the age of the molecule: recent advances on systematics and host-plant association patterns. page 32, 2008.
[6] Invasive Species Compendium, 2021. Wallingford, UK: CAB International.
[7] Ambayeba Muimba-Kankolongo. Leguminous Crops. In Food Crop Production by Smallholder Farmers in Southern Africa, pages 173–203. Elsevier, 2018.
[8] Ferede Negasi. Studies on the Economic Importance and Control of Bean Bruchids in Haricot Bean, Phaseolus Vulgaris L., in Eastern and Southern Shewa. PhD thesis, Alemaya University of Agriculture, Ethiopia, 1994.
[9] T Abate and J K O Ampofo. Pests of Beans in Africa: Their Ecology and Management. page 31, 1996.
[10] A.N. Afonin, S.L. Greene, N.I. Dzyubenko, and A.N Frolov. AgroAtlas - Pests - Acanthoscelides obtectus Say - , 2008. http://www.agroatlas.ru/en/content/pests/Acanthoscelides_obtectus/.
[11] Satu Paukku and Janne S. Kotiaho. Female Oviposition Decisions and Their Impact on Progeny Life-History Traits. Journal of Insect Behavior, 21(6):505–520, November 2008.
[12] Mazarin Akami, Hamada Chakira, Awawing A. Andongma, Kanjana Khaeso, Olajire A. Gbaye, Njintang Y. Nicolas, E.-N. Nukenine, and Chang-Ying Niu. Essential oil optimizes the susceptibility of Callosobruchus maculatus and enhances the nutritional qualities of stored cowpea Vigna unguiculata. Royal Society Open Science, 4(8):170692, August 2017.
[13] Antoine Sanon, Ilboudo Zakaria, Dabire-Binso Clémentine L, Ba Malick Niango, and Nébié Roger Charles Honora. Potential of Botanicals to Control Callosobruchus maculatus (Col.: Chrysomelidae, Bruchinae), a Major Pest of Stored Cowpeas in Burkina Faso: A Review. International Journal of Insect Science, 10:1179543318790260, 2018.
[14] Álvaro Rodríguez-González, Samuel Álvarez García, Óscar González-López, Franceli Da Silva, and Pedro A. Casquero. Insecticidal Properties of Ocimum basilicum and Cymbopogon winterianus against Acanthoscelides obtectus, Insect Pest of the Common Bean (Phaseolus vulgaris, L.). , 10(5), May 2019.
[15] Jelica Lazarević, Stojan Jevremović, Igor Kostić, Miroslav Kostić, Ana Vuleta, Sanja Manitašević Jovanović, and Darka Šešlija Jovanović. Toxic, Oviposition Deterrent and Oxidative Stress Effects of Thymus vulgaris Essential Oil against Acanthoscelides obtectus. Insects, 11(9), August 2020.
[16] Álvaro Rodríguez-González, Alejandra J. Porteous-Álvarez, Mario Del Val, Pedro A. Casquero, and Baltasar Escriche. Toxicity of five Cry proteins against the insect pest Acanthoscelides obtectus (Coleoptera: Chrisomelidae: Bruchinae). Journal of Invertebrate Pathology, 169:107295, January 2020.
[17] Nikola Tucić, Oliver Stojković, Ivana Gliksman, Agana Milanović, and Darka Šešlija. LABORATORY EVOLUTION OF LIFE-HISTORY TRAITS IN THE BEAN WEEVIL (ACANTHOSCELIDES OBTECTUS): THE EFFECTS OF DENSITY-DEPENDENT AND AGE-SPECIFIC SELECTION. Evolution; International Journal of Organic Evolution, 51(6):1896–1909, December 1997.
[18] Jelica Lazarević, Mirko Đorđević, Biljana Stojković, and Nikola Tucić. Resistance to prooxidant agent paraquat in the short- and long-lived lines of the seed beetle (Acanthoscelides obtectus). Biogerontology, 14(2):141–152, April 2013.
[19] Darka Šešlija Jovanović, Mirko Đorđević, Uroš Savković, and Jelica Lazarević. The effect of mitochondrial complex I inhibitor on longevity of short-lived and long-lived seed beetles and its mitonuclear hybrids. Biogerontology, 15(5):487–501, 2014.
[20] Uroš Savković, Mirko Đorđević, Darka Šešlija Jovanović, Jelica Lazarević, Nikola Tucić, and Biljana Stojković. Experimentally induced host-shift changes life-history strategy in a seed beetle. Journal of Evolutionary Biology, 29(4):837–847, April 2016.
[21] József Vuts, Christine M. Woodcock, Lisa König, Stephen J. Powers, John A. Pickett, Árpád Szentesi, and Michael A. Birkett. Host shift induces changes in mate choice of the seed predator Acanthoscelides obtectus via altered chemical signalling. PloS One, 13(11):e0206144, 2018.
[22] Uroš Savković, Mirko Đorđević, and Biljana Stojković. Potential for Acanthoscelides obtectus to Adapt to New Hosts Seen in Laboratory Selection Experiments. Insects, 10(6), May 2019.
[23] Aoife M. Leonard and Lesley T. Lancaster. Maladaptive plasticity facilitates evolution of thermal tolerance during an experimental range shift. BMC evolutionary biology, 20(1):47, April 2020.
[24] Elina Immonen, David Berger, Ahmed Sayadi, Johanna Liljestrand-Rönn, and Göran Arnqvist. An experimental test of temperature-dependent selection on mitochondrial haplotypes in Callosobruchus maculatus seed beetles. Ecology and Evolution, 10(20):11387–11398, October 2020.
[25] Dariusz Krzysztof Małek, Maciej Jan Dańko, and Marcin Czarnoleski. Does seed size mediate sex-specific reproduction costs in the Callosobruchus maculatus bean beetle? PloS One, 14(12):e0225967, 2019.
[26] Elina Immonen, Ahmed Sayadi, Helen Bayram, and Göran Arnqvist. Mating Changes Sexually Dimorphic Gene Expression in the Seed Beetle Callosobruchus maculatus. Genome Biology and Evolution, 9(3):677–699, March 2017.
[27] Liam R. Dougherty, Emile van Lieshout, Kathryn B. McNamara, Joe A. Moschilla, Göran Arnqvist, and Leigh W. Simmons. Sexual conflict and correlated evolution between male persistence and female resistance traits in the seed beetle Callosobruchus maculatus. Proceedings. Biological Sciences, 284(1855), May 2017.
[28] Ahmed Sayadi, Alvaro Martinez Barrio, Elina Immonen, Jacques Dainat, David Berger, Christian Tellgren-Roth, Björn Nystedt, and Göran Arnqvist. The genomic footprint of sexual conflict. Nature Ecology & Evolution, 3(12):1725–1730, December 2019.
[29] Antigone Kouris-Blazos and Regina Belski. Health benefits of legumes and pulses with a focus on Australian sweet lupins. Asia Pacific Journal of Clinical Nutrition, 25(1), January 2016.
[30] Özgür Çakir, Cüneyt Uçarli, Çağatay Tarhan, Murat Pekmez, and Neslihan Turgut-Kara. Nutritional and health benefits of legumes and their distinctive genomic properties. Food Science and Technology, 39(1):1–12, March 2019.
[31] M. Wink. Evolution of secondary metabolites in legumes (Fabaceae). South African Journal of Botany, 89:164–175, November 2013.
[32] Daniele Kunz, Gabriel B. Oliveira, Theo C. Brascher, Richard I. Samuels, Maria Lígia R. Macedo, Luiz F. de Souza, Alcir L. Dafré, and Carlos P. Silva. Phaseolin ingestion affects vesicular traffic causing oxidative stress in the midgut of Callosobruchus maculatus larvae. Comparative Biochemistry and Physiology Part B: Biochemistry and Molecular Biology, 228:34–40, February 2019.
[33] Helen Bayram, Ahmed Sayadi, Elina Immonen, and Göran Arnqvist. Identification of novel ejaculate proteins in a seed beetle and division of labour across male accessory reproductive glands. Insect Biochemistry and Molecular Biology, 104:50–57, January 2019.
[34] Göran Arnqvist and Locke Rowe. Sexual conflict. Monographs in behavior and ecology. Princeton University Press, Princeton, N.J, 2005. OCLC: ocm56129227.
[35] Nicholas B Davies, John R. Krebs, and Stuart A. West. An Introduction to Behavioural Ecology. Wiley-Blackwell, John Wiley & Sons, Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK, fourth edition, 2012.
[36] Laura K. Sirot, Alex Wong, Tracey Chapman, and Mariana F. Wolfner. Sexual Conflict and Seminal Fluid Proteins: A Dynamic Landscape of Sexual Interactions. Cold Spring Harbor Perspectives in Biology, 7(2):a017533, February 2015.
[37] Amy L. Dapper and Michael J. Wade. The evolution of sperm competition genes: The effect of mating system on levels of genetic variation within and between species. Evolution, 70(2):502–511, February 2016.
[38] Heriberto Rodríguez-Martínez, Ulrik Kvist, Jan Ernerudh, Libia Sanz, and Juan J Calvete. Seminal Plasma Proteins: What Role Do They Play? American Journal of Reproductive Immunology, page 12, 2011.
[39] Cedric Gillott. Male Accessory Gland Secretions: Modulators of Female Reproductive Physiology and Behavior. Annual Review of Entomology, 48(1):163–184, January 2003.
[40] Frank W. Avila, Laura K. Sirot, Brooke A. LaFlamme, C. Dustin Rubinstein, and Mariana F. Wolfner. Insect Seminal Fluid Proteins: Identification and Function. Annual Review of Entomology, 56(1):21–40, January 2011.
[41] Nilay Yapici, Young-Joon Kim, Carlos Ribeiro, and Barry J. Dickson. A receptor that mediates the post-mating switch in Drosophila reproductive behaviour. Nature, 451(7174):33–37, January 2008.
[42] Rosa Fernández, Toni Gabaldón, and Christophe Dessimoz. Orthology: definitions, inference, and impact on species phylogeny inference. page 17, 2019.
[43] David M. Emms and Steven Kelly. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology, 16(1):157, December 2015.
[44] David M. Emms and Steven Kelly. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology, 20(1):238, November 2019.
[45] Ziheng Yang. Computational Molecular Evolution. Oxford University Press, October 2006.
[46] W. J. Becktel and J. A. Schellman. Protein stability curves. Biopolymers, 26(11):1859–1877, November 1987.
[47] M. H. Hecht, J. M. Sturtevant, and R. T. Sauer. Effect of single amino acid replacements on the thermal stability of the NH2-terminal domain of phage lambda repressor. Proceedings of the National Academy of Sciences of the United States of America, 81(18):5685–5689, September 1984.
[48] Andrew A. Pakula and Robert T. Sauer. Amino acid substitutions that increase the thermal stability of the lambda Cro protein. Proteins: Structure, Function, and Bioinformatics, 5(3):202–210, 1989. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/prot.340050303.
[49] L. Ramdas, F. Sherman, and B. T. Nall. Guanidine hydrochloride induced equilibrium unfolding of mutant forms of iso-1-cytochrome c with replacement of proline-71. Biochemistry, 25(22):6952–6958, November 1986.
[50] John A. Schellman, Margaret Lindorfer, Richard Hawkes, and Markus Grutter. Mutations and protein stability. Biopolymers, 20(9):1989–1999, 1981. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/bip.1981.360200921.
[51] David Shortle and Alan K. Meeker. Mutant forms of staphylococcal nuclease with altered patterns of guanidine hydrochloride and urea denaturation. Proteins: Structure, Function, and Bioinformatics, 1(1):81–89, 1986. _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/prot.340010113.
[52] Andrew A Pakula and Robert T Sauer. Genetic analysis of protein stability and function. Annu. Rev. Genet., page 22, 1989.
[53] Austin L. Hughes and Masatoshi Nei. Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature, 335(6186):167–170, September 1988.
[54] A. L. Hughes, T. Ota, and M. Nei. Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex molecules. Molecular Biology and Evolution, 7(6):515–524, November 1990.
[55] Ziheng Yang and Rasmus Nielsen. Estimating Synonymous and Nonsynonymous Substitution Rates Under Realistic Evolutionary Models. Molecular Biology and Evolution, 17(1):32–43, January 2000.
[56] Z. Yang. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution, 24(8):1586–1591, April 2007.
[57] Ziheng Yang. PAML: Phylogenetic Analysis by Maximum Likelihood. 2017.
[58] John H. McDonald and Martin Kreitman. Adaptive protein evolution at the Adh locus in Drosophila. Nature, 351(6328):652–654, June 1991.
[59] John H. McDonald. Small numbers in chi-square and G–tests - Handbook of Biological Statistics, 2014. http://www.biostathandbook.com/small.html, This web page contains the content of pages 86-89 in the printed version.
[60] D. M. Rand and L. M. Kann. Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Molecular Biology and Evolution, 13(6):735–748, July 1996.
[61] J. Parsch, Z. Zhang, and J. F. Baines. The Influence of Demography and Weak Selection on the McDonald-Kreitman Test: An Empirical Study in Drosophila. Molecular Biology and Evolution, 26(3):691–698, December 2008.
[62] F Tajima. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics, 123(3):585–595, November 1989.
[63] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theoretical Population Biology, 7(2):256–276, 1975.
[64] Fumio Tajima. EVOLUTIONARY RELATIONSHIP OF DNA SEQUENCES IN FINITE POPULATIONS. Genetics, 105(2):437–460, October 1983.
[65] John N. Thompson. The Coevolutionary Process. University of Chicago Press, 1994.
[66] S. C. Lovell and D. L. Robertson. An Integrated View of Molecular Coevolution in Protein-Protein Interactions. Molecular Biology and Evolution, 27(11):2567–2575, November 2010.
[67] Kazutaka Katoh and Daron M. Standley. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution, 30(4):772–780, April 2013.
[68] Morgan N. Price, Paramvir S. Dehal, and Adam P. Arkin. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3):e9490, March 2010.
[69] Jaime Huerta-Cepas, François Serra, and Peer Bork. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Molecular Biology and Evolution, 33(6):1635–1638, June 2016.
[70] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020.
[71] Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Kara Woo, and Hiroaki Yutani. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, 2019.
[72] Max Kuhn, Simon Jackson, and Jorge Cimentada. corrr: Correlations in R, 2020. R package version 0.4.3.
[73] Matthias Blum, Hsin-Yu Chang, Sara Chuguransky, Tiago Grego, Swaathi Kandasaamy, Alex Mitchell, Gift Nuka, Typhaine Paysan-Lafosse, Matloob Qureshi, Shriya Raj, Lorna Richardson, Gustavo A Salazar, Lowri Williams, Peer Bork, Alan Bridge, Julian Gough, Daniel H Haft, Ivica Letunic, Aron Marchler-Bauer, Huaiyu Mi, Darren A Natale, Marco Necci, Christine A Orengo, Arun P Pandurangan, Catherine Rivoire, Christian J A Sigrist, Ian Sillitoe, Narmada Thanki, Paul D Thomas, Silvio C E Tosatto, Cathy H Wu, Alex Bateman, and Robert D Finn. The InterPro protein families and domains database: 20 years on. Nucleic Acids Research, 49(D1):D344–D354, January 2021.
[74] Manuel Miranda and Alexander Sorkin. Regulation of receptors and transporters by ubiquitination: new insights into surprisingly similar mechanisms. Molecular Interventions, 7(3):157–167, June 2007.
[75] Jianzhi Zhang and Jian-Rong Yang. Determinants of the rate of protein sequence evolution. Nature Reviews Genetics, 16(7):409–420, July 2015.
[76] D. J. Leahy, W. A. Hendrickson, I. Aukhil, and H. P. Erickson. Structure of a fibronectin type III domain from tenascin phased by MAD analysis of the selenomethionyl protein. Science (New York, N.Y.), 258(5084):987–991, November 1992.
[77] D. J. Leahy, I. Aukhil, and H. P. Erickson. 2.0 Å crystal structure of a four-domain segment of human fibronectin encompassing the RGD loop and synergy region. Cell, 84(1):155–164, January 1996.
[78] J. Q. Davis and V. Bennett. Ankyrin binding activity shared by the neurofascin/L1/NrCAM family of nervous system cell adhesion molecules. The Journal of Biological Chemistry, 269(44):27163–27166, November 1994.
[79] N. H. Barton. Genetic hitchhiking. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 355(1403):1553–1562, November 2000.
