Patterns of molecular evolution associated with repeatedly evolved traits

by

T. Fatima Mitterboeck

A Thesis Presented to The Faculty of Graduate Studies of The University of Guelph

In partial fulfillment of requirements for the degree of Doctor of Philosophy in Integrative Biology

Guelph, Ontario, Canada

© T. Fatima Mitterboeck, August 2016

ABSTRACT

PATTERNS OF MOLECULAR EVOLUTION ASSOCIATED WITH REPEATEDLY EVOLVED TRAITS

T. Fatima Mitterboeck
University of Guelph, 2016

Advisors: Sarah J. Adamowicz, Jinzhong Fu

Molecular evolutionary rates vary dramatically across the tree of life. Using a phylogenetic comparative approach, this thesis explores rates and patterns of molecular evolution associated with three major repeated evolutionary transitions that have shaped life: shifts between marine and freshwater environments in diverse lineages of eukaryotes, shifts between freshwater and terrestrial environments in insects, and shifts between flight ability and flightlessness in insects. These studies were novel in taxonomic scope and in the evolutionary transitions examined, as well as in assessing trends in both directions of transition. While rates of molecular evolution were here observed to be relatively equal among habitat categories, freshwater eukaryotes tended to have higher rates than marine or saline eukaryotes, and terrestrial insects tended to have higher rates than freshwater insects. In flightless insects, certain categories of genes more commonly exhibited signatures of positive or relaxed selection than observed in flying insects, and these trends mirrored those previously reported for other flying and secondarily flightless groups (birds and bats). Overall, the broad-scale trends observed in these studies support a degree of predictability in molecular evolution in association with biological and ecological traits of organisms.

Key words: evolutionary transitions, repeated evolution, flight, flight loss, habitat shifts, terrestrial, freshwater, marine, insects, eukaryotes, molecular evolutionary rates, positive selection, relaxed selection, molecular convergence, comparative method

Acknowledgements

I am lucky to have had many gifted circumstances before and during my life that have enabled me the opportunity to pursue this study.

I give a huge thank you to my advisors Sarah Adamowicz and Jinzhong Fu, for the attention to these projects and toward my broader goals. I appreciated you providing and encouraging opportunities wherever possible, and for the freedom to pursue my research areas of interest. I thank my committee members Stephen Marshall and Daniel Ashlock for the thought given toward these projects. I am grateful to my collaborators Shanlin Liu, Rui Zhang, Wenhui Song, Lili Zhou, and especially to Xin Zhou, who brought me into the world of transcriptomics and the 1000 Insect Transcriptome Evolution project. I’m glad to have had such wonderful lab mates, Tzitziki Loeza-Quintana and Robert Young, who have made this time meaningful more than in just academic aspects. I thank my family and my husband Brad Hall for their encouragement. I appreciate those who have organized and contributed to the funding that I received through the government of Canada and Ontario to conduct this research. Finally, I’d like to thank the researchers around the world who have contributed and made available the data that has enabled this work.

Funding Timeline

September 2012 to April 2013, May to August 2014: University of Guelph Integrative Biology PhD award to T.F.M.

May 2013 to April 2014: Government of Ontario and University of Guelph Ontario Graduate Fellowship to T.F.M.

September 2014 to August 2016: Natural Sciences and Engineering Research Council of Canada Alexander Graham Bell Graduate Scholarship (CGS-D) and University of Guelph Dean’s Tri-council Scholarship to T.F.M.

September 2012 to August 2016: Natural Sciences and Engineering Research Council of Canada Discovery Grants to S.J.A. (386591-2010) and J.F. (400479).

Declaration of Contributions

Specific acknowledgments for each individual data chapter, outside of the study authors, are given at the end of each chapter.

Chapter 2:

Mitterboeck, T. F., A. Y. Chen, O. A. Zaheer, E. Y. T. Ma, and S. J. Adamowicz. 2016. Do saline taxa evolve faster? Comparing relative rates of molecular evolution between freshwater and marine eukaryotes. Evolution (July).

Contributions: Conceived the experiment: S.J.A., T.F.M. Compiled the datasets and genetic data used: A.Y.C., O.A.Z., T.F.M. Performed PAML analysis: A.Y.C., T.F.M. Wrote the Python script: E.Y.T.M. Contributed ideas to the written document: T.F.M., S.J.A., A.Y.C. Performed pattern analysis, wrote the paper, generated figures and tables: T.F.M. Revisions for publication: T.F.M., S.J.A.

Chapter 3:

Mitterboeck, T. F., J. Fu, and S. J. Adamowicz. 2016. Rates and patterns of molecular evolution in freshwater vs. terrestrial insects. Genome (August).

Contributions: Conceived the experiment, input on analyses and concepts, and revisions for publication: T.F.M., S.J.A., J.F. Performed analyses, wrote the paper, generated figures and tables: T.F.M.

Chapter 4:

Mitterboeck*, T. F., S. Liu*, R. Zhang, W. Song, K. Meusemann, J. Fu, S. J. Adamowicz, and X. Zhou. 2016. Positive and relaxed selection in insect transcriptomes associated with the evolutionary gain and loss of flight. (In prep.). *planned shared first authorship and S.L. placed first for publication.

Contributions: Conceived the experiment: X.Z., T.F.M., S.L. Filtered the genetic data: S.L., K.M. Designed experiments: T.F.M., S.L., R.Z., W.S., J.F., S.J.A. Designed data sets and analyses: T.F.M., S.L. Bioinformatics for PAML analysis: S.L. Gene Ontology and HyPhy analysis: T.F.M. Wrote the paper, generated figures and tables: T.F.M. Revisions to written draft: T.F.M., S.J.A., J.F., S.L. [Note: Contributions may change by time of publication]

TABLE OF CONTENTS

Acknowledgements
Funding Timeline
Declaration of Contributions
List of Tables
List of Figures
List of Abbreviations
List of Supplementary Materials
Chapter 1: An introduction to molecular evolutionary rates and measures
OVERVIEW OF THESIS
Foreword on type of molecular evolution studied
PART 1: BACKGROUND ON MOLECULAR EVOLUTIONARY RATES IN A GENOME-WIDE CONTEXT ASSOCIATED WITH EVOLUTIONARY TRANSITIONS
Summary of Part 1
Background and scope of studies to be discussed
Evolutionary transitions and homology
Examples of evolutionary transitions
Methodological approaches to studying transitions and molecular rates
Measures of molecular evolutionary rates
Synthesis of support for each biological/ecological parameter on rates
Restrictions on transition directionality
The gap: transition direction vs. trait state
Examples and synthesis of studies suggesting some influence of ‘transitioning’
Methods: how could state- and direction-specific trends in molecular evolution be distinguished?
Conclusions and future work
Figures and Tables for Chapter 1 Part 1
PART 2: BACKGROUND ON GENOMIC-SCALE ASSESSMENT OF CONVERGENCE, POSITIVE SELECTION, AND/OR RELAXED SELECTION
Summary of Part 2
Scope of studies being discussed

Convergent molecular evolution
Positive and relaxed selection and their measurement
Methods of detecting convergent molecular evolution by gene type
Synthesis of literature and gaps
Concluding remarks
Figures and Tables for Chapter 1 Part 2
THESIS OBJECTIVES: STUDY QUESTIONS, HYPOTHESES, AND PREDICTIONS
Chapter 2: Do saline taxa evolve faster? Comparing relative rates of molecular evolution between freshwater and marine eukaryotes
ABSTRACT
Metabolic rate
Salinity
Ultraviolet radiation exposure
Effective population size
Diversification rate and novel niches
Life history and other correlates of aquatic habitat type
Do rates of molecular evolution differ across aquatic environments?
MATERIALS AND METHODS
Source studies and choice
Sequence data
Estimation of relative rates of molecular evolution
Analysis of relative rates of molecular evolution in saline vs. freshwaters
Tests for link with direction of habitat shift
Re-analysis of data from previous studies
RESULTS
No difference in freshwater vs. saline rates when considering all genes together
Higher molecular rates in freshwater lineages in protein-coding genes
Some trends within organismal groupings
Coding vs. non-coding genes
Little evidence for genome-wide effect of habitat upon rates
No single gene driving the pattern in protein-coding genes

No consistent pattern in continental saline vs. freshwater comparisons
No relationship between ancestral habitat and overall rates
Re-analysis of previous studies reporting habitat-specific rate differences
Some variation in results with different molecular measures
DISCUSSION
No general difference between freshwater and saline rates
Mutation rate differences vs. Ne effect
Positive or relaxed selection
Transition direction and the speciation-molecular evolution link
Lack of consistency among molecular measures and gene types suggests multiple influences
A comparison of marine patterns vs. continental saline lakes
Conclusions
ACKNOWLEDGEMENTS
LIST OF SUPPLEMENTARY MATERIAL
Figures and Tables for Chapter 2
Chapter 3: Rates and patterns of molecular evolution in freshwater vs. terrestrial insects
ABSTRACT
INTRODUCTION
MATERIALS AND METHODS
Collection of data
1) Analysis of relative molecular evolutionary rates
a) Estimation of relative rates for freshwater and terrestrial lineages
b) Analysis of habitat-linked patterns in molecular rates across sister pairs
2) Exploration of convergence associated with habitat in COI
a) Test of similar amino acid changes across all freshwater or all terrestrial lineages
b) Test of amino acid sites under positive selection
c) Examination of Song et al. (2014) amino acid changes in COI (3’ end) associated with habitat
d) Test of convergence in pairs of aquatic lineages
RESULTS
1) Relative rates

2) Convergence in COI
DISCUSSION
Possible reasons for lack of trends
Fit of relative molecular evolutionary rates with original hypotheses
Hypothesis 1: effective population size
Hypothesis 2: metabolic efficiency and oxygen use
Molecular convergence in COI based on habitat
Lotic vs. lentic freshwater habitats considered
Parallels between aquatic habitat and other biological and ecological traits
Implications
Conclusions
ACKNOWLEDGEMENTS
LIST OF SUPPLEMENTARY MATERIAL
Figures and Tables for Chapter 3
Chapter 4: Positive and relaxed selection in insect transcriptomes associated with the evolutionary gain and loss of flight
ABSTRACT
INTRODUCTION
METHODS
Source of genetic data
1. Exploratory test of positive selection in lineage leading to Pterygota
2. Exploring genes under positive selection with flight loss
3. Exploring genes under relaxed and/or positive selection with flight loss
4. Gene Ontology categories
5. Positive selection in energy-related genes in Hexapoda
RESULTS
1. Positive selection associated with the origin of flight
2. Positive selection associated with flight loss
3. Relaxed selection associated with flight loss
4. Gene Ontology analysis
5. Positive selection in nuclear and mitochondrial OXPHOS genes in hexapod lineages

DISCUSSION
Gene categories with signature of positive selection at the origin of Pterygota
Gene categories with signatures of positive and relaxed selection associated with flight loss
Positive selection in the major lineages of hexapods
Note on the exploratory analysis of gene categories
Next steps
Methodological comparisons
Caveats
Conclusions
ACKNOWLEDGEMENTS
LIST OF SUPPLEMENTARY MATERIAL
Figures and Tables for Chapter 4
Chapter 5: Thesis integration and conclusions
Summary of key findings
Synthesis of findings
Contributions to the field of molecular evolution and to addressing knowledge gaps
Implications
Limitations
Future work and recommendations
Final conclusions
Literature Cited
Appendix
Supplementary Material Ch2_S1
A) Table of species analyzed, including habitats occupied and the list of published studies used as data sources
B) Description of methods for data collation, including literature search terms employed and explanation of sequence inclusion/exclusion criteria
Search terms for source studies
Taxon inclusion criteria
Exclusion criteria for genetic data in a sister pair
Exclusion criteria for genetic data in a sister pair determined by distribution of relative overall substitution rates (OSRs)

C) Tables containing individual gene relative rates; figure presenting dN/dS ratios and dN and dS relative rates; and additional discussion of results
No relationship between ancestral habitat and overall rates
Some difference in behavior of molecular measures for total substitution rates
Supplementary Material Ch4_S1
Location of flight losses for nuclear gene positive selection analysis

List of Tables

Table 1.1. Summary of studies examining rates of molecular evolution associated with biological or ecological characteristics of organisms
Table 1.2. Examples of some genomic studies examining evolutionary transitions in the context of positive and/or relaxed selection, or convergent sites
Table 2.1. Summary of 148 freshwater-saline comparisons used in analysis
Table 2.2. All relative rates, including dN and dS rates, by taxonomic breakdown
Table 4.1. Gene Ontology (GO) categories from DAVID analysis of the positively selected genes in the lineage (‘P’) leading to Pterygota
Table 4.2. PANTHER Biological Process categories from analysis of positively selected genes in the lineage (‘P’) leading to Pterygota
Table 4.3. Gene Ontology or Pathway categories for genes detected to be under positive selection in 3 or more lineages with flight loss
Table Ch2_S1_1. 150 comparisons with genetic data analyzed
Table Ch2_S1_2. Summary of individual gene relative rates for Overall Substitution Rates (OSRs) analysis
Table Ch2_S1_3. Summary of dN/dS ratios for individual genes
Table Ch4_S1_1. Over- and under-representation of PANTHER Biological Process categories by the set of 1476 genes as compared with background genes from the Drosophila melanogaster genome
Table Ch4_S1_2. Summary of flight losses and data analysed
Table Ch4_S1_3. Gene Ontology (GO) categories from DAVID analysis of positively selected genes in the lineage (‘P’) leading to Pterygota
Table Ch4_S1_4. Gene Ontology categories from DAVID analysis for genes detected to be under positive selection in 3 or more flight-loss lineages
Table Ch4_S1_5. Gene Ontology categories from DAVID analysis for genes detected to be under positive selection in 3 or more fully flight-loss lineages (not including female-flightless lineages)
Table Ch4_S1_6. PANTHER Biological Process categories for genes detected to be under positive selection in 3 or more fully flight-loss lineages (not including female-flightless)
Table Ch4_S1_7. DAVID Gene Ontology categories for genes detected to be under positive selection in 3 or more flight-loss lineages, with those genes removed that were majority detected in parasitic flight-loss lineages
Table Ch4_S1_8. PANTHER Biological Process categories for genes detected to be under positive selection in 3 or more flight-loss lineages, with those genes removed that were majority detected in parasitic flight-loss lineages
Table Ch4_S1_9. Results of relaxed selection analysis via dN/dS ratios of mitochondrial OXPHOS genes in flightless vs flying lineages of pterygotes

List of Figures

Figure 1.1. Major underlying influences on genome-wide mutation or fixation rates
Figure 2.1. Summary of expected association between biological or ecological parameters and rates of molecular evolution
Figure 2.2. Relative saline: freshwater (SAL:FW) overall substitution rates (OSRs) across 148 comparisons
Figure 2.3. Relative saline: freshwater (SAL:FW) dN/dS ratios across 71 comparisons
Figure 2.4. Summarized overall substitution rates (OSRs) and dN/dS ratios by gene category
Figure 3.1. Composite phylogeny including aquatic and terrestrial lineages used in analysis
Figure 3.2. Relative aquatic: terrestrial (AQ:TER) dN/dS ratios across 42 sister comparisons
Figure 4.1. Tree topology and species used in analysis of nuclear genes
Figure 4.2. dN/dS ratios in flightless vs. related flying lineages for 13 mitochondrial protein-coding genes
Figure 4.3. Positive selection in hexapod lineages in nuclear and mitochondrial genes of interest
Figure Ch2_S1_1. Distribution of 402 pairs of OSR relative rate ratios
Figure Ch2_S1_2. Relative dN/dS ratios, dN rates, and dS rates across 71 comparisons
Figure Ch4_S1_1. Tree of 66 species used in the mitochondrial gene flight-loss relaxed-selection analysis

List of Abbreviations

OSR: overall substitution rate

dN: non-synonymous substitution rate

dS: synonymous substitution rate

dN/dS: non-synonymous-to-synonymous substitution ratio

FW: freshwater

SAL: saline

TER: terrestrial

Ne: effective population size

List of Supplementary Materials

Chapter 2:

Supplementary Material Ch2_S1 is provided in the Appendix in this thesis document

Supplementary Material Ch2_S2 is available on the University of Guelph Data Research Repository (and through the Evolution website)

Input/output files are provided in the Dryad digital repository http://dx.doi.org/10.5061/dryad.fq684

Chapter 3:

Supplementary Material Ch3_S is available on the University of Guelph Data Research Repository (and through the Genome website)

Input/output files are provided in the University of Guelph Data Research Repository

Chapter 4:

Supplementary Material Ch4_S1 is provided in the Appendix in this thesis document

Supplementary Material Ch4_S2 is available on the University of Guelph Data Research Repository

Chapter 1

An introduction to molecular evolutionary rates and measures

OVERVIEW OF THESIS

While life appears amazingly diverse, there is some degree of repeatability and predictability in evolution. The enormous variation in rates of molecular evolution across life’s lineages is, to some extent, shaped by biological and ecological characteristics of those lineages. Evolutionary pathways are generally not expected to repeat in exactly the same way in separate lineages, due to differences in the phylogenetic history of those lineages, their ecological and genomic context, and contingency. Ecological and genomic contexts are arguably more likely to be shared among lineages that are already similar than among those that are more disparate. However, among parallel instances of trait evolution across disparate groups of life, there can be common biological, ecological, or genetic properties, and these may be associated with similar causes or effects on molecular evolution. Thus, organismal traits may correspond with trends in molecular evolution. Such associations have been supported empirically in many instances, reviewed below, helping us to understand the underlying influences on molecular evolution and uncovering elements of predictability in genetic variation.

Two gaps remain. First, many biological or ecological characteristics have not been explored for association with molecular evolution, including several important traits applicable to a wide diversity of taxonomic groups. While some organismal characteristics have been explored on small taxonomic scales, it is unknown whether trends apply broadly. Second, evolutionary transitions may occur in one or both directions; traits can be gained but also lost. However, such shifts have most often been examined uni-directionally. To understand particular causes of shifts in molecular evolutionary rates and patterns associated with shifts in traits, the distinction between ‘trait’-associated and ‘transition’-associated trends should be explored.

In this thesis, I explore molecular evolution associated with three major ecological and biological transitions that have occurred multiple times throughout life’s history: the transition between living in the marine or freshwater realm (using 150 independent comparisons), between living in the freshwater or terrestrial realm (using 42 independent comparisons), and between living terrestrially or with the ability to take to the air (using 1 flight gain and 11 flight loss comparisons). As these transitions have occurred across a wide variety of life, the taxonomic foci of my thesis chapters are likewise diverse. The first chapter examines diverse aquatic eukaryotes, ranging from mammals to single-celled eukaryotes, while the latter two chapters are focused on Insecta, a group representing a large proportion of the earth’s multicellular species diversity. Molecular evolution, in terms of DNA sequence substitutions, was characterized through molecular evolutionary rates (Chapters 2, 3, and 4), DNA sequence sites under convergent evolution (Chapter 3), and types of genes under positive and relaxed selection (Chapter 4). Evolutionary shifts were additionally examined in both ‘forward’ and ‘reverse’ directions to test for trait- or direction-specific associations. These works have uncovered large-scale trends in molecular evolution corresponding with major repeated transitions.

Foreword on type of molecular evolution studied

Molecular evolution is the change in genetic material through time. Mutations, errors arising in genetic material, form the raw material for microevolution. Mutations can be categorised as point mutation, recombination, deletion, insertion, or inversion (Graur and Li 2000). This thesis focuses on point mutations and their fixation, together resulting in substitutions from the ancestral state, and the resulting observable differences in DNA sequences among lineages of extant organisms. Fixation occurs when one of two or more variants of a gene in the gene pool becomes the only variant present for that gene; in this way, new mutations that reach fixation become substitutions over evolutionary time. Although the various types of mutation differ, they may be causally linked, or the factors influencing their evolution may be the same. This is demonstrated by correlations among molecular measures: for example, between the rate of substitution and the rate of gene rearrangements (Shao et al. 2003, in insects) and between the rate of substitution and changes in genome size (Bromham et al. 2015, in plants). Additionally, the rate of molecular evolution is linked to macroevolutionary causes and effects (e.g. Jobson and Albert 2002), which gives further significance to the study of rates at the molecular level.

PART 1: BACKGROUND ON MOLECULAR EVOLUTIONARY RATES IN A GENOME-WIDE CONTEXT ASSOCIATED WITH EVOLUTIONARY TRANSITIONS

Summary of Part 1

The pace of molecular evolution is variable and has been shown to be linked to characteristics of organisms. I review studies investigating relative or absolute genome-wide molecular evolutionary rates associated with repeatedly evolved biological or ecological traits.

The main purposes of this review are to:

1. Provide an overview of the literature on molecular evolutionary rates associated with traits that have evolved multiple times.

2. Synthesize findings for each individual molecular rate correlate and estimate the relative support for each correlate in relation to others.

3. Highlight the theoretical distinction between observations for transition direction vs. trait state.

4. Summarise existing findings for traits investigated bi-directionally.

5. Identify key knowledge gaps in the field.

Background and scope of studies to be discussed

The molecular clock (Zuckerkandl and Pauling 1962, Margoliash 1963, Zuckerkandl and Pauling 1965, Kumar 2005) does not tick at a constant pace. Organismal traits or habitats have been linked to variation in rates of molecular evolution in many evolutionary lineages (e.g. Bromham and Leys 2005, Thomas et al. 2010). Studies investigating links between traits and molecular rates, sometimes referred to as ‘molecular rate correlate’ studies, have examined traits that have evolved multiple times, mainly through comparative frameworks (methods summarised by Lanfear et al. 2010a). These studies provide valuable information for supporting or refuting the underlying parameters proposed to affect molecular evolution, and they provide predictive knowledge on molecular evolution that can be applied across other groups. Information about systematic variability in rates could be helpful, for example, when a researcher is deciding whether a particular clock calibration can be applied for dating evolutionary events in a related taxonomic group. The evolution of many organismal traits or environmental features may be consistently linked to changes in parameters such as generation time or population size, which are hypothesized to affect rates of molecular evolution via mutation rate or via genetic drift, which influences the rate of fixation. For example, the life history characteristic of parasitism is often associated with decreased body size (among other biological variables), a factor hypothesized to influence rates of mutation. I summarise the relationships surrounding these underlying parameters that may affect the rate of mutation or fixation in Figure 1.1. Descriptions of the theory behind the links between some of these key parameters (body size, metabolic rate, generation time, and effective population size [Ne]) and molecular rates are provided in reviews by Bromham et al. (1996) and Woolfit (2009).

The scope of this review includes studies of absolute or relative rates of molecular evolution as thought to be indicative of genome-wide patterns––even if few genes were investigated––where a specific biological or ecological trait (such as shifts in habitat) was investigated in the context of multiple independent originations. I also include studies where the underlying hypothesized mechanism, such as generation time, was investigated. I do not include studies that describe rate variation among larger taxa in a single instance (such as between rodents and primates) without framing the study in the context of single or multiple organismal traits of interest, which are hypothesized to influence rates of molecular evolution. After providing an overview of the key findings in the field, I will synthesize the strength of support for each parameter as well as locate some knowledge gaps.

Evolutionary transitions and homology

The term ‘evolutionary transition’ is used here to signify change in organismal traits along a lineage, for example, the transition from a free-living lifestyle to parasitism. Here, I refer to ‘trait’ as any biological or ecological characteristic of interest (similarly to Bromham et al. 2015). Biological convergence occurs when similar traits, such as morphological phenotype, behaviour, or function, arise in independent lineages. Multiple evolutionary transitions can be interpreted as ‘convergent’ for the purposes of molecular correlate studies if there is functional similarity. There need not necessarily be underlying homology of structures. For example, the ability to fly can be similar in function in bats and insects, even if the structures that make up the wings are not homologous. Both cases affect population structure and metabolic demands in comparison with flightless relatives, which can influence the relative pace of molecular evolution. Repeated evolution has a greater probability of occurring within similar taxa; however, the probability depends on the particular characteristic of interest (Ord and Summers 2015). For example, convergent morphology is much more frequent among closely related taxa, while functionally redundant traits can occur to a greater degree among more distantly related taxa (Ord and Summers 2015). Deep homology, where genetic mechanisms are conserved across a wide range of lineages, can facilitate convergent evolution even among distantly related taxa, such as in the case of the multiple origins of ‘eyes’ (Gehring and Ikeo 1999). Similar outcomes in phenotype or function across diverse taxa may be associated with consistent patterns in molecular evolutionary rates.

Examples of evolutionary transitions

Evolutionary transitions in ‘traits’ are common in nature, some more dramatic (such as complete gains and losses of complex traits) and some more nuanced. Habitat shifts have occurred numerous times across life, on broader scales (among terrestrial, marine, and freshwater environments) or on narrower scales, such as shifts between plant hosts. Terrestrial to marine habitat shifts have occurred at least seven times in mammals (Vermeij and Dudley 2000, Foote et al. 2015), while freshwater to marine (or saline lake) shifts have also occurred many times (e.g. Betancur-R et al. 2010). Freshwater-terrestrial shifts have occurred at least dozens of times in insects (Dijkstra et al. 2014). Shifts in locomotive ability have also occurred, the most dramatic being the evolution and loss of powered flight. Powered flight has evolved at the origins of birds, bats, pterosaurs, and pterygote insects. It has been lost multiple (9+) times in birds (Shen et al. 2009), and numerous times in insects, such as within at least 25 families of Diptera (Wiegmann et al. 2011) and at least 30 times within particular families (Kits et al. 2013). Feeding habit shifts are also common, such as in flies (Wiegmann et al. 2011) and beetles (Hunt et al. 2007). Parasitism has arisen twelve times in angiosperm plants (Bromham et al. 2013), and multiple times in insects (e.g. Wiegmann et al. 2011). Many evolutionary transitions and their influence on overall molecular evolutionary rates have been examined (Table 1.1).

Methodological approaches to studying transitions and molecular rates

Investigating multiple independent comparisons, that is, applying the comparative method in evolutionary biology (Felsenstein 1985, Barraclough et al. 1998), increases our confidence that the observed correspondence between trait and molecular patterns is real on both a statistical and a biological level for the group of interest. Usually, the broader the taxonomic scope investigated, the greater the confidence that the particular pattern is general. Traits can co-vary, whether the relationship is correlated, necessary, or facilitative. The use of multiple comparisons that are as closely phylogenetically paired as possible can reduce confounding factors, as the same co-occurring factors may not be present in all instances of the evolution or loss of the trait of interest. Furthermore, noise can be reduced by careful selection of taxa to avoid severe confounding covariates (e.g. Mitterboeck and Adamowicz 2013) or through multivariate analysis (e.g. Bromham et al. 2015).

A review of molecular rate analysis methods used in molecular-correlate studies is provided by Lanfear et al. (2010a); these methods include the use of sister pairs and whole-tree analysis. A limited amount of genetic data is not necessarily a detriment in studies of molecular correlates; however, only strong genome-wide patterns are likely to be detected using a small number of genes. Furthermore, interpretation can be confounded when those chosen genes are also strongly subjected to evolutionary pressures that are not acting genome-wide, including positive or relaxed selection. Most commonly, mitochondrial genes are additionally expected to show trends in molecular evolution tied to locomotive ability or energy usage due to their function in energy production, and so, selection specific to one set of genes or one genome should be considered prior to concluding a “genome-wide” pattern.

Measures of molecular evolutionary rates

Molecular rates in molecular correlate studies are measured by overall substitution rates, non-synonymous-to-synonymous substitution (dN/dS) ratios, non-synonymous substitution rates (dN), and/or synonymous substitution rates (dS). The outcomes of these measures can lead to the generation of hypotheses about causes of the observed patterns.


The differences in expectation regarding rates of substitution between synonymous and non-synonymous sites stem from different selection pressures on changes at those sites. Neutrally-behaving mutations, i.e. those unaffected by selection, have their fixation determined by genetic drift (“the neutral theory of molecular evolution”, Kimura 1983). Synonymous changes (mainly changes at the 3rd codon position) result in the same amino acid in the resultant protein, and thus their fixation is more governed by genetic drift than selection. A change at a non-synonymous site (most mutations occurring at 1st and all at 2nd codon positions), by contrast, causes a change in the resultant amino acid coded for by those nucleotides, and thus dN sites are under greater selection pressures (positive, purifying, or balancing). However, smaller effective population size increases the strength of genetic drift relative to selection (due to sampling error), resulting in relaxed selection, which therefore mainly affects the rate of fixation of new mutations at 1st and 2nd codon positions. If most mutations that occur are deleterious or slightly deleterious and are normally eliminated by purifying selection, then reduced effective population size will increase the fixation of those slightly deleterious mutations due to their outcome being effectively neutral and their fixation governed by genetic drift (Ohta 1972, 1992). As a result, non-synonymous substitution rates are expected to increase with lower effective population size (“the nearly neutral theory of molecular evolution”, Ohta 1992).

Thus, differences in dN/dS ratios and dN between sister lineages more often indicate the type (positive, relaxed) and strength of selection, as partially governed by effective population size. By contrast, dS estimates for sister lineages can serve as a proxy for measuring relative mutation rate. Some studies of evolutionary shifts also include or focus exclusively on tests of gene-specific positive or relaxed selection (e.g. Shen et al. 2010, Foote et al. 2015); those are discussed in Part 2 of this thesis background (further below).
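The relationship among these measures can be sketched numerically. The example below assumes that counts of synonymous and non-synonymous differences and sites are already available for a pair of sequences, and applies a Jukes-Cantor correction to each proportion, in the style of simple counting methods. All function names and values are hypothetical, and the likelihood-based estimators typically used in practice are more involved.

```python
from math import log


def jukes_cantor(p: float) -> float:
    """Correct a proportion of differing sites for multiple hits (Jukes-Cantor)."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must lie in [0, 0.75)")
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(syn_diffs: float, syn_sites: float,
          nonsyn_diffs: float, nonsyn_sites: float) -> tuple[float, float, float]:
    """Return (dN, dS, dN/dS) from counts of differences and sites."""
    ds = jukes_cantor(syn_diffs / syn_sites)
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    if ds == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    return dn, ds, dn / ds


# Hypothetical counts for one pairwise comparison.
dn, ds, omega = dn_ds(syn_diffs=12, syn_sites=150, nonsyn_diffs=9, nonsyn_sites=450)
```

A ratio well below 1, as in this hypothetical case, is consistent with purifying selection; a higher ratio in one sister lineage than in the other is the kind of contrast discussed above.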

Synthesis of support for each biological/ecological parameter on rates

Many studies have uncovered trends in molecular rates across various groups of life (Table 1.1). While this in itself is a crucial component of the scientific endeavor, it can be difficult to elucidate causes. Often the trait of interest is hypothesized to associate with a particular cause of differential molecular evolutionary rates (such as increased mutation rate), rather than the particular parameter (mutation rate) being tested directly. As well, multiple causes of differential molecular rates are present for a single trait of interest, complicating interpretation. Here, I summarise the relative support for the association between these organismal traits or parameters and molecular rates from these various studies, though multiple causes may exist within each individual study.

For those studies that indirectly capture the proposed parameter tied to molecular rates through investigating a particular biological or ecological trait (studies with non-purple cells in Table 1.1), the support is greatest for the traits of 1) parasitism, 2) those traits hypothesized to link to effective population size, and 3) latitude (tested and supported mainly in plants). Saline continental lake habitat is also supported, though through a small number of independent comparisons. The parameter of metabolic rate is supported by locomotion studies, but only weakly or not at all when empirical measurement of metabolic rate was used. However, latitude or similar correlates (e.g. environmental energy), which is hypothesized to influence metabolic rate (Gillooly et al. 2001), is moderately well supported.

From those studies where the parameter associated with molecular evolutionary rates was empirically measured for taxa (studies with purple cells in Table 1.1), little support is provided for the metabolic rate hypothesis. Overall, generation time appears to be the best-supported mechanism in terms of consistency, which likely signifies the magnitude of effect is also large enough to overcome potentially confounding factors in various lineages. This support is perhaps unsurprising, as generation time affects both mutation rate, via the generation of copy errors per round of germ-line replication, and fixation rate, due to an influence upon the number of opportunities for allele frequency shifts per unit time. As a result, generation time corresponds with evolutionary speed.

These studies provide the opportunity to assess true effects and the significance of environmental mutagens over evolutionary time. For example, increased salinity inside a cell may affect DNA replication and repair (Favre and Rudin 1996), or UV exposure leads to certain types of mutations in experimental settings; molecular correlate studies put such factors in an evolutionary context.


Despite much work in this field, large gaps exist in evaluating molecular patterns associated with major evolutionary shifts. Broad habitat category shifts, such as marine-freshwater-terrestrial, have not been evaluated on a large scale. These shifts impact a large proportion of life’s diversity, and so understanding general trends in molecular rates would be an important area for further study. In addition, much of the work (Table 1.1) has been conducted on mammal and plant taxa; invertebrates are underrepresented as compared with their diversity in terms of species numbers and diverse biological and ecological characteristics.

Restrictions on transition directionality

Transition direction may make a difference in the associated trends in molecular evolutionary rates; thus, it is important to consider whether both transition directions are equally likely to have occurred in nature. The evolution of specific organismal characteristics can occur in a single direction or in both forward and reverse directions (e.g. gain and loss). Typically, complex traits are gained rarely, but losses are more common. Similarly, generalization occurs less frequently than specialization, and specialization and loss of traits can occur together (Adamowicz and Purvis 2006). Dollo’s law (Dollo 1893) describes the improbability of a reversion to an ancestral form along a single path; however, cases of reversals exist (reviewed in Porter and Crandall 2003).

Many examples of differential gain and loss of traits in nature have been documented in the literature, some with estimates of direction frequency. For example, the forward transition from generalization to specialization was observed to be significantly more common than the reverse transition in phytophagous insects, with a forward to reverse transition ratio of 1.47 to 1.76 depending on the analysis method used (Nosil 2002). Loss of morphological characteristics was estimated to be about 2 times more common than gain in chydorid (Adamowicz and Sacherová 2006). Similarly, flight in insects is hypothesized to have been gained only once, but lost many times (Roff 1990); wing loss was statistically estimated to be at minimum 11-1400 times more likely than wing regain for the hypothesis of wing regain in stick insects to be rejected based on parsimony (Whiting et al. 2003). However, considering biological realities of the speciation process, e.g. more frequent movement to isolated environments, the actual probability of flight loss may be higher than estimated statistically (Stone and French 2003). Finally, breeding system shifts in plants from outcrossing to self-fertilization appear to be irreversible, at least in some taxa (Igic et al. 2006, the family Solanaceae).

The gap: transition direction vs. trait state

Studies of molecular rate correlates are often not able to distinguish between ‘direction’-specific and ‘state’-specific trends. Often, an organismal transition will be associated with increased substitution rates (e.g. Lutzoni and Pagel 1997, Bromham et al. 2013, Mitterboeck and Adamowicz 2013), but the trait of interest itself was not investigated with reverse-directional transitions. Such investigations of both transition directions are limited in many cases due to differential prevalence of forward and reverse transition occurrences in nature.

Theoretically, there could be a difference between results observed for a transition in a specific direction, rather than for a trait itself. It is possible that increased molecular rates may associate with some trait shifts due to ‘transitioning’, rather than differential presence of parameters linked to molecular rates later in the clade. Transitions, to a new niche or habitat, involve some new aspect that could lead to rapid speciation and/or population bottlenecks. Thus, effective population size (Ne) effects may be common to many evolutionary transitions. Some of these Ne differences in lineages may indeed be prolonged, e.g. due to low dispersal ability of organisms in that lineage. However, the two possibilities––initial vs. long-term differences in molecular evolution––are not often distinguished. This can be a potential issue when the transitions examined are expected to increase the rate of molecular evolution, which is often the case. Additionally, this can be an issue when the trait of interest is hypothesized to be associated with reduced effective population size, which could confound properties associated with a specific trait with associated properties of the transitioning process. Traits that have evolved in bi-directional ways, such as shifts in habitat and flight ability, would be interesting cases to examine this question of transition vs. trait-specific trends.


Examples and synthesis of studies suggesting some influence of ‘transitioning’

Only a few studies exist to evaluate the effect of transition direction. McMahon et al. (2011) observed increased molecular rates early in the evolution of the parasitic Strepsiptera, but observed a slowdown in rates later in the clade, with rates similar to those elsewhere in the phylogeny. This suggests an increase in rates coincides with the transition to a new lifestyle. Of the studies citing direction of their transitions examined (in Table 1.1), only one (Korall et al. 2010) examined a transition to a trait expected to decrease the rate of molecular evolution for a single transition in their study. Interestingly, even in this case of a slowdown in rates, the initial change in rates was greater than that observed later in the clade. Furthermore, reverse direction transitions did not consistently show the opposite trend (an increase in rates), suggesting that the trait itself is not consistently linked to a differential molecular rate. Together, these results suggest that transition direction should be considered when studying shifts in rates of molecular evolution.

Methods: how could state- and direction-specific trends in molecular evolution be distinguished?

Some possible ways to distinguish trait- vs. direction-specific trends include testing for commonalities between results for both transition directions; this will inform what is common to the trait of interest in either context of forward or reverse direction shifts. If higher rates associated with a trait are only observed in the transition toward that trait, or with transitions in both forward and reverse directions, then this could indicate that transitioning itself plays a role in the pattern observed. Secondly, time calibration could be used to measure any difference in the molecular rates early in a clade or along the branch with the postulated change, and then later in the clade history (McMahon et al. 2011). This would help to distinguish whether an initial increase in rates is also followed by relatively higher rates within the clade as compared with lineages that do not possess that trait. This option could be especially useful when the transition is observed only or mainly uni-directionally within the particular taxa of interest. Finally, even without such tests, it could be considered whether the hypothesis of a trait/rate association involves shifts in effective population size (Ne) or relaxed selective constraints on multiple genes. Transitions that are expected to result in a decrease in rates could then be less of a concern for confounding effects in Ne than those expected to result in increased rates (though consider Korall et al. 2010). In this case, if decreased rates are observed, then any (likely positive) effect of reduced Ne or transition to new environment on molecular rates would have been counteracted by the true lower rates associated with the particular trait.
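The first of these approaches amounts to a small decision table, which can be sketched as follows. This is only an illustrative encoding of the reasoning above, not an established test: the function name, the three outcome labels, and the wording of the returned interpretations are all hypothetical. Each argument describes the rate in the derived lineage relative to its sister lineage that retains the ancestral state.

```python
def interpret_pattern(forward: str, reverse: str) -> str:
    """Interpret rate contrasts from both transition directions.

    forward: outcome for transitions toward the trait (derived lineage has the trait).
    reverse: outcome for transitions away from the trait (derived lineage lacks it).
    Each outcome is 'higher', 'lower', or 'none' for the derived lineage.
    """
    allowed = {"higher", "lower", "none"}
    if forward not in allowed or reverse not in allowed:
        raise ValueError("outcomes must be 'higher', 'lower', or 'none'")
    # Lineages with the trait are faster in both directions of transition.
    if forward == "higher" and reverse == "lower":
        return "state-specific: higher rates track the trait"
    # Lineages with the trait are slower in both directions of transition.
    if forward == "lower" and reverse == "higher":
        return "state-specific: lower rates track the trait"
    # Derived lineages are faster in both directions, or only toward the trait.
    if forward == "higher" and reverse in ("higher", "none"):
        return "transitioning may contribute"
    return "inconclusive"


# Hypothetical outcome: derived lineages are faster in both directions.
result = interpret_pattern("higher", "higher")
```

Combinations that the text does not address, such as a difference seen only in the reverse direction, are deliberately returned as inconclusive rather than assigned an interpretation.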

Conclusions and future work

These studies aid us in understanding the consistency with which molecular evolution is predictable by observing characteristics of organisms. These micro-evolutionary processes impact and are impacted by evolution at multiple levels, such as at the level of the individual genome, population genetics, lineage-level, and clade-level; thus, they are important to understand. More directly, knowledge of systematic biases in rates is important for the application of molecular clocks and of molecular species delimitation.

Here, I have synthesized some key literature on molecular evolutionary correlates, though the body of literature continues to grow. Many commonly occurring traits or evolutionary shifts that could influence molecular evolution have not been examined, while data for testing the hypotheses are currently available. Secondly, I suggest consideration of a possible effect of ‘transitioning’ on molecular rate patterns that could be further explored, especially with the increasing scope and depth of genetic and phylogenetic information available.


Figures and Tables for Chapter 1 Part 1

Figure 1.1. Major underlying influences on genome-wide mutation or fixation rates, together producing substitutions in DNA sequences between populations or species. Arrows indicate the direction of cause. Double-ended arrows indicate bi-directional causation. The main mechanisms linked to organismal traits discussed here are body size, effective population size, generation time, and metabolic rate, as well as ‘mutagens’ such as UV exposure, stress, salinity, and temperature; though these mutagens are more similar to biological or ecological traits, they are the explanatory link from traits to molecular evolutionary change. Generation time is one parameter expected to affect both the rate of mutation as well as fixation.


Table 1.1. Summary of studies examining rates of molecular evolution associated with biological or ecological characteristics of organisms where the association to rates is expected to be genome-wide (by causes such as mutation rate or Ne difference). Where associations with molecular rates may not be genome-wide, e.g. due to positive or relaxed selection acting upon specific genes or gene categories, we include qualifications in brackets beside the hypotheses. All parameters in the left-most column are expected to be positively associated with rates of molecular evolution. Studies with cells highlighted in purple examine a particular parameter expected to be directly linked to the mechanism causing trends in molecular rates, while studies with non-highlighted cells examine biological or ecological traits which indirectly relate to the mechanism(s) through their hypothesized correspondence with the mechanism(s). Ne = effective population size, nuc = nuclear, mit = mitochondrial, chlor = chloroplast, prot = protein coding, rib = ribosomal. Methods: ‘C’ indicates phylogenetically independent comparisons through sister pairs, ‘W’ through whole-tree analysis (all branches of a given trait are coded as the same type and assigned a single rate, in the context of analyzing a full phylogenetic tree). ‘N’ indicates that lineages with similar traits were grouped together such that there were not multiple independent comparisons analyzed.

| Biological or ecological parameter, and link to underlying mechanism | Study | Taxa | # comparisons, direction of evolutionary transition, method | Genes | Higher overall substitution rates or dN/dS ratios observed associated with parameter? | Proposed cause |
|---|---|---|---|---|---|---|
| Parasitic lifestyle: Reduced Ne, faster generation time, smaller body size, (relaxed constraints on resource production, positive selection tied to host-parasite arms race) | McMahon et al. 2011 | Insects - Strepsiptera | 1C non-endoparasitic to endoparasitic | Mit COI, NADH1, 16S; nuc 18S | Yes, early in clade (overall substitution rate) | Relaxed selection early in clade |
| | Kaltenpoth et al. 2012 | Insects - Hymenoptera | 1C non-parasitic to parasitic | Mit genome (37 genes), nuc 18S, Wgl, ArgK, PEPCK | Yes, mit genes (overall substitution rates) | Unknown |
| | Bromham et al. 2013 | Plants - angiosperms | 12C non-parasitic to parasitic | Mit matR, COI, NADH1, ATP1; nuc 18S, 26S; chloroplast rbcL, matK, 16S | Yes, 3 genomes (both dN and dS rates) | Mutation rate, potential generation time |
| Endosymbiotic lifestyle: Reduced Ne, (relaxed constraints on resource production) | Woolfit and Bromham 2003 | Bacteria and fungi | 13C non-endosymbiotic to endosymbiotic | Mit 16S | Yes (overall substitution rates) | Reduced Ne |
| Benthic habitat (dispersal): Smaller Ne | Foltz 2003 | Gastropod mollusks | 2W benthic vs. pelagic larval stage [unknown direction] | Mit COI, cytB | Yes (dN/dS ratios) | Smaller Ne |
| Lack of Flight (dispersal): Reduced Ne, (relaxed selective constraints on energy-related genes) (in birds, also confounded with island living) | Shen et al. 2009 | Birds | 9N flying to flightless (35 phylogenetically overlapping contrasts) | 13 mit prot, nuc EGR1, BDNF, NGF, NTF3 | Yes, mit (dN/dS). In nuc alone, marginally signif dN, no for dN/dS | Reduced Ne, (and relaxed selection on mit) |
| | Mitterboeck and Adamowicz 2013 | Insects | 49C flying to flightless | Including COI, COII, EF1a, IDH, RpS5, GAPDH, RpS5, Wgl, CAD, NF1, MDH, DDC, H3 | Yes, mit genes (dN/dS ratios, overall substitution rates) | Reduced Ne, (and relaxed selection on mit genes) |
| Island living: Reduced Ne | Johnson and Seger 2001 | Birds | 5N mainland to island (9 phylogenetically overlapping contrasts) | mit cytB, ND2 | Yes (dN/dS ratios) | Reduced Ne |
| | Woolfit and Bromham 2005 | Eukaryotes | 70C mainland to island [unknown direction] | Including COI, cytB, 12S, ITS | No for overall substitution rates, yes for dN/dS ratios | Reduced Ne |
| Saline lake habitat: Increased ultraviolet radiation (UV) exposure, salinity stress [effect on DNA replication and repair] | Hebert et al. 2002 | Crustaceans | 5W freshwater to saline lake | Including mit 12S, 16S, COI; nuc 18S, 28S | Yes (overall substitution rates) | Salinity and variation causing stress |
| | Colbourne et al. 2006 | Crustaceans | 3C freshwater to saline lake | Mit 12S, 16S, COI | Yes (overall substitution rates) | UV and salinity |
| Lower latitude: increased environmental energy - UV radiation, evapotranspiration, temperature (other potential factors: generation time, Ne, metabolic rate) | Bromham and Cardillo 2003 | Birds | 45C high vs. low latitude | Mit prot: cytB, NADH2 | No support (overall substitution rate, dN, dS) | No latitudinal effect observed |
| | Davies et al. 2004 | Plants | 86C high vs. low latitude | Nuc rib 18S; plastid prot rbcL, atpB | Yes (overall substitution rate) | Latitude best predictor |
| | Gillman et al. 2009 | Mammals | 130C warm vs. cooler habitats | Mit prot cytB | Yes (for overall substitution rate, no for dN/dS ratios) | Thermal environment affecting metabolic rate or indirect ‘Red Queen’ effect (co-evolution) |
| | Wright et al. 2006 and Gillman et al. 2010 | Plants | 45C temperate vs. tropical regions | Nuc non-coding ITS, 18S | Yes (overall substitution rate) | Temperature or linked variable (unclear mechanism) |
| | Rolland et al. 2016 | Reptiles (Squamates) | Unclear # C, differing by 5.3ºC | Nuc BDNF, c-mos, NT3, PDC, R35, RAG1, RAG2, mit 12S, 16S, cytB, ND2, ND4 | No support (overall substitution rate) | Rates not associated with temperature or absolute latitude |
| Herbaceousness: shorter generation times | Smith and Donoghue 2008 | Plants - angiosperms | 13C non-herbaceous to herbaceous | Nuc rib: 18S | Yes (overall substitution rate) | Generation time effects |
| Loss of Arborescence: shorter generation time | Korall et al. 2010 | Plants - ferns | 6W: 1 non-arborescent to arborescent; 5 arborescent to non-arborescent | chlor: atpA, atpB, rbcL, rps4 | Yes, lower rates with arborescent but not consistently higher rates for the reverse direction (overall substitution rates) | Unknown, generation time effects |
| Asexuality: accumulation of deleterious mutations | Hollister et al. 2015 | Plant | 10C+W sexual to asexuality | transcriptomes | Yes (dN/dS ratios) | Lack of selection against deleterious alleles |
| *Individual parameters isolated* | | | | | | |
| Faster generation time | Thomas et al. 2010 | Invertebrates | 54 for mit, 20 for nuc (may overlap) | Mit genomes, nuc rib 18S, 28S | Yes (dN and dS) | Generation time effect |
| Higher metabolic rate (expected: increase in DNA-damaging metabolites) | Lanfear et al. 2007 | metazoans | 107 comparisons | Including mit rib 12S, 16S; mit prot COI, COII, CytB, ND2, ND5; nuc rib 18S, 28S; nuc prot H3A, VWF, ATP7A | No support (dN and dS) | No metabolic rate effect observed. Body-size effect observed in mammals, not in invertebrates |
| Smaller body size | Thomas et al. 2006 | Invertebrates | 22 data sets with multiple pairs each | Various including COI, 18S, 28S | No support (dN and dS) | No body size effect observed in invertebrates |
| *Multiple parameters or traits examined in one taxonomic group* | | | | | | |
| Faster generation time | Bromham et al. 1996 | Mammals | 16+C | cytB, 12S, Beta-globin (introns and exons), Epsilon-globin (introns and exons) | Yes, for 2 of 3 protein-coding genes (overall substitution rate, no dN correlations for any parameters) | More DNA replications (introducing mutations) |
| Higher metabolic rate | | | | | No evidence | No metabolic rate effect observed |
| Smaller body size | | | | | Yes, for 2 nuc genes | Potential lower selective pressure on DNA replication fidelity and repair efficiency |
| Social lifestyle: Smaller Ne, greater replications per generation | Bromham and Leys 2005 | Eukaryotes, including insect taxa, , mole rats | 25C non-social to social | Including COI, EF1a, rhod, cytB, 28S, 16S, 12S, TTR, COI, COII, 18S | No for all comparisons, yes for most extreme social (overall substitution rates) | Ne not reduced and/or cell divisions not increased, except enough in extreme cases |
| Parasitic lifestyle (hypotheses listed above) | | | 8C non parasitic to parasitic | COI, COII, cytB, EF1a, 18S | Yes, significant (overall substitution rates) | Reduced Ne |

Gene short forms: ArgK = arginine kinase, PEPCK = phosphoenolpyruvate carboxykinase, matR = maturase R, COI = cytochrome c oxidase subunit I, COII = cytochrome c oxidase subunit II, cytB = cytochrome b, ND2/ND4/ND5 = NADH dehydrogenase subunits 2/4/5, NADH1/NADH2 = NADH dehydrogenase subunit 1/2, ATP1 = ATPase F1 alpha subunit, 12S = 12S ribosomal RNA, 16S = 16S ribosomal RNA, 18S = 18S small subunit ribosomal RNA, 26S = 26S ribosomal RNA, 28S = 28S ribosomal RNA, rbcL = ribulose-1,5-bisphosphate carboxylase/oxygenase large subunit, matK = maturase K, EF-1a = elongation factor 1, Wgl = wingless, IDH = isocitrate dehydrogenase, GAPDH = glyceraldehyde-3-phosphate dehydrogenase, RpS5 = ribosomal protein S5, CAD = carbamoyl-P synthetase/aspartate transcarbamylase/dihydroorotase, H3 = histone 3, NF1 = neurofibromin, BDNF = brain-derived neurotrophic factor, c-mos = oocyte maturation factor, NT3 = neurotrophin-3, PDC = phosducin, R35 = G protein-coupled receptor 35, RAG1/RAG2 = recombination-activating genes 1/2, H3A = Histone 3 alpha, VWF = von Willebrand factor, ATP7A = ATPase 7 alpha, ITS = internal transcribed spacer


PART 2: BACKGROUND ON GENOMIC-SCALE ASSESSMENT OF CONVERGENCE, POSITIVE SELECTION, AND/OR RELAXED SELECTION

Summary of Part 2

Genomic-level investigations of biological convergence support repeatability in evolution. Here, I summarise key studies (often including many genes) that investigate positive or relaxed selection pressures and/or convergence at amino acid sites tied to evolutionary transitions, with a focus on repeated evolutionary transitions.

The main purposes of this section are to:

1. Provide an overview of main literature findings on positive and/or relaxed selection, and convergent evolution, associated with parallel evolutionary shifts
2. Synthesize findings for convergent transitions from different studies and taxonomic groups
3. Identify some knowledge gaps in the field

Scope of studies being discussed

With the increase in genomic data available within and between species, increasing opportunity is available to investigate the functional associations with evolutionary shifts. In the case of multiple independent lineages examined, the theme is to discern in what aspects molecular evolution is convergent, corresponding with observed convergent phenotypic or functional evolution of organisms.

Convergent molecular evolution

Convergence is common at the phenotypic level but also occurs at the genetic level. On a broad scale of molecular trends, convergence can occur in the selection pressures acting on gene types. On a fine scale, molecular ‘convergence’ encompasses both parallel and convergent substitutions, where the former signifies the same end state (amino acid or nucleotide, etc.) from the same ancestral state in independent lineages, and the latter signifies the same end state from different beginning states (Zou and Zhang 2015). Where convergence occurs more than expected by chance, this can signify adaptive evolution through positive selection (Graur and Li 2000, Zou and Zhang 2015).
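The distinction between parallel and convergent substitutions at a single site can be made concrete with a short sketch. It assumes that ancestral and derived states have already been inferred for two independent lineages; the function name and the example residues are hypothetical and serve only to illustrate the definitions given above.

```python
def classify_substitution(anc1: str, der1: str, anc2: str, der2: str) -> str:
    """Classify a site in two independent lineages.

    anc1, anc2: ancestral states (e.g. amino acids) in lineages 1 and 2.
    der1, der2: derived (end) states in lineages 1 and 2.
    """
    both_changed = anc1 != der1 and anc2 != der2
    # Both lineages must have changed and must share the same end state.
    if not both_changed or der1 != der2:
        return "neither"
    # Same end state from the same ancestral state: parallel.
    # Same end state from different ancestral states: convergent.
    return "parallel" if anc1 == anc2 else "convergent"


# Hypothetical site: both lineages change from alanine (A) to serine (S).
site_class = classify_substitution("A", "S", "A", "S")
```

Counts of such sites across a gene are only informative relative to the number expected by chance, so a classification of this kind is a starting point rather than evidence of adaptation on its own.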

Positive and relaxed selection and their measurement

Positive selection, a directional change in allele frequency caused by the favourability of that allele, is often examined as a way to identify what adaptation occurred in a particular lineage. Relaxation of purifying selection can also occur, such as when constraint on that gene is no longer as strong, such as due to a change in ecological circumstances. Thus, both positive and relaxed selection in association with major evolutionary shifts are of great interest to researchers.

Branch-site statistical analysis methods, which allow selection pressures to vary among amino acids as well as among lineages, are currently commonly used for detection of positive selection. While positive selection may involve more dramatic molecular and phenotypic changes, via substitution to advantageous amino acids, relaxed selection allows changes to other, often slightly deleterious alleles. Thus, positive selection may act on few sites (Hughes 2007), and so gene-wide measures (such as dN/dS ratios > 1, used in the past) are less sensitive for detecting positive selection. Relaxed selection is expected to lead to increased dN/dS ratios at more sites, and thus, increased gene-wide dN/dS ratios on branches are often interpreted as relaxed selection (e.g. Foltz et al. 2003, Shen et al. 2009). Recently, more detailed relaxed selection methods have been developed examining dN/dS ratios across sites and branches; relaxed selection on a gene is detected by a certain proportion of sites having moderate dN/dS ratios, as compared to positive selection, whereby both very high and very low dN/dS ratios are observed across sites (RELAX, Wertheim et al. 2014).

Methods of detecting convergent molecular evolution by gene type

Genes showing evidence of convergent selective pressures can be identified via overlap in the genes with evidence of positive selection in the independent lineages, which could be followed by identification of sites that are convergent (e.g. Foote et al. 2015). Alternatively, the identification of which genes show convergent molecular evolution can be performed via comparing the tree support when the gene tree is forced to match the (multi-gene) species tree vs. the support for the gene tree when it is forced to match a tree where the phenotypically (or functionally) convergent lineages form one clade (e.g. Parker et al. 2013).
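The tree-support comparison can be sketched as a difference in per-site log-likelihoods under the two constrained topologies. The example below is a simplified illustration and not a reimplementation of any published pipeline: the function name, the sign convention (species tree minus convergent tree), and the log-likelihood values are assumptions, and the per-site values would in practice come from a phylogenetic likelihood program.

```python
def mean_support_difference(lnl_species_tree: list[float],
                            lnl_convergent_tree: list[float]) -> float:
    """Mean per-site difference in log-likelihood between two topologies.

    Negative values mean that, on average, sites fit the 'convergent' topology
    (convergent lineages forced into one clade) better than the species tree.
    """
    if len(lnl_species_tree) != len(lnl_convergent_tree) or not lnl_species_tree:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [s - c for s, c in zip(lnl_species_tree, lnl_convergent_tree)]
    return sum(diffs) / len(diffs)


# Hypothetical per-site log-likelihoods for a three-site alignment.
delta = mean_support_difference([-2.0, -3.0, -1.0], [-1.5, -2.0, -1.25])
```

A negative mean for a gene would need to be compared against the distribution obtained for control topologies or other genes before being read as evidence of convergence, in line with the baseline considerations discussed in this section.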

Some baseline degree of convergent evolution is expected by chance, and so this baseline should be taken into consideration when testing for biologically-significant convergence. For example, in the study by Foote et al. (2015) of marine mammals, there was more genomic convergent evolution detected in the terrestrial lineages (no habitat transition) than in the marine lineages; a similar phenomenon occurred for bats and non-echolocating mammals as compared with bats and dolphins (Zou and Zhang 2015). Caution is needed not to over-‘fit’ results to expectations; convergence in bat and dolphin lineages has been suggested to be linked to echolocation, locomotive ability, or brain size:body size ratio, in at least 3 separate studies (Table 1.2). As long as traits of interest are not always co-occurring with others, the use of multiple phylogenetically independent pairs can aid in reducing the confounding effects of co-occurring traits.

Synthesis of literature and gaps

The comparative nature of this work allows the inference of what genes or specific sites are involved in the evolution of certain organismal characteristics, outside of performing laboratory studies. While studies of molecular convergence are numerous, studies performing ‘genome-wide scans’ for convergence are fairly new due to the relatively recent and rapid advance in genomics. Multiple studies on independent lineages have examined similar transitions, enabling us to look for generalizations (Table 1.2): habitat shifts including marine-terrestrial-freshwater and altitude shifts, shifts in metabolic demands associated with locomotive ability or brain size evolution, and echolocation.

Each evolutionary transition reveals some degree of novelty in the types of genes under positive selection due to differing adaptations or context of those adaptations; however, commonalities are present in the gene types under differing selection pressures among similar transitions. Habitat shifts, such as from terrestrial to marine environments and from low- to high-altitude living, share some similarity in the detection of positive selection in genes related to tolerating lower oxygen conditions. Flight is one transition that is of great interest to researchers broadly and has been tested for trends in positive or relaxed selection in multiple studies. Through these investigations, a consistent pattern emerges: ‘energy-related genes’ are often implicated as experiencing a change in selective pressures associated with flight gain or loss. This trend has been similarly observed for lineages with otherwise differing metabolic demands or locomotive ability. However, most of the studies investigating flight utilize energy-related genes almost exclusively. Thus, such an examination including a realistic null hypothesis and exploratory context is still lacking in order to assess comprehensively the genes important in flight-ability shifts.

Taxonomically, much attention has been given to cetaceans and bats; however, the choice of studies to be included here is biased toward those with similar themes as this thesis and is not a comprehensive sampling of all studies in this field. The choice of taxa or questions to investigate in the field is likely to some degree driven by genomic and sequence data availability, as well as (or due to) the ‘charismatic’ nature of certain taxa and research questions.

Concluding remarks

This overview only includes a fraction of the studies available, with a focus on those subjects relevant to this thesis. Many repeated evolutionary shifts could provide an interesting opportunity for the exploration and assessment of convergence in positive and relaxed selection.


Figures and Tables for Chapter 1 Part 2

Table 1.2. Examples of some genomic studies examining evolutionary transitions in the context of positive and/or relaxed selection, or convergent sites. Tests of molecular rates may overlap with those in Table 1.1 (genome-wide rates), but the primary hypothesis or conclusion was positive or relaxed selection tied to function in specific genes. Mit=mitochondrial, nuc=nuclear, prot=protein-coding. The full meanings of gene and other abbreviations are listed below the table.

| Biological or ecological trait | Study | Taxa | # comparisons and direction | Genes | Findings, link to original hypotheses |
|---|---|---|---|---|---|
| Energy | | | | | |
| Flight gain and loss: adaptive or relaxed selection, respectively, evolution of energy-related genes | Shen et al. 2010 | Bats | 1 flight gain | Mit genomes, nuc OXPHOS and non-OXPHOS genes (7164 genes total) | Positive selection linked to energy metabolism |
| | Shen et al. 2009 (also in Table 1.1) | Birds | 9 flight losses | Mit genomes, 4 nuc genes: EGR1, BDNF, NGF, NTF3 | Relaxed selection in energy-related genes in flightless lineages |
| | Mitterboeck and Adamowicz 2013 (also in Table 1.1) | Insects | 49 flight losses | Including COI, COII, EF1a, IDH, RpS5, GAPDH, RpS5, Wgl, CAD, NF1, MDH, DDC, H3 | Relaxed selection (higher dN/dS ratios) associated with flightless lineages in mit genes |
| | Yang et al. 2014b | Insects | 1 flight gain, 4 flight loss | 13 mit prot genes | Positive selection associated with flying lineages and relaxed selection associated with flightless lineages |
| Larger brain:body size ratio | Ai et al. 2014 | Dolphin, bat, primate | 3 lineages: toward greater brain:body size ratio | IDH2 gene | Parallel evolution linked to energy demand in 3 groups |
| New locomotive style | Shen et al. 2012 | Dolphin and bat | 2 lineages: toward ‘new’ locomotive style | genomes | Higher convergence/parallelism in energy metabolism genes than in other genes |
| Locomotive speed | Strohm et al. 2015 | | Whole-tree (3 high-performance clades) | 13 mit prot genes, nuc prot PLAGL2, RIPK4, ZIC1 | Increased selective constraints in mit genes, with high energy expenditure |
| Habitat | | | | | |
| Terrestrial to marine | Sun et al. 2012 | Dolphin | 1 terrestrial to marine | 11,838 genes | Positive selection in certain gene categories (e.g. lipid transport) |
| | McGowen et al. 2012 | Dolphin | 1 terrestrial to marine | 10,025 genes | Positive selection in certain gene categories (e.g. nervous system) |
| | Foote et al. 2015 | Mammals | 3 terrestrial to marine | genomes | Positive selection and convergent site in certain genes in marine genomes (likewise in terrestrial genomes) |
| Marine to freshwater | Jones et al. 2012 | Fish - sticklebacks | 3+ (unclear) marine to freshwater populations | genomes | Regulatory changes common with adaptation to freshwaters |
| Low to high altitude | Wang et al. 2015 | Yak and Tibetan antelope | 2 low to high altitude | 7,950 genes | One gene with convergent evolution |
| Echolocation | | | | | |
| Echolocation | Parker et al. 2013 | Bats and dolphins | 3 non-echolocating to echolocating (2 gains, 1 loss in bats) | 2,326 coding genes | Convergence in genes linked to hearing and vision |
| Echolocation | Zou and Zhang 2015 | Bats and dolphins | 3 non-echolocating to echolocating (2 gains, 1 loss in bats) | 2,326 coding genes (from Parker et al. 2013) | No support for convergence |
| Convergent sites | | | | | |
| C4 photosynthesis | Besnard et al. 2009 | Plants - sedges | 5+ non-C4-photosynthesis to C4-photosynthesis | PEPC-encoding genes | Convergent amino acid changes |
| Toxin resistance | Ujvari et al. 2015 | Insects, amphibians, reptiles, mammals | 5 (4 gains, 1 loss of toxin resistance) | Na+/K+-ATPase gene | Convergent amino acid changes |

OXPHOS = oxidative phosphorylation genes; PEPC = phosphoenolpyruvate carboxylase; Na+/K+-ATPase = sodium-potassium adenosine triphosphatase; IDH2 = isocitrate dehydrogenase 2; EGR1 = early growth response protein 1; BDNF = brain-derived neurotrophic factor; NGF = nerve growth factor; NTF3 = neurotrophin-3; PLAGL2 = pleiomorphic adenoma gene-like 2; RIPK4 = receptor-interacting serine-threonine kinase 4; ZIC1 = zic family member 1


THESIS OBJECTIVES: STUDY QUESTIONS, HYPOTHESES, AND PREDICTIONS

All thesis chapters focus on 1) observing trends in molecular evolutionary substitutions corresponding with trait evolution and 2) understanding the causes of any trends observed.

Chapter 2 examines whether relative rates of molecular evolution in marine or saline lake eukaryotes are generally higher than those in related freshwater lineages, as is suggested by some previous work using a small number of phylogenetic comparisons (e.g. Hebert et al. 2002). Available genetic data are used. I hypothesize that molecular rates are increased in freshwater lineages relative to marine lineages due to smaller effective population size, potentially higher metabolic rate, and any effect of transitioning into a novel environment; alternatively, I hypothesize that molecular rates are higher in marine waters due to effects of salinity and ultraviolet radiation exposure. I hypothesize that inland saline waters combine all these effects relative to freshwaters and thus show the highest relative molecular evolutionary rates. This is a primarily exploratory analysis of patterns due to the potential existence of multiple causes of relatively higher molecular rates in each environment (marine, inland saline, or freshwater).

Chapter 3 examines whether relative rates of molecular evolution are generally higher for freshwater or terrestrial insects. Available genetic data are used. I hypothesize that molecular rates are increased in freshwater lineages relative to terrestrial lineages due to the greater patchiness of the landscape reducing gene flow and leading to smaller effective population size. Conversely, I hypothesize that molecular rates are higher in terrestrial environments due to greater diet specialization, leading to smaller effective population size. This is a primarily exploratory analysis because multiple mechanisms could plausibly produce relatively higher molecular rates in either freshwater or terrestrial lineages. Furthermore, I test whether an amino acid site in the mitochondrial cytochrome c oxidase subunit I (COI) gene is convergent based on aquatic-terrestrial habitat in a range of insect taxa, as has been reported in a group of beetles (Song et al. 2014). I hypothesize that COI will not show strong convergence at amino acid sites related to freshwater habitat caused by selection for greater metabolic efficiency, due to varying (and successful) methods of obtaining oxygen in freshwater taxa.

Chapter 4 examines the genes and gene categories under differing selection pressures with the evolution and loss of flight in insects. Transcriptome genetic data were used in collaboration with the Beijing Genomics Institute and the 1000 Insect Transcriptome Evolution Project (http://www.1kite.org/). I hypothesize that flight evolution is associated with positive selection in energy-related genes (among other gene categories) and flight loss associated with relaxed selection in energy-related genes; the former has been suggested by studies of bats (Shen et al. 2010), and the latter has been suggested in previous work in birds (Shen et al. 2009) and in insects (Mitterboeck and Adamowicz 2013), though in insects using only a small number of energy-related and non-energy-related genes. I expect to detect other novel categories of genes not related to energy production, which could signify additional genes, processes, or pathways important in the evolution or function of flight or flight loss.

In sum, this thesis presents three novel studies that advance knowledge of the correlation between evolutionary transitions and molecular evolution. Throughout these studies, the association between transition directionality and trends in molecular evolution is also investigated; such analyses are uncommon but may provide further insight into the correlates of molecular evolution. I conclude with a synthesis of findings across the studies and provide next steps for further research.


Chapter 2

Do saline taxa evolve faster? Comparing relative rates of molecular evolution between freshwater and marine eukaryotes

T. Fatima Mitterboeck1,2, Alexander Y. Chen1,2, Omar A. Zaheer1,2, Eddie Y. T. Ma1,3, Sarah J. Adamowicz1,2

1Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
2Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
3School of Computer Science, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada


ABSTRACT

The major branches of life diversified in the marine realm, and numerous taxa have since transitioned between marine and freshwaters. Previous studies have demonstrated higher rates of molecular evolution in crustaceans inhabiting continental saline habitats as compared with freshwaters, but it is unclear whether this trend is pervasive or whether it applies to the marine environment. We employ the phylogenetic comparative method to investigate relative molecular evolutionary rates between 148 pairs of marine or continental saline vs. freshwater lineages representing disparate groups, including bony fish, elasmobranchs, cetaceans, crustaceans, mollusks, annelids, algae, and other eukaryotes, using available protein-coding and non-coding genes. Overall, we observed no consistent pattern in nucleotide substitution rates linked to habitat across all genes and taxa. However, we observed some trends of higher evolutionary rates within protein-coding genes in freshwater taxa (the comparisons mainly involving bony fish) compared with their marine relatives. The results suggest no systematic differences in substitution rate between marine and freshwater organisms.

Keywords: relative rates, marine, saline, freshwater, comparative method, eukaryotes

The major lineages of life diversified in the marine realm, yet their subsequent evolution has involved crossing the barriers among saline, freshwater, and terrestrial environments multiple times independently. Of the 27 animal phyla with marine species, 16 phyla also possess freshwater species, though terrestrial environments may have been an intermediate stage within some of these phyla (Little 1990). The transition from marine to freshwater environments is relatively common in certain major animal groups (Lee and Bell 1999), while freshwater to marine movements are less common (McDowall 1997). For example, shifts from marine to freshwaters have occurred multiple times in fish (Sezaki et al. 1999, Lovejoy and Collette 2001, Yamanoue et al. 2011, Bloom et al. 2013) and invertebrate lineages (Rousset et al. 2008, Hou et al. 2011, Graf 2013, Botello and Alvarez 2013). Some groups of organisms, such as crustaceans, may be predisposed to more easily transitioning between aquatic habitat types, as supported by recent invasions of freshwater (Lee and Bell 1999). By contrast, eleven phyla of marine animals have failed to enter the freshwater environment (Little 1990). Among microorganisms, including unicellular eukaryotes, Archaea, bacteria, and viruses, switching between marine and freshwaters appears to be relatively infrequent (Logares et al. 2009, Logares et al. 2010), which initially seems somewhat surprising considering their generally large population sizes, short generation times, and often large dispersal potential (Logares et al. 2007). However, many factors may constrain the rate of transition between major habitats, especially physiological barriers (Lee and Bell 1999) but also low dispersal ability and niche saturation (Logares et al. 2009).

Freshwater and saline organisms may have different biological properties and experience different environmental pressures, which can exert influence upon their genome-wide rate of molecular evolution. The balance of these various pressures may produce predictable differences in molecular evolutionary rates in lineages inhabiting different aquatic realms. Saline waters have been previously associated with elevated rates of molecular evolution (Hebert et al. 2002, Wägele et al. 2003, Colbourne et al. 2006, Logares et al. 2010; discussed further below); however, this finding has not been explored on a large phylogenetic scale. In this study we consider relative molecular rates in three broad habitat categories: marine, freshwater, and inland saline environments. We first introduce several primary biological and environmental parameters that are expected to generally differ among these habitat categories and to influence rates of molecular evolution (summarized in Figure 2.1).

Metabolic rate

Since freshwater continental waters have lower ionic concentrations than saline waters, transition between these habitat types requires the appropriate osmoregulatory adjustments. The greater osmotic difference that exists between freshwater organisms and their environment, as compared with marine organisms, is expected to have a greater metabolic cost for regulation of ion concentration and ionic pressure (Lee et al. 2012). While vertebrates are generally osmoregulators in both saline and freshwater settings, invertebrates are generally osmoconformers in the marine environment and osmoregulators in freshwaters (Evans 2008). Thus, we expect freshwater invertebrates to incur greater metabolic expenditure associated with osmotic regulation than their marine counterparts, and we expect this difference to be more pronounced in invertebrates than in vertebrates. By contrast, larval-stage marine fish may have greater metabolic requirements than their freshwater relatives (Houde et al. 1994); however, overall for bony fish, there is no consensus on whether freshwater or marine species have greater metabolic costs associated with osmoregulation (Evans 2008).


Inland populations additionally experience greater chemical and physical variability of their habitat than marine organisms do, which can have implications for osmotic balance and metabolic rate. Inland saline lakes have greater variation in their salt concentrations than the marine environment across time (Bowman 1956, Frey 1993), which can place osmotic stress on organisms inhabiting the changing environment (Herbst 2001). As well, hypersalinity in inland saline lakes can increase the metabolic rate of organisms due to the cost of ion pumping (Hebert et al. 2002). Freshwaters and inland waters additionally have greater variation in temperatures, with temperature affecting metabolic rate (Gillooly et al. 2001). Increased metabolic rate is expected to increase the genome-wide mutation rate over evolutionary time (Martin and Palumbi 1993). The effect may be stronger for the mitochondrial genome than the nuclear genome due to the proximity of mitochondrial DNA to DNA-damaging metabolites; however, this idea is controversial (Galtier et al. 2009). Despite possible reasons to expect a correlation between metabolic rate and molecular evolutionary rate, there is currently minimal evidence linking metabolic rate itself with general substitution rate differences among lineages (e.g. Bromham et al. 1996, Lanfear et al. 2007). Although there are many variables that differ based on habitat that could influence metabolic rate, overall we expect invertebrates in inland saline lakes to incur the greatest relative effects on their molecular evolution due to metabolic costs, followed by freshwater, and then marine invertebrates.

Salinity

In addition to an effect mediated by metabolic rate, the salinity of the aquatic environment may directly impact the pace of molecular evolution (Hebert et al. 2002, Colbourne et al. 2006). Specifically, higher intracellular salinity is proposed to reduce the fidelity of DNA replication, as salinity impacts binding between protein and DNA and alters the performance of DNA polymerases (Favre and Rudin 1996). We expect that organisms inhabiting hypersaline inland waters would show the greatest molecular rate effects due to mutagenic effects of salinity itself, especially given fluctuations in salinity in the environment. We expect this effect to be genome-wide, based on observations from both nuclear and mitochondrial genomes in hypersaline taxa (Hebert et al. 2002). While intracellular salinities in both freshwater and marine fish are maintained within a regulated range for cell metabolism (Fiol and Kültz 2007), the concentrations of many electrolytes/osmolytes in extracellular plasma are higher in marine than freshwater fish, e.g. Ca2+, Mg2+ (Kapoor and Khanna 2004), Na+, and K+ (Volkenstein 1994). If salinity of environment itself leads to increased rates of molecular evolution, then marine organisms would be expected to show this effect relative to freshwater taxa.

Ultraviolet radiation exposure

The movement toward more saline environments may increase an organism’s exposure to ultraviolet (UV) radiation. UV penetration is affected by dissolved organic carbon (DOC), which acts as a UV barrier but is precipitated by high salt concentrations. Within the marine realm, coastal waters, which have higher DOC concentrations, have lower UV penetration than the open ocean (Tedetti and Sempere 2006). For similar DOC concentrations in continental waters, UV penetration is higher with greater salinity levels (Arts et al. 2000). Saline lake organisms may be further exposed to UV radiation, as UV radiation decreases with water depth (Tedetti and Sempere 2006), and saline lakes tend to be shallower than marine waters. UV radiation is known to have a direct mutagenic effect on DNA, and thus exposure may lead to increased molecular evolutionary rates in both nuclear (Smith et al. 1992, Lutzoni and Pagel 1997) and mitochondrial genomes (Colbourne et al. 2006). Although individual habitats can differ from general trends in UV prevalence (e.g. translucent oligotrophic freshwaters), we generally expect UV effects on molecular rates to be most prevalent in inland saline organisms, followed by marine, and then freshwater organisms, and especially in smaller-bodied organisms.

Effective population size

Inland populations experience reduced physical stability of habitats, increased vulnerability to population bottlenecks, and higher rates of extinction and colonization, as compared to those inhabiting marine environments, which tend to be more stable in the long term (Frey 1993, Grosberg et al. 2012). Along with greater habitat subdivision across the continental landscape and generally reduced gene flow among populations within a species, this greater habitat instability is expected to reduce the effective population sizes (Ne) of inland aquatic organisms (freshwater or saline) as compared with marine organisms (Whitlock and Barton 1997). Numerous studies suggest a trend of higher Ne of marine fish species than freshwater fish, including higher levels of gene flow (Ward et al. 1994, DeWoody and Avise 2000, Yi and Streelman 2005). Likely, not all marine species have larger Ne than related freshwater species, as some marine organisms are specialized to habitats occupying small geographic areas, such as on coral reefs (e.g. Underwood et al. 2012).

Smaller Ne increases the role of genetic drift relative to selection, in effect relaxing selective constraints, and is expected to lead to increased fixation of nearly neutral mutations (Ohta 1973, 1992, Woolfit and Bromham 2003). Thus, signatures of reduced Ne are expected to be increased overall substitution rates and increased non-synonymous-to-synonymous substitution ratios (Woolfit 2009). Ne-mediated influences upon substitution rates are expected to act genome wide but more strongly for mitochondrial DNA due to its approximately four-fold smaller Ne compared to nuclear DNA (Schmitz and Moritz 1998). As a general trend, we expect freshwater and inland saline organisms to display increased molecular rates attributable to reduced Ne when compared with marine species.

Diversification rate and novel niches

Evolutionary transitions in habitat expose transitioning lineages to novel environmental and biological conditions, which can result in increased speciation (Bloom et al. 2013). Inland waters, and especially freshwaters, are most often the recipient environment in saline-freshwater shifts. Speciation or net diversification rates may be increased in freshwaters (Hou et al. 2011), particularly in fish; freshwaters occupy less than 0.01% of the aquatic volume but account for over 41% of described fish species (Horn 1972). Studies testing this hypothesis within a comparative context have reached differing conclusions about the association between habitat and relative diversification rates. For example, there was no difference in net diversification rates between freshwater and marine clades of ray-finned fish (Vega and Wiens 2012). Higher rates of both speciation and extinction were estimated in freshwater vs. marine silverside fish (Bloom et al. 2013). Finally, higher speciation rates were estimated for marine pufferfish than their freshwater or brackish relatives, contrary to predictions relating to colonization of novel habitats (Santini et al. 2013). A link between speciation and molecular evolutionary rates (e.g. Lanfear et al. 2010b) could manifest in many ways (Barraclough and Savolainen 2001), either in genome-wide or gene-specific manners. Similarly, the direction of the habitat transition may influence molecular rate patterns. These associations could occur, for example, by a decrease in effective population size during the process of speciating or transitioning, or by increased substitutions in genes involved with the adaptation to novel niches (Rieseberg and Blackman 2010). We expect inland environments, both freshwater and saline, to show the greatest molecular rate effects due to transition direction or novel niche space, as these are most often the recipient environments in freshwater-marine/saline shifts.

Life history and other correlates of aquatic habitat type

Major developmental and lifestyle changes can additionally accompany saline-freshwater habitat shifts, such as changes from active-feeding to non-feeding larvae when entering freshwaters (Lee and Bell 1999, Kupriyanova et al. 2009). Marine fish are, on average, larger as adults than freshwater fish (Olden et al. 2007), although marine and freshwater fish larvae have similar growth rates (Houde et al. 1994). Body size itself is proposed to be negatively correlated with molecular evolutionary rates (Martin and Palumbi 1993). Multiple biological and life-history traits may differ, on average, based on habitat category, and these may co-vary. Given these complexities, including a large sample size of phylogenetically independent comparisons, which aids in controlling for confounding factors, is important for testing whether there is an association between a particular environment and rates of molecular evolution.

Do rates of molecular evolution differ across aquatic environments?

Previous studies of habitat-specific molecular rates have provided evidence of trends on small taxonomic scales, with higher molecular rates observed in saline habitats. Two studies have compared molecular rates between branchiopod crustaceans inhabiting continental saline waters and their freshwater counterparts, involving a total of 5 evolutionarily independent habitat transitions (Hebert et al. 2002, Colbourne et al. 2006). These studies hypothesized UV exposure (Hebert et al. 2002), or UV and salinity independently (Colbourne et al. 2006), as causes for the observed higher molecular rates in saline vs. freshwater lake taxa. Similarly, Wägele et al. (2003) observed faster relative molecular rates in marine than in freshwater Asellota isopods; their study involved a single sister clade pair. In prokaryotes, Logares et al. (2010) observed 7.5 times less molecular diversity within a freshwater clade of the bacterial SAR11 group than in its sister marine/brackish clade. This pattern was hypothesized to be due to higher dispersal in freshwaters, resulting in genetic homogenization, and/or to higher molecular evolutionary rates in the marine/brackish taxa. The higher rates in saline taxa were observed across all genes examined in these studies (1 to 5 genes per study), in both mitochondrial and nuclear genomes, and including coding (mitochondrial only: COI) and non-coding regions (12S, 16S, 18S, 28S). It remains unknown whether these elevated rates previously observed for continental saline branchiopods and for several marine organisms hold for saline environments broadly.

In this study we employ the comparative method to investigate whether a generalized relationship exists between salinity of aquatic habitat and molecular evolutionary rate. We analyzed 148 sister pairs of freshwater vs. marine or continental saline taxa (Table 2.1) for overall nucleotide substitution rates (OSRs) and non-synonymous-to-synonymous substitution (dN/dS) ratios in nuclear and mitochondrial non-coding and protein-coding genes. Contrary to previous expectations from studies linking saline habitat to higher molecular rates, we observed no general habitat-linked difference in OSR when considering all genes. However, we observed some exceptions, with freshwater lineages more often having higher substitution rates in protein-coding genes compared with saline lineages. Upon considering multiple mechanisms that could influence rates of molecular evolution, we could not pinpoint a clear mechanism to explain the trends observed in all genetic regions, considered together. The trends in protein-coding genes may reflect smaller effective population size and/or higher metabolic rate in freshwater lineages, possibly in combination with positive selection in select taxa and genes.

MATERIALS AND METHODS

Source studies and species choice

We searched Web of Science for published studies including molecular phylogenies of aquatic eukaryotes inhabiting waters of varying salinities. The following habitat descriptions were sought within the source papers or from other works on the same taxa: ‘marine’, ‘saline’, ‘brackish’, ‘euryhaline’, and ‘freshwater’ (search details and taxon inclusion criteria in Supplementary Material [SM] Ch2_S1). Phylogenetically independent sister clades or lineages, or paraphyletic paired lineages in some instances, were chosen so as to represent a salinity difference between sisters. Each entire source study tree, including outgroup lineages, was used in analysis. Comparisons were mainly between marine, continental saline, or brackish vs. freshwater categories. However, euryhaline species were occasionally used within either the “saline” (SAL) or “freshwater” (FW) category type, given a contrast in salinity of inhabited environment between the two sister lineages. We initially excluded comparisons that had already been analyzed for molecular evolutionary rate differences (Hebert et al. 2002, Wägele et al. 2003, Colbourne et al. 2006, Logares et al. 2010) so as not to pseudoreplicate results from prior studies that already tested for (and observed) habitat-specific rates. We separately analyzed all of these comparisons using the methods outlined here.

The source studies’ single or multi-gene tree topology was used, with preference for topologies created using maximum likelihood or Bayesian methods over maximum parsimony or neighbor joining (Hall 2005, Ogden and Rosenberg 2006). If multiple topologies were present within the same paper or taxon, we used the phylogeny that was built using the most genetic data or that included the most habitat information. We performed interspecific comparisons; however, considerations were made to maximize sample size of organisms per clade, especially in the case of unicellular organisms where species boundaries are difficult to define (details in SM Ch2_S1). We included the same number of terminal taxa in each sister clade to minimize the node density effect (Robinson et al. 1998), while maximizing genetic data available, between-clade salinity difference, and phylogenetic diversity (Robinson et al. 1998); after these considerations, sequences were selected from major sub-clades using the random number generator in R (R v2.11.1, R Development Core Team 2010). All input source study names, species names, habitat categorizations, postulated ancestral habitats, and sources of genetic data are given in the SM.
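
The equal-sized sampling of terminal taxa described above can be illustrated with a minimal sketch. It is written in Python rather than R, and it assumes the simplest case only: the larger clade is randomly subsampled to the size of the smaller one. The function name `sample_equal_taxa` and the taxon labels are hypothetical, and the additional criteria applied in the study (data availability, salinity contrast, phylogenetic diversity, sampling from major sub-clades) are not modeled.

```python
import random


def sample_equal_taxa(clade_a, clade_b, seed=None):
    """Randomly draw the same number of terminal taxa from two sister clades.

    Hypothetical helper: the sample size is the size of the smaller clade,
    so the larger clade is subsampled and the smaller one is kept whole.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    k = min(len(clade_a), len(clade_b))
    picked_a = sorted(rng.sample(list(clade_a), k))
    picked_b = sorted(rng.sample(list(clade_b), k))
    return picked_a, picked_b


if __name__ == "__main__":
    saline = ["sal_sp1", "sal_sp2", "sal_sp3", "sal_sp4", "sal_sp5"]
    fresh = ["fw_sp1", "fw_sp2"]
    print(sample_equal_taxa(saline, fresh, seed=1))
```

Matching clade sizes in this way targets the node density effect, since both sisters then contribute the same number of tips to the rate estimates.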

Sequence data

Molecular data were obtained from GenBank, Dryad Digital Repository, Treebase, the source study supplementary material, online links, or directly from the source study authors (sources noted in SM), with preference for original alignments used in the construction of the phylogeny presented in the source study. Unaligned sequences were aligned using ClustalW in MEGA version 5.2 (Tamura et al. 2011) or 6.0 (Tamura et al. 2013) or using the EMBL-EBI ClustalOmega (Sievers et al. 2011) online tool (http://www.ebi.ac.uk/Tools/msa/clustalo/). Protein-coding alignments were verified using amino acid translations to be free of stop codons and indels of 1-2 base pairs. Alignments for non-protein-coding genes, including those from the authors, were run through the online server Gblocks v 0.91b (Castresana 2000) in order to eliminate regions containing many gaps and uncertainty of homology, using the “less stringent” setting. Input and output files (alignment, topology, and PAML output) are available on the Dryad Digital Repository (http://dx.doi.org/10.5061/dryad.fq684).
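
The verification of protein-coding alignments described above could be automated along the following lines. This is an illustrative sketch only, not the procedure used in the study (which relied on inspecting amino acid translations). It assumes that each aligned sequence starts in reading frame 1 and that gaps are written as `-`. The default stop codons are those of the standard genetic code, which is an assumption: mitochondrial and other codes differ, so a different set should be passed for those genes. The function name is hypothetical.

```python
import re

# Assumption: stop codons of the standard genetic code. Mitochondrial and
# other genetic codes differ, so pass the appropriate set for those genes.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def check_coding_alignment_row(seq, stop_codons=STANDARD_STOPS):
    """Return (problem, 0-based position) tuples for one aligned coding sequence."""
    seq = seq.upper()
    problems = []
    # Scan complete codons in reading frame 1 for stop codons.
    for start in range(0, len(seq) - 2, 3):
        if seq[start:start + 3] in stop_codons:
            problems.append(("stop_codon", start))
    # Flag gap runs of 1 or 2 bases, which would shift the reading frame.
    for match in re.finditer(r"-+", seq):
        if len(match.group()) in (1, 2):
            problems.append(("short_indel", match.start()))
    return problems


if __name__ == "__main__":
    for row in ("ATGAAACCC", "ATGTAAGGG", "ATG--ACCC"):
        print(row, check_coding_alignment_row(row))
```

A sequence that returns an empty list passes both checks; any reported position would need manual inspection of the alignment.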

Estimation of relative rates of molecular evolution

The program baseml within the package PAML version 4.4 (Yang 2007) was used to estimate relative overall nucleotide substitution rates (we abbreviate ‘OSRs’), and the program codeml was used to calculate non-synonymous-to-synonymous substitution (dN/dS) ratios. Each sister lineage or clade was coded differently, so that a rate would be estimated for each separately from their point of divergence onward. There were often multiple sister clades coded in the tree, with the rest of the lineages assigned to the background rate category (for example, 7 sister clades would have 15 rate categories coded in the tree). Due to different species being available for the various genes within the same sister pair, and in order to examine patterns for each gene, we estimated relative rates for each gene separately. For OSRs, models of molecular evolution for each gene were estimated in MEGA using the source topology, and the best model by Bayesian Information Criterion without +G or +I parameters was used in PAML. We combined the relative rates (OSRs or dN/dS ratios) obtained for the separate genes for each sister pair in order to analyze overall molecular patterns across multiple phylogenetically independent habitat transitions. Concatenation of gene sequences was not possible due to different species having different genes available. Clade-wise rates are not provided by PAML for clades assigned rate classes. We therefore created a Python script to calculate estimates for dN and dS substitution rates for clades of interest from codeml outputs through averaging rates for sister lineages and adding internal lineages, starting from the lineage tips (script also provided on Dryad repository). With these values per gene, we concatenated the dN rates, dS rates, and dN/dS ratios across all genes per sister comparison by adding estimated substitution counts for individual genes (Mitterboeck and Adamowicz 2013). 
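
Two of the bookkeeping steps above can be sketched in a few lines. The sketch is not the Dryad script. The first function reproduces the count of rate categories (two per sister pair plus one background category, so 7 sister pairs give 15). The second pools per-gene estimates into a concatenated dN, dS, and dN/dS by summing estimated substitution counts and site counts across genes; the exact pooling formula is an assumption, as are the class and function names. The inputs correspond to the N, S, N*dN, and S*dS columns that codeml reports per branch.

```python
from dataclasses import dataclass


@dataclass
class GeneEstimate:
    """Per-gene values for one clade (hypothetical container)."""
    n_sites: float  # N: number of non-synonymous sites
    s_sites: float  # S: number of synonymous sites
    n_subs: float   # N*dN: estimated non-synonymous substitutions
    s_subs: float   # S*dS: estimated synonymous substitutions


def n_rate_categories(n_sister_pairs):
    """Two rate categories per sister pair plus one background category."""
    return 2 * n_sister_pairs + 1


def pooled_dn_ds(genes):
    """Pool genes by summing substitution counts and sites (assumed formula)."""
    total_n = sum(g.n_sites for g in genes)
    total_s = sum(g.s_sites for g in genes)
    dn = sum(g.n_subs for g in genes) / total_n
    ds = sum(g.s_subs for g in genes) / total_s
    if ds == 0:
        raise ValueError("no synonymous substitutions; dN/dS is undefined")
    return dn, ds, dn / ds


if __name__ == "__main__":
    genes = [GeneEstimate(300, 100, 6, 10), GeneEstimate(600, 200, 3, 20)]
    print(n_rate_categories(7), pooled_dn_ds(genes))
```

Summing counts rather than averaging per-gene ratios lets longer genes contribute in proportion to the number of substitutions they carry.
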
Total branch lengths were also similarly calculated for protein-coding genes using these estimated dN and dS substitution counts. We display the dN/dS ratios as relative by using a formula (1 – smaller/larger rate) and signing the metric, based on which habitat displayed the larger ratio (SAL>FW positive, FW>SAL negative) (as in Wright et al. 2006). However, for OSRs, relative rates are provided, and we could not concatenate the rates in the same way across genes using absolute counts of estimated substitutions. Instead, for each individual gene within a sister pair, the FW and SAL relative rates were divided larger over smaller. We subtracted 1 from these ratios, since if the sister rates were equal, a ratio of 1 would be produced; a distribution centered around zero was preferable for further analysis. This difference from 1 was next signed based on direction. Based upon sequence length, a weighted average of the signed differences across genes was calculated for each sister pair. To ensure that the displayed ratios are intuitive to interpret, we revert the summarized differences to be relative, i.e. by adding 1 to the positive and subtracting 1 from the negative overall result. We examined each gene for each sister pair against our minimum inclusion criteria, separately for OSRs and dN/dS ratios, in order to avoid including genes lacking information and those producing extreme rate differences (details in SM Ch2_S1). Therefore, the genetic data represented by each sister pair may not be exactly the same between OSR and dN/dS analyses.
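The two summary metrics described above can be sketched in a few lines of Python. This is a minimal illustration written for this text, not the script deposited on Dryad: the function names, the tuple layout of the input, and the handling of exactly equal rates (returned as 0 or as +1) are assumptions.

```python
from typing import List, Tuple


def signed_relative_ratio(sal: float, fw: float) -> float:
    """dN/dS display metric: 1 - smaller/larger, positive when SAL > FW."""
    if sal <= 0 or fw <= 0:
        raise ValueError("ratios must be positive")
    if sal == fw:
        return 0.0  # assumption: equal ratios carry no direction
    magnitude = 1.0 - min(sal, fw) / max(sal, fw)
    return magnitude if sal > fw else -magnitude


def signed_excess(sal: float, fw: float) -> float:
    """Per-gene OSR step: larger/smaller - 1, signed by direction."""
    if sal <= 0 or fw <= 0:
        raise ValueError("rates must be positive")
    excess = max(sal, fw) / min(sal, fw) - 1.0
    return excess if sal >= fw else -excess


def summarize_osr(genes: List[Tuple[float, float, int]]) -> float:
    """Summarize one sister pair from (SAL rate, FW rate, alignment length).

    The signed excesses are averaged with sequence length as the weight,
    then shifted away from zero by 1 so the result reads as a relative rate.
    """
    total_length = sum(length for _, _, length in genes)
    weighted = sum(signed_excess(s, f) * length for s, f, length in genes)
    weighted /= total_length
    if weighted > 0:
        return weighted + 1.0
    if weighted < 0:
        return weighted - 1.0
    return 1.0  # assumption: equal overall rates are reported as +1


# Hypothetical sister pair with two genes of equal length.
pair = [(1.2, 1.0, 1000), (1.0, 1.5, 1000)]
print(round(summarize_osr(pair), 2))
```

On this scale a summarized value of -1.04 means the freshwater rate exceeded the saline rate by 4%, and no value can fall strictly between -1 and +1.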

Analysis of relative rates of molecular evolution in saline vs. freshwaters

We tested for an overall habitat-related trend in the summarized relative OSRs and dN/dS ratios by two-tailed binomial tests (all tests are 2-tailed) with null expectation of 50% positive values (SAL>FW) and 50% negative values (FW>SAL). Statistical tests were performed in R (R v2.13.0, R Development Core Team 2010). Similarly, we tested specific gene categories. For OSRs we grouped the genes into two broad categories, ‘protein-coding’ and ‘non-coding’, as well as into five more specific categories: mitochondrial protein-coding, nuclear protein-coding, mitochondrial non-coding, nuclear non-coding, and chloroplast protein-coding. For dN/dS ratios, we grouped the genes into three categories of protein-coding genes: mitochondrial, nuclear, and chloroplast. Where we tested multiple gene categories, we consider the p-values without correction. We additionally repeated the above tests excluding summarized relative OSRs that were close to equal (<5% deviation between the two values). We furthermore parsed the data into three major groupings of organisms: vertebrates (cetaceans, elasmobranchs, bony fish), invertebrates (crustaceans, mollusks, annelids), and ‘micro-eukaryotes’ (we include algae and other small/unicellular eukaryotes). These categories were delineated based on general expected differences in body size, population size, and osmoregulation mechanism or capacity. Note that gene categories were differentially represented by these groupings. We tested for patterns within each gene category for these three independent organismal groupings by binomial test, whenever the gene category was represented by 6 or more sister pairs. We corrected for the number of tests per gene category (up to 3 organismal groupings) by sequential Bonferroni correction (Holm 1979). Summarized relative rates of protein-coding genes were additionally subjected to Wilcoxon signed-rank tests where the minimum sample size was 10; this was possible for protein-coding genes since branch length information was available. The data were verified via the procedures of Welch and Waxman (2008) and Garland et al. (1992) to prevent inclusion of low-information-content pairs or greater rate difference with more divergent pairs (excluded rate pairs given in SM2). A Wilcoxon signed-rank test, with null expectation of a median value of zero, was performed on the rate differences standardized by the square root of the sum of branch lengths.
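The sign test and the sequential Bonferroni step can be illustrated as follows. The analyses themselves were run in R; this Python sketch is only an assumed re-implementation using the standard library, with function names chosen for illustration. Pairs with exactly equal rates are assumed to be dropped before counting.

```python
from math import comb
from typing import List


def binomial_two_tailed(n_sal_higher: int, n_fw_higher: int) -> float:
    """Exact two-tailed binomial (sign) test against a 50:50 expectation."""
    n = n_sal_higher + n_fw_higher
    k = min(n_sal_higher, n_fw_higher)
    # Under p = 0.5 the distribution is symmetric, so double the lower tail.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


def holm_adjust(p_values: List[float]) -> List[float]:
    """Sequential Bonferroni (Holm) adjustment, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[idx] = running_max
    return adjusted


# Counts in the order (SAL higher, FW higher).
print(round(binomial_two_tailed(26, 45), 3))
# Two hypothetical p-values for two organismal groupings in one gene category.
print(holm_adjust([0.070, 0.50]))
```

With only six sister pairs, the test can fall below 0.05 only when all six pairs agree in direction (p = 0.03125), so six is the smallest sample for which a significant result is possible.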

We examined the concordance in habitat-linked relative rates across the genome by testing whether the directions (i.e. SAL>FW, FW>SAL) of the protein-coding mitochondrial and nuclear relative rates matched significantly more often than expected by chance, by binomial test. Additionally, individual genes represented by 6 or more sister pairs were tested for consistently higher rates in either habitat category by binomial test; multiple testing of individual genes was corrected by sequential Bonferroni correction (Holm 1979). Finally, we performed a binomial test on relative rates of inland-saline vs. freshwater comparisons alone.

Tests for link with direction of habitat shift

Since the act of transitioning into a novel environment may influence the relative molecular rates, we tested whether relative rates differed by transition direction. The direction of the habitat transition was inferred based upon either the postulated ancestral habitat information or phylogenetic mapping of habitats in the source studies. Comparisons were not included here if ancestral information was equivocal. We tested for directional patterns in OSRs and dN/dS ratios within each habitat direction category (i.e. FW to SAL, or SAL to FW) by binomial test, and between transition directions using Fisher’s exact test.
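A between-group comparison of this kind can be sketched with a small two-tailed Fisher's exact test. The implementation below is an assumed standard-library version written for illustration (the analyses were run in R), and the counts in the example are illustrative rather than a restatement of any table tested in this chapter.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are the two groups being compared (e.g. two transition
    directions); columns are SAL>FW and FW>SAL counts.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d

    def weight(x: int) -> int:
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = weight(a)
    # Sum every table that is no more probable than the observed one;
    # integer weights make the comparison exact.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(n, row1)


# Illustrative counts: group 1 has 6 SAL>FW and 3 FW>SAL pairs,
# group 2 has 7 SAL>FW and 24 FW>SAL pairs.
print(round(fisher_exact_two_tailed(6, 3, 7, 24), 4))
```

The binomial tests within each direction ask whether one habitat is faster more often than chance; the Fisher's test asks the separate question of whether the two directions differ from each other in that tendency.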


Re-analysis of data from previous studies

In order to test the conservativeness of our methods, we re-analyzed the data from previous studies that reported habitat-specific differences in relative rates of molecular evolution. We analyzed: the five crustacean comparisons present and overlapping in Hebert et al. (2002) and Colbourne et al. (2006), as well as re-analyzed the Daphnia comparisons using an updated phylogenetic hypothesis (Adamowicz et al. 2009); the single comparison from Raupach et al. (2004) in isopods, which contained the same transition and gene as Wägele et al. (2003); and a single comparison from Logares et al. (2010) in SAR11 bacteria. We analyzed these studies consistently with our above methods, including choosing a balanced number of terminal taxa for our sister comparisons.

RESULTS

No difference in freshwater vs. saline rates when considering all genes together

One hundred and forty-eight comparisons, each representing an independent transition in habitat state, had at least one gene that passed the inclusion criteria for OSR analysis (all comparisons are given in SM Ch2_S1 Table Ch2_S1_1). In total, across these sister pairs, 396 pairs of relative rates, each representing a single gene for OSRs, were included in subsequent statistical testing. Seventy-one sister pairs had at least one protein-coding gene that passed the inclusion criteria for dN/dS ratios; these contained a total of 236 pairs of dN/dS ratios for single genes. We present the results as follows: number of comparisons with SAL (saline clade) rate greater (positive direction), number of comparisons with FW (freshwater clade) rate greater (negative direction), (number of sister pairs exhibiting equal rates, if present), total N used for binomial test, p-value from binomial test. Bonferroni correction was applied to p-values where stated below, in cases where more than one set of data (a set having 6 or more data points) existed for that category of test. Across 148 sister comparisons (Fig. 2.2), neither the saline nor freshwater habitat category had higher OSRs, considering all genes together (67, 81, 148, p=0.29) (Fig. 2.4a). The median relative overall substitution rate was -1.04 across all genes (i.e. the freshwater lineage rate was >4% higher than the saline rate in half of the sister pairs). Across the 71 sister comparisons analyzed for dN/dS ratios (Fig. 2.3), neither habitat category more often had higher dN/dS ratios when considering all genes together (33, 38, 71, p=0.64) (Fig. 2.4b), with the median relative ratio being -1.05.

Higher molecular rates in freshwater lineages in protein-coding genes

Protein-coding genes tended to have faster OSRs in freshwater taxa. Among the gene categories tested for OSRs (Fig. 2.4a), all protein-coding genes together (26, 45, 71, p=0.032) and nuclear protein-coding genes (13, 26, (1), 39, p=0.053) displayed generally higher rates in the freshwater taxa. The median relative overall substitution rate was -1.10 across all protein-coding genes and -1.27 across the nuclear protein-coding genes. Mitochondrial protein-coding OSRs had a weaker tendency towards higher rates in freshwater taxa, while non-coding and chloroplast genes exhibited no difference in rates between habitat categories. These trends remained when excluding sister pairs displaying near-equal relative rates (SM Ch2_S2). dN and dS rates were each not significantly linked to habitat across the 71 comparisons (both 32, 39, 71, p=0.48). Among the three protein-coding gene categories, dN/dS ratios (Fig. 2.4b), dN, and dS rates each exhibited relatively even results between habitats (Table 2.2 and SM Ch2_S1 Figure Ch2_S1_2).

Some trends within organismal groupings

For OSRs, the vertebrate nuclear protein-coding genes had the most pronounced direction (12, 24, (1), 36; p=0.065) (Table 2.2); here, only one organismal category (vertebrates) had large enough sample size, and so no Bonferroni correction was applied. These were all bony fish comparisons. Mitochondrial protein-coding genes had more often, but not significantly, higher relative dS rates in freshwaters for vertebrates (19, 33, 52, p=0.070, p=0.14 corrected). Within these major taxonomic groupings, some other trends existed. Of the nuclear protein-coding OSR comparisons, the silverside fish group (Bloom et al. 2013) appears to have a different tendency than other fish (supported by a Fisher’s exact test, p=0.036), having a majority of comparisons SAL>FW (6, 3, 9, p=0.51). Excluding the silversides, and considering all remaining sister pairs, the nuclear protein-coding OSRs are strongly more often higher in FW across all taxa (7, 24, 31, p=0.0033) and for bony fish alone (6, 22, 28, p=0.0037). Within ‘micro-eukaryotes’, the algae and other eukaryotes (Figure 2.2) also appear to have different tendencies in OSRs across genes (there was mainly one gene available for those comparisons), with algae (7, 16, 23, p=0.093) and other eukaryotes (26, 14, 40, p=0.081) differing by Fisher’s exact test (p=0.010).

Coding vs. non-coding genes

The lack of a habitat-linked pattern in relative rates of evolution within non-coding genes does not appear to be due to a difference in availability of sequences among taxa. For comparisons having both protein-coding and non-coding OSR data, protein-coding gene rates were again more often higher in FW, while non-coding genes showed no association to habitat (protein-coding: 16, 30, 46, p=0.054; non-coding: 20, 26, 46, p=0.46). This was similar for bony fish comparisons alone (protein-coding: 8, 22, 30, p=0.016; non-coding: 13, 17, 30, p=0.58).

Little evidence for genome-wide effect of habitat upon rates

Directions of relative rates (e.g. SAL>FW or FW>SAL) were not strongly consistent between nuclear and mitochondrial protein-coding genes from the same taxa. The nuclear and mitochondrial relative rates were higher in the same habitat in about half of the sister pairs (OSRs: 21 of 39, and dN/dS ratios: 17 of 40), which would be the random expectation (SM Ch2_S2).

No single gene driving the pattern in protein-coding genes

Individual genes did not show significantly higher rates in either habitat. The most consistent habitat-linked pattern was in the recombination activating gene 2 (Rag 2), with generally higher rates in freshwater lineages (dN/dS ratios 6, 16, 22, p=0.052 uncorrected, p=0.52 corrected). Those genes with the most data points (>40 sister pairs) for OSRs were 18S, 16S, and CytB; for dN/dS ratios, the gene with the highest data availability was CytB. Each of these genes had relatively even directional results (SM Ch2_S1). The majority of individual genes more often had the FW lineage as having the higher rates (OSRs p=0.015, dN/dS p=0.36); however, gene relative rates are not independent as several genes are from the same organisms.


No consistent pattern in continental saline vs. freshwater comparisons

Only eight comparisons included inland saline habitats as the ‘saline’ category. These inland habitats vary in size and salinity and include inland brackish ‘seas’. For these eight comparisons alone, neither habitat had the higher OSRs (2, 6, 8, p=0.29) or dN/dS ratios (2, 1, 3, p=1.0). Two of the continental saline taxa were most similar to the types of comparisons used in previous literature—crustacean comparisons from small saline lakes. These two comparisons did not have consistently higher rates in the saline taxa (OSRs: 1, 1, 2, p=1.0).

No relationship between ancestral habitat and overall rates

There was no significant trend in either habitat having higher relative OSRs or dN/dS ratios when the comparisons were divided based on direction of the habitat shift (FW to SAL or SAL to FW directions) (SM Ch2_S2).

Re-analysis of previous studies reporting habitat-specific rate differences

All studies re-analyzed (Hebert et al. 2002, Colbourne et al. 2006, Adamowicz et al. 2009; Wägele et al. 2003/Raupach et al. 2004; Logares et al. 2010) gave higher molecular rates in the saline habitat category as compared with the freshwater habitat category for each of the comparisons when considering all genes together (SM Ch2_S2), in accordance with the results presented in those original studies. In a minority of the individual genes (from the studies of Hebert et al. 2002 and Colbourne et al. 2006), the freshwater lineage was calculated as having the higher rate. These minor differences from the original study results could be due to our use of a balanced number of tip lineages for each habitat category as well as to our method of estimating branch lengths.

Some variation in results with different molecular measures

Two methods of estimating relative branch lengths were used for protein-coding genes: 1) relative overall substitution rates (OSRs) from PAML baseml (Figures 2.2 and 2.4, Table 2.2), and 2) branch lengths in PAML codeml from estimated dN and dS substitution counts (Table 2.2 BLn columns). Branch lengths from baseml have been used more commonly in the literature (e.g. Woolfit and Bromham 2005, Bromham et al. 2013), while substitution counts can be useful to allow concatenation of rates across genes where different species are available (Mitterboeck and Adamowicz 2013). However, here, the raw relative rates (before any summary across genes) from these two methods corresponded only 78% of the time in terms of which habitat had the higher estimated rate in over 200 individual gene pairs tested (SM Ch2_S1). Part of the difference in methods could stem from differing weighting of substitution types; we observed from simple datasets that the latter method may estimate more substitutions at synonymous sites compared to non-synonymous sites (as compared with the former method), though this would need rigorous testing. Further difference between these methods can be added when summarizing rate estimates across multiple genes per sister pair, as our summary of relative OSRs may more equally weigh different gene results, while summary using substitution counts may give more weight to those genes that are evolving faster (Bromham and Leys 2005). While vertebrates had consistently higher freshwater than saline habitat rates using both methods, discrepancies existed within invertebrates (Table 2.2), and due to this we do not delve into biological interpretation of the (non-significant) invertebrate results. We focus interpretation on the OSRs, which were available for all gene types, along with the dN and dS rates alone.

DISCUSSION

Our results indicate that prior findings of higher molecular rates in continental saline lakes are not mirrored by an overall pattern of higher rates in saline environments when including the marine realm. Rather, we observed no general difference in relative rates between habitats, with some exceptions of higher rates in freshwaters for protein-coding genes, specifically for vertebrates and bony fish as a subset. While we did not directly measure differences among our included taxa in terms of key parameters that may influence their relative rates (relative salinity, UV, metabolic rate, effective population size, or diversification), here we discuss the findings in the context of possible causative factors. After considering these various parameters and their influence on molecular rates, we cannot attribute the observed weak trend in protein-coding genes combined with lack of trends overall to a single cause. The genes included in our analysis represent a small and biased, yet suitable, portion of the genome. The genes were those chosen for previous phylogenetic studies of the source taxa, likely selected with the intention of them being more conserved in function and sequence, providing phylogenetic resolution. Therefore, most of these genes are unlikely to have large differences in molecular rates due to positive selection based on habitat. For the purposes of our analyses investigating genome-wide rate differences, we were interested in patterns in molecular rates due to mutation rate differences or differences in the relative influence of genetic drift (vs. selection). We are not addressing specific genes under positive selection nor mechanisms of habitat adaptation here.

No general difference between freshwater and saline rates

The more conserved nature of the genes included could contribute to the observed lack of higher rates in either freshwater or saline organisms. However, this does not specifically explain the lack of trends as some of the same genes analyzed here have previously displayed systematic rate variability in association with environmental or biological parameters (e.g. COI, 16S, 18S) (e.g. Hebert et al. 2002, Mitterboeck and Adamowicz 2013). Furthermore, our methods do not appear to be conservative, as we obtained the same trends as the original authors upon re-analyzing studies reporting higher molecular rates in saline environments (Hebert et al. 2002, Wägele et al. 2003, Colbourne et al. 2006, Logares et al. 2010). In this study the non-coding DNA regions, including rRNA genes and several introns, do not follow the same trends as protein-coding genes. The included ribosomal gene regions may be more conserved across taxa as compared with the protein-coding sequences; however, our application of minimum inclusion criteria as well as visual inspection of alignments suggest that these sequences exhibit variability and substantially so in some cases, even after removing regions of uncertain homology. Both higher coding (e.g. COI) and non-coding (e.g. 16S and 18S) rates have been observed in studies investigating the hypothesis of mutation rate differences relating to habitats or organismal traits (e.g. Hebert et al. 2002, Bromham et al. 2013). Given the larger number of independent comparisons used here, we expected patterns in both coding and non-coding loci if consistent differences existed in genome-wide influences on molecular rates across habitats (e.g. Ne or mutation rate). Furthermore, although a large proportion of our total number of taxonomic comparisons had only one (non-coding) gene available, comparisons with multiple non-coding genes (in vertebrates) also did not show significant trends for those genes. Upon considering these points, it is unclear why there were no trends in non-coding genes, while patterns in protein-coding genes existed. The trends observed in this exploratory study would need further detailed investigation to fully delve into all causal mechanisms; nevertheless, here we consider the possible parameters acting to produce the observed trends in protein-coding genes.

Mutation rate differences vs. Ne effect

Patterns in non-synonymous-to-synonymous (dN/dS) ratios can be useful in indicating differences in positive or relaxed selection between lineages. Generally, synonymous sites should be freer to vary and approximate relative mutation rates, while non-synonymous sites can indicate selection or an impact of effective population size (Ne). For vertebrates, the slightly more often higher protein-coding overall substitution rates and dS rates in freshwater taxa suggest more often higher mutation rate in freshwater than saline lineages. Higher molecular rates in coding and non-coding mitochondrial loci have been observed in fish living in warmer waters (warmer due to either depth or latitude differences) compared with fish living in cooler waters (Wright et al. 2011). This observation was proposed to be due to higher metabolic rate in warmer waters. Higher mutation rates related to metabolism or temperature are a possible contributing influence on molecular rates in freshwaters, whether due to some inherent difference between freshwater and saline environments or to a latitudinal bias in the occurrence or sampling of freshwater vs. saline lineages (not tested here). Higher protein-coding overall substitution rates in freshwaters may also be due to reduced Ne in freshwaters. However, the effect of differences in Ne would be expected mainly in dN/dS ratios and dN rates, and furthermore, mitochondrial genes would be expected to show greater Ne effects than nuclear genes. Here, nuclear protein-coding dN relative rates are not strongly higher in freshwaters, while the mitochondrial protein-coding dN relative rates showed no statistical pattern. Thus Ne alone does not seem a likely mechanism for the protein-coding patterns that are strongest in nuclear genes.

Positive or relaxed selection

Relaxed or positive selection on relevant genes may be associated with entering a new ecological niche, such as upon transitioning between habitats or lifestyles (e.g. Shen et al. 2009). We did not explicitly investigate positive selection: it may act on only one or a few sites (Hughes 2007) and thus be difficult to detect using gene-wide measures of molecular rates. The most consistent habitat association for a single gene was observed in the recombination-activating gene 2 (Rag2), which plays a role in the adaptive immune response (Jones and Gellert 2004). Many evolutionary changes may be selected for upon changing environments (Jones et al. 2012), including immune function (e.g. Sun et al. 2013). It is possible that positive selection may have impacted patterns of molecular evolution for some individual genes included in our study. Relaxed selection can act upon gene categories, such as mitochondrial protein-coding genes involved with energy production. Higher dN/dS ratios, likely due to relaxed selection, have been observed in mitochondrial (but not nuclear) genes of specific groups of fish, insects, and birds with hypothesized lower energy requirements (Shen et al. 2009, Mitterboeck and Adamowicz 2013, Strohm et al. 2015). Given that here we did not observe trends in mitochondrial gene dN/dS ratios based on habitat, there may not be strongly different selection regimes related to metabolic costs between freshwater and marine fish.

Transition direction and the speciation-molecular evolution link

Our primary analyses considered habitat occupancy; however, the act of transitioning may additionally influence the rates of molecular evolution (e.g. McMahon et al. 2011). The comparisons included in this study reflect the reality of the freshwater environment being more often the newly colonized environment in saline-freshwater shifts. The trends of higher freshwater molecular rates are consistent with the hypothesis of a link between transition direction and molecular rates. However, the sample size for the reverse direction transition (freshwater to saline) was small, and no significant trends based on transition direction were detected. Further work investigating the distinction between rates associated with a trait vs. trait shift direction would improve our predictive power for molecular patterns across the tree of life. The results for bony fish do not seem likely due to the speciation-molecular rate hypothesis. Our obtained nuclear relative rates from Bloom et al. (2013) were higher in the marine environment, but they observed a greater speciation rate for freshwater than marine fish. Additionally, we observed greater molecular rates in freshwater pufferfish, while Santini et al. (2013) demonstrated higher speciation rate in marine lineages. However, this remains an area for further investigation using a larger sample size, particularly given that such a link has been demonstrated in several other taxonomic groups, such as plants (Barraclough and Savolainen 2001).

Lack of consistency among molecular measures and gene types suggests multiple influences

The results do not support a single strong genome-wide influence as the cause of the observed habitat-linked trend in protein-coding genes: first, the protein-coding gene trends were present more strongly when rates were estimated across all sites than when dN or dS estimates were examined separately; secondly, nuclear and mitochondrial protein-coding relative rates were not found to be concordant in direction. Rather, there may be multiple causes of habitat-linked differences in overall rates, likely varying among taxa. We suggest a possible combination of general influences in vertebrates or bony fish that match some of the nuances in the protein-coding results, but may be present weakly or only for certain taxa: 1) lower effective population size in freshwaters, leading to higher dN rates or dN/dS ratios genome wide (note: expected to be greater effect for mtDNA than nDNA. This difference was not observed, but matches with expectation #3); 2) higher metabolic rate in freshwaters producing greater mutation rate, leading to mainly higher dS rates genome-wide (note: possible greater effect for mtDNA, which was consistent with results); 3) if higher metabolic rate is present in freshwaters, then this may also lead to tighter selective constraints specifically on mtDNA, thereby reducing the dN or dN/dS ratios of mitochondrial genes in freshwater taxa; and 4) possible positive selection in some cases, e.g. in nuclear genes related to immune response. These potential influences, excepting gene-specific positive or relaxed selection, however, do not explain the lack of trend in non-coding genes. Given the complexity of the results overall (relative rates varying among genes and the subtle habitat-linked trend in some gene types and taxa), more detailed investigation of select taxa is needed to determine the genetic consequences of these aquatic habitat transitions.

A comparison of marine patterns vs. continental saline lakes

Prior studies reported strongly higher molecular rates in continental saline vs. freshwater branchiopod crustaceans, in contrast to our findings including the marine realm. This difference may arise due to differences in salinity, UV exposure, environmental variability, or possibly due to unique biological features of branchiopod crustaceans. The conductivities of the inland saline environments in previous studies (Hebert et al. 2002, Colbourne et al. 2006) ranged broadly (from 20,000 to 100,000 µS cm-1), while sea water has a conductivity nearer the lower end of this range (45,000 µS cm-1). On the other hand, parallels do exist between our study of mainly marine saline taxa and previous studies of inland saline taxa. In both cases, the higher rates are observed in the more recently colonized environment, and the new environment is also postulated to be more variable or harsh than the ancestral environment. Thus, if environmental variability or transition direction play a role in influencing rates of molecular evolution, such a mechanism would harmonize the findings of previous studies with the results from our study.

Conclusions

Through a comparative overview, we have demonstrated that the saline and freshwater environments have generally equal relative rates of molecular evolution across a range of taxa, with some exceptions of faster rates in freshwaters for protein-coding genes. Some of the causes of the previously observed patterns of higher rates in inland saline lakes vs. freshwater may be common to inland freshwater vs. marine waters as well. This study contributes to a growing body of knowledge of the environmental and biological correlates of molecular rate variability.

ACKNOWLEDGEMENTS

We thank these authors who were contacted and provided alignments from their studies: R. Betancur-R., A. Whitehead, Y. Yamanoue, A. Audzijonytė, M.J. Raupach, R. Väinölä, T.A. Richards, M. Carr, I. Cepicka, J. Bråte, K. Hoef-Emden, and L. Jardillier. We thank Tamara Tukhareli for assistance with literature review during the early stages of this project; Jinzhong Fu, Robert Hanner, Tzitziki Loeza-Quintana, Stephen Marshall, and Daniel Ashlock for helpful advice on the ideas or manuscript prior to submission; Robert Young for help toward an earlier version of a script to calculate rates from PAML; and three anonymous reviewers for their thoughtful input. This work was supported by the University of Guelph (Integrative Biology PhD Award and Dean’s Tri-council Scholarship to T.F.M., CBS Undergraduate Summer Research Scholarship to A.Y.C., Undergraduate Research Assistantship to O.A.Z.) and by the Natural Sciences and Engineering Research Council of Canada (Alexander Graham Bell Canada Graduate Scholarship to T.F.M., Undergraduate Student Research Award to O.A.Z., Discovery Grants 386591-2010 to S.J.A. and 400479 to Jinzhong Fu).

LIST OF SUPPLEMENTARY MATERIAL

SM Ch2_S1. Data collection methods and additional tables and figures summarizing the results
A) Table of species analyzed, including habitats occupied and the list of published studies used as data sources
B) Description of methods for data collation, including literature search terms employed and explanation of sequence inclusion/exclusion criteria
C) Tables containing individual gene relative rates; figure containing distribution of dN/dS ratios and dN and dS relative rates

SM Ch2_S2. (MS Excel). All molecular evolutionary rates results and tests conducted
Table Ch2_S2_1. All molecular rates including overall substitution relative rates, dN/dS ratios, dN and dS rates, habitat information, genetic data sources, postulated ancestral habitat states, genes included in each comparison, and all rates summarized by gene category
Table Ch2_S2_2. Testing subsets of data: comparing protein-coding vs. non-coding relative rates, excluding near-equal summarized overall substitution rates, and data points included in Wilcoxon signed-rank tests
Table Ch2_S2_3. Testing direction of habitat transition vs. relative rates
Table Ch2_S2_4. Testing concordances in relative rates between mitochondrial vs. nuclear protein-coding genes


Figures and Tables for Chapter 2

Figure 2.1. Summary of expected association between biological or ecological parameters and rates of molecular evolution in the mitochondrial and nuclear genomes for marine, inland saline, and freshwater aquatic habitat categories. Presence of arrows indicates that increased molecular rates are expected in association with that parameter in the habitat; size of arrows indicates expected relative strength of effect on molecular rates, relative to other habitat categories or between mitochondrial and nuclear genomes within a habitat category. Numbers indicate the following qualifications (see text for more information): 1) metabolic rate differences based on habitat may be greater for invertebrates, 2) the effect may be more likely for smaller organisms, 3) the effect on dN rates and dN/dS ratios is expected to be greater than the effect upon dS rates, and 4) a given effect may be limited to specific genes or be approximately equivalent to the Ne parameter. These are our general hypotheses for the habitat categorizations and likely do not apply to all individual taxa analyzed.


Figure 2.2. Relative saline: freshwater (SAL:FW) overall substitution rates (OSRs) across 148 comparisons. The bars represent the relative OSRs summarized across all genes analyzed for each phylogenetically independent habitat transition. The symbols represent the separately summarized mitochondrial and nuclear (protein-coding and non-coding separately) and chloroplast protein-coding gene results. Those bars or symbols above the x-axis indicate comparisons in which the summarized SAL relative rate was greater than the FW rate, and those below the x-axis indicate where the summarized FW rate was greater than the SAL. Note: shown are ‘relative rates’ (starting at 1) with direction, and so there are no values between -1 and +1.


The overall SAL rate was greater than FW in 67 sister pairs; the FW rate was greater than the SAL rate in 81 sister pairs (p=0.29). Taxa labels: Ceta. = cetacean; Elasm. = elasmobranch.

Figure 2.3. Relative saline: freshwater (SAL:FW) dN/dS ratios across 71 comparisons. For ease of viewing, we display 1 minus the smaller dN/dS ratio over the larger dN/dS ratio. Bars represent the overall dN/dS ratios across all genes for a given sister pair, and symbols represent the mitochondrial, nuclear, and chloroplast gene results. Those bars or symbols above the x-axis indicate comparisons in which the SAL dN/dS ratio was greater than the FW, and those below the x-axis indicate where the FW dN/dS ratio was greater than the SAL. The overall dN/dS ratios were higher in the SAL clade in 33 comparisons and higher in the FW clade in 38 comparisons (p=0.64). Taxa labels: Ceta. = cetacean; Elasm. = elasmobranch, Crustac. = crustacean, Algae = eukaryotes, algae.
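The two display transformations described in the captions of Figures 2.2 and 2.3 can be sketched as follows. This is a minimal illustration, not the script used for the figures; the function names are hypothetical, and treating exactly equal rates as positive is an assumption made only so that every input has a defined sign.

```python
def signed_relative_rate(sal: float, fw: float) -> float:
    """Larger/smaller rate ratio (always >= 1 in magnitude).

    Positive when the SAL rate is larger, negative when the FW rate is
    larger, so no value falls between -1 and +1 (Figure 2.2 convention).
    Equal rates return +1.0 (an assumption for illustration).
    """
    if sal <= 0 or fw <= 0:
        raise ValueError("rates must be positive")
    if sal >= fw:
        return sal / fw
    return -(fw / sal)


def signed_ratio_difference(sal: float, fw: float) -> float:
    """1 minus smaller/larger dN/dS ratio (Figure 2.3 convention).

    Positive when the SAL ratio is larger, negative when the FW ratio
    is larger; the magnitude lies between 0 and 1.
    """
    smaller, larger = sorted((sal, fw))
    if smaller < 0 or larger <= 0:
        raise ValueError("the larger ratio must be positive")
    magnitude = 1 - smaller / larger
    return magnitude if sal >= fw else -magnitude
```

Both functions keep the direction of the difference in the sign, which is what allows bars above and below the x-axis to be read as SAL-higher and FW-higher comparisons.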


Figure 2.4. Summarized overall substitution rates (OSRs) and dN/dS ratios by gene category. Rates within gene categories are summarized separately for each comparison, where the number of data points is the number of independent sister comparisons. These boxplots show the distribution of relative rates shown in Figures 2.2 and 2.3. The first boxplot in Part A ‘overall’ is the distribution of bar heights in the ‘overall’ gene category in Figure 2.2; the first boxplot in Part B is the distribution of bar heights in the ‘overall’ gene category in Figure 2.3; and other boxplots represent the distribution of symbols in those figures. Data among boxplots are not independent since plots overlap in SAL-FW sister pairs; i.e., multiple gene types are often available for the same sister pair. P-values are from 2-tailed binomial tests. The greatest SAL-FW differences are observed for OSR protein-coding genes overall (p=0.032) and nuclear protein-coding genes (p=0.053). Some results shown here overlap with those presented in Table 2.2. Gene categories: Nuc = nuclear, Mit = mitochondrial, Chlor = chloroplast, PC = protein-coding, NC = non-coding.


Table 2.1. Summary of 148 freshwater-saline comparisons used in analysis. Saline habitat type: M = Marine; Br = Brackish marine; Eu = Euryhaline; S = Saline continental; SBr = Large inland ‘Seas’ with brackish waters; HS = Hypersaline; “+” (e.g. ‘M+Br’) signifies the individuals’ habitats are known and a mixed-habitat clade was used, while “/” (e.g. ‘M/Br’) signifies the exact habitat state for each individual in a clade was not specified in the source study. Bolded gene names signify protein-coding genes. Comparisons listed are those that passed the inclusion criteria (see SM Ch2_S1). Comparisons refer to evolutionarily independent habitat shifts; these are numbered in accordance with the ordering of results in Figures 2.2 and 2.3 within this paper. SM Ch2_S1 notes where habitat information was obtained from sources other than the ‘source study’ (for the molecular data) given here.

| Taxa group | Taxa | Source study | Loci^a | Saline habitat type | Comparison # (Fig. 2.2) | Comparison # (Fig. 2.3) |
|---|---|---|---|---|---|---|
| Cetaceans | Toothed whales (Odontoceti, infraorder) | Hamilton et al. 2001 | CytB, 12S, 16S | M | 1-3 | 1-3 |
| Elasmobranchs | Sharks (Selachimorpha, superorder) | Vélez-Zuazo and Agnarsson 2011 | COI, ND2 | M | 4 | 4 |
| | Whiptail stingrays (Dasyatidae, family) | Sezaki et al. 1999 | CytB | M | 5-7 | 5-7 |
| Bony Fish | Sea (, family) | Betancur-R. et al. 2012 | CytB, ATP6, ATP8, Rag1.1, Rag1.2, Rag2, MYH6, 12S, 16S, Rag1.Int | M | 8-17 | 8-17 |
| | (Engraulidae, family) | Bloom and Lovejoy 2012 | CytB, Rag1, Rag2, 16S | M | 18-24 | 18-24 |
| | Terapontid grunters (, family) | Davis et al. 2012 | CytB, Rag1.1, Rag1.2, Rag1.3, Rag2, Rag1.Int1, Rag1.Int2 | M+Eu, Eu | 25-26 | 25-26 |
| | , herrings & relatives (Clupeoidei, suborder; Engraulidae excluded) | Lavoué et al. 2013 | COI, CytB, ATP6, ATP8, 12S, 16S | M+Eu, M, Eu | 27-30 | 27-30 |
| | (Belonidae, family) | Lovejoy and Collette 2001 | CytB, Rag2, Tmo, 16S | M | 31-35 | 31-35 |
| | Tubenose goby (Proterorhinus, genus) | Neilson and Stepien 2009 | COI, CytB, Rag1 | SBr | 36 | 36 |
| | ( genus) | Whitehead 2010 | CytB, Gylt, Rag1 | M | 37-40 | 37-40 |
| | Pufferfish (Tetraodontidae, family) | Yamanoue et al. 2011 | ATP6, ATP8, COI, COIII, CytB, ND1, ND2, ND3, ND4, ND4L, ND5, 12S, 16S | M/Br/Eu | 41-43 | 41-43 |
| | Sculpins (, family) | Yokoyama and Goto 2005 | 12S, CR | M | 44 | |
| | New World silversides (Menidiinae, subfamily) | Bloom et al. 2013 | CytB, ND2, Rag1, Tmo | M | 45-53 | 44-52 |
| Crustaceans | (Copepoda family) | Adamowicz et al. 2010 | 16S, 28S | M,S | 54-56 | |
| | Mysis ( genus) | Audzijonytė et al. 2005 | COI, CytB, 16S, ITS2, 18S | SBr | 57 | 53 |
| | Gammarus (Amphipoda genus) | Hou et al. 2011 | COI, 18S, 28S | M | 58-60 | 54-56 |
| | Gammaracanthus (Amphipoda genus) | Väinölä et al. 2001 | COI | SBr, Br | 61-62 | 57-58 |
| | Palaemoninae ( subfamily) | Botello and Alverez 2013 | 16S | M, M+Br | 63-66 | |
| | Typhlatya (Decapoda genus) | Hunter et al. 2008 | COI, CytB, 16S | M | 67 | 59 |
| Mollusks | Bivalves (Bivalvia, class) | Bieler et al. 2014 | COI, H3, 16S, 18S, 28S | M | 68-70 | 60-62 |
| | ( superfamily) | Strong et al. 2011 | 16S, 28S | M/Br | 71-73 | |
| | Hydrobiidae (Gastropoda family) | Haase 2005 | COI, 16S | Br | 74-76 | 63-65 |
| Annelids | Earthworms, & allies (, class) | Rousset et al. 2008 | 18S | M, Br, M+Br | 77-85 | |
| Eukaryotes, other | Cyclotrichiida (Ciliophora order) | Bass et al. 2009 | 18S | M | 86 | |
| | Choanoflagellatea (class) | Carr et al. 2008 | 18S | M | 87 | |
| | Centrohelida (Heliozoa class) | Cavalier-Smith and von der Heyden 2007 | 18S | M | 88-96 | |
| | Dinoflagellata (phylum) | Logares et al. 2007 | 18S | M | 97-109 | |
| | Bicosoecida, Placididea & relatives (within phylum Heterokonta) | Park and Simpson 2010 | 18S | M, M+HS | 110-112 | |
| | Euglyphida (Cercozoa order) | Heger et al. 2010 | 18S | M | 113-116 | |
| | ( family) | Smirnov et al. 2007 | 18S | M, Br | 117-119 | |
| | Heterolobosea (Excavata class) | Pánek et al. 2012 | 18S | M+Br, M | 120-121 | |
| | (phylum) | Bråte et al. 2010 | 18S | M | 122-125 | |
| Eukaryotes, Algae | Thalassiosirales (Heterokontophyta order) | Alverson et al. 2007 | psbC, rbcL, 18S, 28S | M | 126-131 | 66-71 |
| | Raphidophyceae (Heterokontophyta class) | Figueroa and Rengefors 2006 | 18S | M | 132 | |
| | ( class) | Henley et al. 2004 | 18S | M/S | 133 | |
| | (Cryptophyta genus) | Hoef-Emden 2008 | 18S, 28S | M | 134-136 | |
| | (Cryptophyta class; Chroomonas excluded) | Shalchian-Tabrizi et al. 2008 | 18S | M | 137-140 | |
| | Haptophyta (division) | Simon et al. 2013 | 18S | M, M+S | 141-148 | |

^a Mitochondrial protein-coding genes: ATP synthase F0 subunit 6 (ATP6) and 8 (ATP8); cytochrome c oxidase subunit I (COI) and III (COIII); cytochrome b (CytB); NADH dehydrogenase subunit 1 (ND1), 2 (ND2), 3 (ND3), 4 (ND4), 4L (ND4L), 5 (ND5). Nuclear protein-coding genes: glycosyltransferase (Gylt); histone 3 (H3); myosin heavy chain 6 (MYH6); recombination activating gene 1 (Rag1) part 1 (Rag1.1), part 2 (Rag1.2), part 3 (Rag1.3), and 2 (Rag2); toluene monooxygenase (Tmo). Chloroplast protein-coding genes: photosystem II CP43 protein (psbC); ribulose bisphosphate carboxylase large chain (rbcL). Mitochondrial non-coding genes: 12S (12S) and 16S (16S) ribosomal RNA; mitochondrial control region (CR). Nuclear non-coding genes: Rag 1 intron 1 (Rag1.Int1) and 2 (Rag1.Int2); second internal transcribed spacer (ITS2); 18S (18S) and 28S (28S) ribosomal RNA.


Table 2.2. All relative rates, including dN and dS rates, by taxonomic breakdown. P-values presented in the table are raw (those less than 0.1 are bolded). In cases where correction for multiple tests was performed (see Methods), the significance value is indicated with a footnote. Abbreviations: Vert = vertebrates; Invert = Invertebrates; Micro = ‘micro-eukaryotes’; Chl = chloroplast; Mit = mitochondrial; Nuc = nuclear; BLn = branch lengths calculated from dN and dS substitution counts.

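Overall substitution rates (PC = protein-coding; NC = non-coding)

| Sister pairs^a | Direction | All | PC All | PC Mit | PC Nuc | PC Chl | NC All | NC Mit | NC Nuc |
|---|---|---|---|---|---|---|---|---|---|
| All: 1-148 | SAL>FW (+) | 67 | 26 | 27 | 13 | 3 | 61 | 23 | 45 |
| | FW>SAL (-) | 81 | 45 | 38 | 26 | 3 | 62 | 22 | 46 |
| | binomial p | 0.29 | 0.032 | 0.21 | 0.053 | 1.0 | 1.0 | 1.0 | 1.0 |
| | Wilcoxon p | - | 0.038 | 0.12 | 0.063 | - | - | - | - |
| Vert: 1-53 | SAL>FW (+) | 22 | 20 | 24 | 12 | - | 16 | 15 | 4 |
| | FW>SAL (-) | 31 | 32 | 28 | 24 | - | 18 | 17 | 3 |
| | binomial p | 0.27 | 0.13^b | 0.68 | 0.065 | - | 0.86 | 0.86 | 1.0 |
| | Wilcoxon p | - | 0.046^b | 0.24 | 0.075 | - | - | - | - |
| Invert: 54-85 | SAL>FW (+) | 12 | 3 | 3 | 1 | - | 12 | 8 | 8 |
| | FW>SAL (-) | 20 | 10 | 10 | 2 | - | 14 | 5 | 13 |
| | binomial p | 0.22 | 0.092^c | 0.092^b | 1.0 | - | 0.85 | 0.58 | 0.38 |
| | Wilcoxon p | - | 0.22 | 0.24^b | - | - | - | - | - |
| Micro: 86-148 | SAL>FW (+) | 33 | 3 | - | - | 3 | 33 | - | 33 |
| | FW>SAL (-) | 30 | 3 | - | - | 3 | 30 | - | 30 |
| | binomial p | 0.80 | 1.0 | - | - | 1.0 | 0.80 | - | 0.80 |
| Bony fish^e: 8-53 | SAL>FW (+) | 18 | 17 | 20 | 12 | - | 13 | 12 | 4 |
| | FW>SAL (-) | 28 | 28 | 25 | 24 | - | 18 | 17 | 3 |
| | binomial p | 0.18 | 0.14^b | 0.55 | 0.065 | - | 0.47 | 0.46 | 1.0 |
| | Wilcoxon p | - | 0.032^b | 0.23 | 0.075 | - | - | - | - |

dN/dS ratios – Protein-coding only (All = all protein-coding genes)

| Sister pairs^a | Direction | All dN | All dS | All dN/dS | All BLn | Mit dN | Mit dS | Mit dN/dS | Mit BLn | Nuc dN | Nuc dS | Nuc dN/dS | Nuc BLn | Chl dN | Chl dS | Chl dN/dS | Chl BLn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All: 1-148 | SAL>FW (+) | 32 | 32 | 33 | 31 | 31 | 28 | 32 | 28 | 16 | 18 | 19 | 17 | 2 | 3 | 2 | 2 |
| | FW>SAL (-) | 39 | 39 | 38 | 40 | 34 | 37 | 33 | 37 | 23 | 21 | 21 | 23 | 4 | 3 | 4 | 4 |
| | binomial p | 0.48 | 0.48 | 0.64 | 0.34 | 0.80 | 0.32 | 1.0 | 0.32 | 0.34 | 0.75 | 0.87 | 0.43 | 0.69 | 1.0 | 0.69 | 0.69 |
| | Wilcoxon p | 0.054 | 0.48 | 0.76 | 0.14 | 0.24 | 0.33 | 0.86 | 0.46 | 0.44 | 0.60 | 0.80 | 0.61 | - | - | - | - |
| Vert: 1-53 | SAL>FW (+) | 25 | 21 | 27 | 19 | 26 | 19 | 28 | 18 | 15 | 17 | 17 | 16 | - | - | - | - |
| | FW>SAL (-) | 27 | 31 | 25 | 33 | 26 | 33 | 24 | 34 | 21 | 19 | 20 | 21 | - | - | - | - |
| | binomial p | 0.89 | 0.21 | 0.89 | 0.070^c | 1.0 | 0.070^b | 0.68 | 0.036^b | 0.41 | 0.87 | 0.74 | 0.51 | - | - | - | - |
| | Wilcoxon p | 0.19 | 0.021^b | 0.73 | 0.054^b | 0.48 | 0.11 | 0.63 | 0.087 | 0.48 | 0.60 | 0.80 | 0.65^d | - | - | - | - |
| Invert: 54-85 | SAL>FW (+) | 5 | 8 | 4 | 10 | 5 | 9 | 4 | 10 | 1 | 1 | 2 | 1 | - | - | - | - |
| | FW>SAL (-) | 8 | 5 | 9 | 3 | 8 | 4 | 9 | 3 | 2 | 2 | 1 | 2 | - | - | - | - |
| | binomial p | 0.58 | 0.58 | 0.27 | 0.092^b | 0.58 | 0.27 | 0.27 | 0.092 | 1.0 | 1.0 | 1.0 | 1.0 | - | - | - | - |
| | Wilcoxon p | 0.19 | 0.092 | 0.17 | 0.064 | 0.22 | 0.064^b | 0.15 | 0.077^b | - | - | - | - | - | - | - | - |
| Micro: 86-148 | SAL>FW (+) | 2 | 3 | 2 | 2 | - | - | - | - | - | - | - | - | 2 | 3 | 2 | 2 |
| | FW>SAL (-) | 4 | 3 | 4 | 4 | - | - | - | - | - | - | - | - | 4 | 3 | 4 | 4 |
| | binomial p | 0.69 | 1.0 | 0.69 | 0.69 | - | - | - | - | - | - | - | - | 0.69 | 1.0 | 0.69 | 0.69 |
| Bony fish^e: 8-53 | SAL>FW (+) | 21 | 18 | 23 | 16 | 22 | 16 | 24 | 15 | 15 | 17 | 17 | 16 | - | - | - | - |
| | FW>SAL (-) | 24 | 27 | 22 | 29 | 23 | 29 | 21 | 30 | 21 | 19 | 20 | 21 | - | - | - | - |
| | binomial p | 0.77 | 0.23 | 1.0 | 0.072^c | 1.0 | 0.072^b | 0.77 | 0.036^b | 0.41 | 0.87 | 0.74 | 0.51 | - | - | - | - |
| | Wilcoxon p | 0.23 | 0.027^b | 0.77 | 0.076^b | 0.61 | 0.16 | 0.65 | 0.25 | 0.48 | 0.60 | 0.80 | 0.65^d | - | - | - | - |

^a Sister comparisons as numbered in the second right-most column in Table 2.1 and in Figure 2.2
^b Corrected p-value would be the given p-value multiplied by 2
^c Corrected p-value would be the given p-value multiplied by 3
^d The Wilcoxon signed-rank p-value is for the opposite direction (i.e. FW>SAL or SAL>FW) than the direction of the binomial test
^e Bony fish category p-values were corrected by the same factors as vertebrate p-values (as shown)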


Chapter 3

Rates and patterns of molecular evolution in freshwater vs. terrestrial insects

T. Fatima Mitterboeck1,2, Jinzhong Fu2, Sarah Adamowicz1,2

1Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada 2Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada


ABSTRACT

Insect lineages have crossed between terrestrial and aquatic habitats many times, for both immature and adult life stages. We explore patterns in molecular evolutionary rates between 42 sister pairs of related terrestrial and freshwater insect clades using publicly available protein-coding DNA sequence data from the orders Coleoptera, Diptera, Lepidoptera, Hemiptera, Mecoptera, Trichoptera, and Neuroptera. We furthermore test for habitat-associated convergent molecular evolution in the cytochrome c oxidase subunit I (COI) gene in general and at a particular amino acid site previously reported to exhibit habitat-linked convergence within an aquatic beetle group. While ratios of non-synonymous-to-synonymous substitutions across available loci were higher in terrestrial than freshwater-associated taxa in 26 of 42 lineage pairs, a stronger trend was observed (20 of 31, pbinomial=0.15, pWilcoxon=0.017) when examining only terrestrial/aquatic pairs including fully aquatic taxa. We did not observe any widespread changes at particular amino acid sites in COI associated with habitat shifts, although there may be general differences in selection regime linked to habitat.

Keywords: molecular evolution, habitat transition, relative rates, Insecta, DNA barcoding

INTRODUCTION

Hexapods first colonized land approximately 479 million years ago (Misof et al. 2014). Since then, insects have transitioned into freshwater habitats an estimated 50+ times (Dijkstra et al. 2014), and within those taxa some lineages have reverted to terrestrial habitats. Aquatic insects, here defined as having at least one life stage in water, make up approximately 9% of all described insect species (Foottit and Adler 2009). The vast majority of aquatic insects inhabit freshwaters, while marine insects are rare. Some aquatic insects are closely associated with aquatic habitats, such as living around the margins of water bodies, while others can remain below the surface of water and have various methods of obtaining oxygen, such as via diffusion, use of air bubbles, gills, storage of oxygen in hemolymph, or use of breathing tubes (Merritt and Cummins 1996). Transitions to aquatic life have occurred at various times during the evolutionary history of insects, such that aquatic specialist groups range from order-level taxa to species or species groups that are phylogenetically nested within predominantly terrestrial genera. Habitat shifts occur for both immature (e.g. egg, larval) life stages and, less commonly, for adults. The majority or all species have aquatic larvae in the following orders: Ephemeroptera (mayflies), Odonata (dragonflies and damselflies), Plecoptera (stoneflies), Trichoptera (caddisflies), and Megaloptera (alderflies, dobsonflies, and fishflies). Adult-life-stage habitat shifts (with the larvae inhabiting various environments, but mainly aquatic) have occurred within Coleoptera at least 10 times (Hunt et al. 2007), including some reversions; at least five adult shifts have likely occurred in the Hemiptera as well (Carver et al. 1991). Dipteran lineages have experienced many (20+) terrestrial-to-aquatic larval-stage transitions (plus many reversions) (Chapman et al. 2010; Wiegmann et al. 2011), as have Hymenoptera (wasps, bees, ants) (Bennett 2008) and Lepidoptera (moths and butterflies) (Rubinoff and Schmitz 2009, Foottit and Adler 2009).

There is growing evidence that life-history characteristics and habitats are associated with relative rates of molecular evolution (e.g. Woolfit and Bromham 2003, Foltz 2003, Korall et al. 2010, Mitterboeck and Adamowicz 2013). Freshwater and terrestrial lineages may generally differ in their relative pace of molecular evolution due to various biological or ecological parameters that potentially differ, on average, between these broad habitat categories. Freshwater environments are thought to be more subdivided across a landscape, creating ‘islands’ of life (Grosberg et al. 2012). This may result in a smaller effective population size (Ne) for species inhabiting freshwater environments as compared to terrestrial environments, leading to an increased rate of mainly non-synonymous substitutions (Ohta 1992, Woolfit 2009); this effect is expected to be stronger for mitochondrial than nuclear genes (Neiman and Taylor 2009). However, freshwater environments range in permanence and connectivity, and aquatic insects may have various attributes or adaptations that could enhance their dispersal potential (Dijkstra et al. 2014). Specialist terrestrial species, such as parasitoids with high host specificity (Smith et al. 2007, 2008) or specialized herbivorous insects (Hebert et al. 2004), may also be limited to habitat patches containing suitable host species. Lower predation intensity and higher nutrient supply in freshwater environments (Vermeij and Dudley 2000, Dijkstra et al. 2014) may relax selective constraints on metabolic efficiency, while terrestrial organisms also experience greater thermal stress over space and time than aquatic organisms, whose environment buffers such fluctuations (Lancaster and Downes 2013). On the other hand, limited oxygen supply is one constraint for organisms living beneath the surface of water and may accordingly influence the relative pace of evolution of genes related to obtaining and using oxygen. The shifts to a novel environment may also impact the rate of molecular evolution via initial reduction in Ne, relaxed selective constraints, or positive selection associated with ecological opportunity (e.g. Shen et al. 2009, McMahon et al. 2011). Confounding the investigation of a single habitat, multiple co-occurring parameters, such as feeding method (e.g. hematophagy), lifestyle, latitude, or generation time, can co-vary with either habitat type. Given that these two habitat classes lie along a continuum, and much variability in biological and ecological characteristics exists within each class, organisms inhabiting each habitat may not differ systematically in any measure of molecular evolution. Thus, we conduct a primarily exploratory analysis of habitat-linked trends in molecular evolutionary rates.

In addition to any general difference in relative pace of molecular evolution between terrestrial and freshwater insects, specific molecular changes may be associated with either occupancy of a specific habitat or shifting between habitats. Within water scavenger beetles (Hydrophiloidea), habitat-linked convergent molecular evolution in the cytochrome c oxidase subunit I (COI) sequence was proposed from the observation of a single postulated terrestrial to freshwater shift associated with an amino acid change, along with an amino acid change reversion with the shift back to a fully terrestrial habitat within the clade (Song et al. 2014). However, given that multiple genes with multiple amino acid positions were scanned to suggest this association, it is unclear whether the co-occurrence was due to chance alone.

Here, we compare relative rates of molecular evolution between aquatic and terrestrial insects using 42 phylogenetically independent comparisons (Figure 3.1) from within 7 insect orders, for available protein-coding genes. Secondly, we test for habitat-associated convergent evolution in the COI gene across multiple habitat transitions.

MATERIALS AND METHODS

Collection of data

We searched for published molecular phylogenetic studies using protein-coding genes (those published up to August 2015) of insect groups containing both aquatic and terrestrial species, with habitat information either specified in the source study or obtained elsewhere (source study and sister clade details in Supplementary Material Ch3_S). We defined as aquatic those species having their adult and/or larval life stage living in or on waters; we included taxa with fully aquatic (under water) life stages, but also included semi-aquatic and aquatic-associated species in the ‘aquatic’ category. A single comparison (#31) had aquatic lineages associated with both marine and freshwater environments, but for simplicity we consider the aquatic habitat category as ‘freshwater’. We used phylogenetic relationships from the published phylogenies, with preference for those topologies constructed using a larger number of genes and model-based methods (Hall 2005, Ogden and Rosenberg 2006). Molecular sequences of protein-coding genes were obtained from GenBank using the accession numbers provided in the source studies. We additionally sought out COI sequences for the same taxa through the Barcode of Life Data Systems (BOLD; Ratnasingham and Hebert 2007) or GenBank, when the original source study did not include that gene.

Sequences were aligned in MEGA version 6 (Tamura et al. 2013) by translating in frame into amino acids and then aligning using the Muscle function with default settings. The sequences were verified to be free of stop codons or gaps of 1-2 base pairs. We chose an equal number of species from each sister clade within a terrestrial/freshwater pair, in a phylogenetically dispersed fashion (Robinson et al. 1998), to reduce node density effects (Hugall and Lee 2007, Bromham et al. 2015). Molecular rates for the clades of interest were often estimated within the context of a larger tree having multiple sister clades, and the number of clade pairs was maximized such that the connections through branches among differing pairings were not overlapping, thus maintaining phylogenetic independence of sister lineages. Where variability in larval habitat occurred within adult-aquatic clades (in one case), we gave preference to include all species in the adult-stage transitions, regardless of habitat occupancy of the immature stage, in order to increase the sample size for that adult-stage sister pairing. Data were not included where taxonomic groups contained high variability in habitat state among related lineages yet limited species sampling on the phylogeny, such that we could not hypothesize well the direction of transitions in habitat state (e.g. we initially investigated but did not include here the coleopteran groups Curculionidea and Luciolinae nor the dipteran groups Tipulidae and Syrphidae).
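The verification that aligned sequences are free of stop codons and of 1-2 base pair gaps could be automated along the following lines. This is a minimal sketch rather than the procedure actually used (the check above was made in MEGA); the function names are hypothetical, and the choice of stop-codon set per gene (invertebrate mitochondrial code for mitochondrial genes, standard code for nuclear genes) is an assumption of this illustration.

```python
import re

# Stop codons under NCBI translation table 5 (invertebrate mitochondrial)
# and table 1 (standard). In table 5, TGA encodes tryptophan.
STOPS_INVERT_MITO = {"TAA", "TAG"}
STOPS_STANDARD = {"TAA", "TAG", "TGA"}


def stop_codon_positions(seq: str, stops: set) -> list:
    """Return 0-based codon indices of in-frame stop codons.

    Assumes the sequence starts in frame (codon position 1).
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return [i // 3 for i in range(0, usable, 3) if seq[i:i + 3] in stops]


def short_gap_runs(aligned_seq: str) -> list:
    """Return (start, length) for every gap run of 1 or 2 bases.

    Such gaps would shift the reading frame of a protein-coding alignment.
    """
    return [
        (m.start(), len(m.group()))
        for m in re.finditer(r"-+", aligned_seq)
        if len(m.group()) in (1, 2)
    ]
```

An empty list from both functions indicates that a sequence passes the two checks under the stated assumptions.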


1) Analysis of relative molecular evolutionary rates

a) Estimation of relative rates for freshwater and terrestrial lineages

The relative rate of molecular evolution in each pairing of freshwater and terrestrial lineages was estimated; here, we included only habitat shifts occurring within insect orders, as multiple confounding factors are less likely to differ among more closely related lineages. Rates were estimated separately for each gene. Non-synonymous-to-synonymous substitution (dN/dS) ratios were obtained for terrestrial and sister aquatic clades using the package PAML version 4.4 (Yang 2007) program codeml. Non-synonymous substitution rates (dN) and synonymous substitution rates (dS) for the clades of interest were also calculated separately using a Python script to perform the calculation using the PAML output files (Chapter 2). dN, dS, and dN/dS ratios were concatenated per sister pair across genes by using estimated counts of substitutions (calculated by rates and sequence lengths). Total branch lengths were similarly calculated using counts of substitutions by these estimates. Gene data were summarised altogether across each sister pair, as well as separately for mitochondrial and nuclear genes.
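The summarising of rates across genes via substitution counts can be illustrated as follows. This is a hedged sketch, not the Python script referred to above: the class and function names are hypothetical, it assumes that the numbers of non-synonymous and synonymous sites per gene are taken from the codeml output, and the pooled branch length (all substitutions over all sites) is one plausible reading of the text.

```python
from dataclasses import dataclass


@dataclass
class GeneEstimate:
    dn: float       # non-synonymous substitutions per non-synonymous site
    ds: float       # synonymous substitutions per synonymous site
    n_sites: float  # number of non-synonymous sites (assumed from codeml)
    s_sites: float  # number of synonymous sites (assumed from codeml)


def pool_across_genes(genes: list) -> dict:
    """Combine per-gene rates for one lineage by summing estimated counts."""
    n_subs = sum(g.dn * g.n_sites for g in genes)  # non-synonymous counts
    s_subs = sum(g.ds * g.s_sites for g in genes)  # synonymous counts
    n_total = sum(g.n_sites for g in genes)
    s_total = sum(g.s_sites for g in genes)
    dn = n_subs / n_total
    ds = s_subs / s_total
    return {
        "dN": dn,
        "dS": ds,
        "dN/dS": dn / ds if ds > 0 else float("nan"),
        # Assumption: total branch length as substitutions per site overall.
        "branch_length": (n_subs + s_subs) / (n_total + s_total),
    }
```

Weighting by site counts in this way lets longer genes contribute proportionally more substitutions than shorter ones, rather than averaging per-gene ratios directly.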

b) Analysis of habitat-linked patterns in molecular rates across sister pairs

The data were first examined for suitability for inclusion in binomial (and also Wilcoxon signed-rank) tests by testing for significant negative correlations between the rate differences and the square root of the sum of branch lengths for the pairs; application of the exclusion criterion recommended by Welch and Waxman (2008) enables the removal of low-information pairs. We analysed whether relative molecular rates were higher in the freshwater or terrestrial environment by binomial test across the sister pairs, with the null expectation of 50% of pairs being higher in each habitat. The data were next examined for suitability for the Wilcoxon signed-rank test by checking for positive correlations between the rate difference for each pair standardised by the square root of the sum of branch lengths (Garland et al. 1992) and the standardisation factor. In the case of significant positive correlations, comparisons with the largest square root of the sum of branch lengths were removed until the data fit the specifications (this occurred for dS and total branch lengths). Wilcoxon signed-rank tests were performed on the standardised rate differences. All tests were repeated with the mitochondrial and nuclear genes separately.
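The two calculations at the core of this step can be sketched as below: an exact two-tailed binomial (sign) test against a 50:50 null, and the standardisation of a rate difference. This is an illustrative sketch using only the standard library; the function names are hypothetical, and reading the standardisation factor as the square root of the summed branch lengths of the two lineages is an assumption.

```python
from math import comb, sqrt


def binomial_two_tailed(n_higher_a: int, n_higher_b: int) -> float:
    """Exact two-tailed binomial test with null probability 0.5.

    n_higher_a and n_higher_b are the numbers of sister pairs in which
    habitat A or habitat B, respectively, has the higher rate.
    """
    n = n_higher_a + n_higher_b
    k = min(n_higher_a, n_higher_b)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null is symmetric, so double the smaller tail (capped at 1).
    return min(1.0, 2 * lower_tail)


def standardised_difference(rate_a: float, rate_b: float,
                            branch_a: float, branch_b: float) -> float:
    """Rate difference divided by sqrt of the summed branch lengths (assumed form)."""
    return (rate_a - rate_b) / sqrt(branch_a + branch_b)
```

For example, `binomial_two_tailed(26, 16)` would test a split of 26 vs. 16 pairs, and the standardised differences would be the input to a Wilcoxon signed-rank test.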

A subset of the data was created of sister pairs that included only the organisms that enter or can live below water, for at least one life stage, as these might be expected to show the strongest habitat-linked trends as compared to aquatic-associated or semi-aquatic species. Lastly, the direction of transition was considered, with history of transitions obtained from the source studies. Relative rates were compared between sister pairs with aquatic->terrestrial vs. terrestrial->aquatic shifts using binomial tests within each of these categories separately and with a Fisher’s exact test between the two categories.
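The between-category comparison can be illustrated with a two-tailed Fisher’s exact test on a 2×2 table in which rows are the two transition directions and columns are the counts of pairs with the terrestrial or the freshwater lineage higher. The sketch below uses only the standard library; the function name and table layout are assumptions of this illustration.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

The same function applies wherever two proportions of counts are compared between habitat categories.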

2) Exploration of convergence associated with habitat in COI

The clades compared for evidence of habitat-associated convergence were those used in the above sister pair analysis of relative rates plus an additional five taxonomic comparisons spanning sister clades at or above the order level; these were included since amino acid variation is more constrained than nucleotide variation. Out of these 47 possible sister pairs, 44 were represented by COI data. A global alignment of COI sequences was created. The topologies of all constituent taxonomic groups were combined into a single tree (‘compiled tree’) with the order-level backbone topology informed by Misof et al. (2014), similar to Fig. 3.1 here, but with the addition of some COI data to represent multiple lineages in aquatic and terrestrial orders that were not previously tested (Supplementary Material Ch3_S Table Ch3_S_3). The ancestral amino acid sequence for each node on the tree, including terrestrial-aquatic lineage pairings, was obtained using FastML (Ashkenazy et al. 2012) with the JTT (default) model of substitution. Substitutions in the tip lineages that differed from the ancestral condition in the aquatic or terrestrial lineages were recorded. In cases of multiple tip lineages present within a sister clade, a single tip sequence was needed for the subsequent tabulation of convergent amino acid changes; therefore, the most recent common ancestor of the multiple tip lineages was used as the ‘tip’, and this was compared with the common ancestor of the terrestrial-aquatic sister lineages. Convergent substitutions are here defined as those where the ancestral condition was not the same amino acid but the end state is the same among independent lineages, while parallel substitutions have both the same ancestral and end-state amino acids across independent lineages.
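The distinction between convergent and parallel substitutions defined above can be expressed compactly for a single alignment site in two independent lineages. The function below is a minimal sketch with a hypothetical name; it takes the reconstructed ancestral and end-state amino acids for each lineage.

```python
from typing import Optional


def classify_shared_change(anc_1: str, end_1: str,
                           anc_2: str, end_2: str) -> Optional[str]:
    """Classify a site in two independent lineages.

    Returns "parallel" if both lineages changed from the same ancestral
    amino acid to the same end state, "convergent" if they reached the
    same end state from different ancestral amino acids, and None if
    either lineage did not change or the end states differ.
    """
    both_changed = anc_1 != end_1 and anc_2 != end_2
    if not both_changed or end_1 != end_2:
        return None
    return "parallel" if anc_1 == anc_2 else "convergent"
```

For instance, two lineages that each changed from isoleucine to valine would be scored as parallel, whereas isoleucine to valine in one lineage and leucine to valine in the other would be scored as convergent.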

a) Test of similar amino acid changes across all freshwater or all terrestrial lineages

Each amino acid site in COI was examined across the entire set of 44 comparisons for evidence of sites showing convergent or parallel changes in either the terrestrial lineages or the aquatic lineages. Amino acids were categorised into five physico-chemical categories (as in Bromham 2016). At each amino acid site in the alignment, the number of changes across all aquatic lineages (from each associated ancestor) to a particular amino acid or amino acid category was compared to the number of other changes at that site in the aquatic lineages, as a proportion. This was repeated for terrestrial lineages alone. These proportions at each site for each amino acid or amino acid category were compared between aquatic and terrestrial lineages, with a 2-tailed Fisher’s exact test. This was repeated for terrestrial->aquatic and aquatic->terrestrial comparisons separately, to evaluate convergent or parallel changes associated with shift to the new aquatic or terrestrial habitat, respectively. Here, only differences detected in the environment of the direction of shift were considered; thus, a 1-tailed Fisher’s exact test was used. Raw p values are provided and were also corrected with the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) (False Discovery Rate of 0.05) based on the total number of sites with changes.
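The Benjamini-Hochberg correction applied to the per-site p values can be sketched as follows. This is a generic implementation of the step-up procedure with a hypothetical function name, assuming that the list passed in contains one raw p value for each site with changes.

```python
def benjamini_hochberg(p_values: list) -> list:
    """Return Benjamini-Hochberg adjusted p values in the input order.

    Each raw p value is multiplied by m / rank (m = number of tests,
    rank = position when sorted ascending), and adjusted values are
    forced to be non-decreasing with rank.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A site would then be retained at a False Discovery Rate of 0.05 if its adjusted value is at most 0.05.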

b) Test of amino acid sites under positive selection

We identified amino acid sites in COI with evidence of positive selection among multiple lineages (as in Besnard et al. 2009) where the transition between habitats is likely to have occurred. Given that substitutions can be similar among multiple independent lineages by chance, if numbers of convergent changes exceed this expectation, it is likely that their existence is caused by positive selection (Graur and Li 2000). The inclusion of lineages within the ‘compiled tree’ was modified for this analysis depending on the transition direction by eliminating sub-clades representing habitat reversions within the clade of interest, or minimizing lineages of those clades in the case of a subsequent re-reversion (to the habitat of interest). These two trees with branches labeled are provided in Supplementary Materials Ch3_S.


Branch-site models (Zhang et al. 2005) of positive selection in PAML were used. Twenty-nine separate internal or tip lineages where terrestrial->aquatic transitions were reconstructed were labeled to the same ‘foreground’ class, with all other branches as background. In a second tree, 16 internal or tip lineages were labeled where aquatic->terrestrial transitions were postulated. Note COI data were not available for all transitions marked in Fig. 3.1. Since the taxa included (all insects) are diverse, and differing regions of COI were available for different taxonomic groups, a strict cut-off for indication of positive selection was not applied. Instead, amino acid sites under the highest probability of positive selection—here with a Bayes Empirical Bayes (BEB) (Yang et al. 2005) value greater than 0.5—were further considered. Among these, the number of changes to the same amino acid or amino acid category was tallied across the independent aquatic and terrestrial lineages, with the lineages also grouped by direction of transition.

c) Examination of Song et al. (2014) amino acid changes in COI (3’ end) associated with habitat

We tallied the changes to the same amino acid across the independent aquatic or terrestrial lineages at the COI position proposed by Song et al. (2014). The position is located at 1381-1383 (amino acid site 461) in the 1531 total base pairs of the COI gene of Tropisternus sp. (Insecta, Coleoptera, Hydrophilidae) (GenBank Accession # NC_018349). All sister pairs with that region of COI available were considered, with the exception of those pairs derived from Song et al. (2014), as these formed the basis of the hypothesis tested. In this study we report all amino acid sites according to this same reference sequence.
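The correspondence between nucleotide positions 1381-1383 and amino acid site 461 follows from simple codon arithmetic, sketched below. The function names are hypothetical, and the sketch assumes 1-based coordinates with the reading frame starting at position 1 of the reference sequence.

```python
def amino_acid_site(nt_position: int) -> int:
    """Map a 1-based nucleotide position to its 1-based amino acid site."""
    if nt_position < 1:
        raise ValueError("positions are 1-based")
    return (nt_position - 1) // 3 + 1


def codon_nt_range(site: int) -> tuple:
    """Return the (first, last) 1-based nucleotide positions of a codon."""
    if site < 1:
        raise ValueError("sites are 1-based")
    return (3 * site - 2, 3 * site)
```

Under these assumptions, `codon_nt_range(461)` gives `(1381, 1383)`, matching the position stated above.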

d) Test of convergence in pairs of aquatic lineages

Finally, the program CONVERG2 (Zhang and Kumar 1997) (JTT model) was used to test pairs of phylogenetically independent aquatic lineages for evidence of higher levels of convergent evolution than would be expected by chance. Not all possible pairwise combinations of aquatic lineages with COI data were run. Rather, specific pairs of aquatic lineages were tested, selected with the following criteria: 1) the aquatic lineages represent terrestrial->aquatic transitions, 2) aquatic lineages occur within the same insect order and have similar taxonomic rank (e.g. family compared to a family), 3) the aquatic clades possess the same or a similar number of tip lineages (single-tip comparisons paired only with single-tip comparisons, etc.), 4) the available shared COI sequence length is at least 200 amino acids, and 5) each pair of aquatic lineages together displays a minimum of 8 changes from their respective ancestors. For each pair of aquatic lineages chosen, the phylogeny of the entire taxon group (e.g. order, or genus) containing both transitions was used in the CONVERG2 test. For comparison, the corresponding pairs of sister terrestrial lineages were tested in the same way. All input and output files (e.g. alignments) for all analyses are available on the University of Guelph Research Data Repository.
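The five selection criteria for lineage pairs can be written as a single filter, shown here as a hedged sketch. The `Lineage` class, its fields, and the `max_tip_difference` parameter are hypothetical; in particular, "similar" rank is simplified to identical rank, and the tolerance for a "similar" number of tips is left as an adjustable assumption.

```python
from dataclasses import dataclass


@dataclass
class Lineage:
    name: str
    order: str      # insect order, e.g. "Coleoptera"
    rank: str       # taxonomic rank of the clade, e.g. "family"
    n_tips: int     # number of tip lineages in the clade
    direction: str  # "terrestrial->aquatic" or "aquatic->terrestrial"
    n_changes: int  # amino acid changes from the reconstructed ancestor


def eligible_pair(a: Lineage, b: Lineage, shared_length_aa: int,
                  max_tip_difference: int = 0,
                  min_length_aa: int = 200,
                  min_changes: int = 8) -> bool:
    """Apply criteria 1-5 to a candidate pair of aquatic lineages."""
    return (
        a.direction == b.direction == "terrestrial->aquatic"   # criterion 1
        and a.order == b.order and a.rank == b.rank            # criterion 2
        and abs(a.n_tips - b.n_tips) <= max_tip_difference     # criterion 3
        and shared_length_aa >= min_length_aa                  # criterion 4
        and a.n_changes + b.n_changes >= min_changes           # criterion 5
    )
```

Pairs passing the filter would then be the ones submitted to the convergence test, together with the phylogeny of the taxon group containing both transitions.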

RESULTS

1) Relative rates

Forty-two independent sister taxon pairs (Fig. 3.1), each having sequences for 1 to 17 protein-coding genes available, were analysed. These 42 comparisons represented a total of 148 pairs of terrestrial-freshwater estimated relative rates for single genes. Terrestrial lineages had higher dN/dS ratios than their paired freshwater lineages in 26 cases, while the freshwater sister lineages had higher dN/dS ratios than the terrestrial lineages in 16 cases (pbinomial [pb]=0.16, pWilcoxon [pW]=0.056) (Figure 3.2). The median larger/smaller dN/dS ratio was -1.31. Neither mitochondrial dN/dS ratios alone (23 vs. 16, pb=0.34, pW=0.41) nor nuclear dN/dS ratios alone

(16 vs. 9, pb=0.23, pW=0.061) were significantly higher in terrestrial lineages. Neither non-synonymous substitution rates (dN), synonymous substitution rates (dS), nor overall branch lengths differed significantly between habitat categories, whether for all genes together or within the nuclear and mitochondrial gene categories alone (full results in Supplementary Material Ch3_S). We also did not detect any significant difference in relative ratios based on transition directionality: the proportion of pairs in which the terrestrial lineage had the higher dN/dS ratio did not differ significantly between the freshwater->terrestrial direction (9 pairs with the terrestrial lineage higher vs. 7 pairs with the freshwater lineage higher) and the terrestrial->freshwater direction (17 vs. 9) (Fisher’s exact p=0.74).
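The count-based comparisons reported above can be re-derived from the counts alone. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the study's own software for these tests is not stated in this section. The Wilcoxon signed-rank test is omitted because it needs the per-pair rate values rather than counts.

```python
from math import comb


def sign_test_two_sided(n_a, n_b):
    """Exact two-sided binomial (sign) test of n_a vs. n_b 'wins' under p = 0.5."""
    n = n_a + n_b
    k = max(n_a, n_b)
    # Upper tail P(X >= k); doubling it is exact for the symmetric p = 0.5 case.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as extreme (no more probable) than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

For example, `sign_test_two_sided(26, 16)` corresponds to the overall terrestrial vs. freshwater comparison, and `fisher_exact_two_sided(9, 7, 17, 9)` to the comparison of transition directions.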


The trend of higher dN/dS ratios in terrestrial environments became stronger when aquatic-associated and semi-aquatic comparisons were removed, leaving only those comparisons including organisms that enter or live below water in at least one life stage. Almost two-thirds of dN/dS ratios were higher in the terrestrial environment (20 vs. 11, pb=0.15, pW=0.017)

(blue/plain bars in Fig. 3.2), which was similar for dN (22 vs. 9, pb=0.030, pW=0.091), while dS were equally often higher in both habitat types (16 vs. 15, pb=1.0, pW=0.28 [freshwater direction]).

2) Convergence in COI

a) Three amino acid sites had suggestive differences (uncorrected 2-tailed p=0.048 to 0.088) in their frequencies of amino acid or amino acid category changes between freshwater and terrestrial lineages (sites 170, 461, and 480 in the reference sequence). Similarly, this was true for 4 amino acid sites (uncorrected 1-tailed p=0.048 to 0.071) across comparisons from the terrestrial->freshwater direction only (sites 334, 354, 413, and 480). However, none of these differences remained statistically significant after correction for the number of sites tested. The most extreme difference (in terms of counts) appeared to be one site (site 354) that had 5 independent postulated changes to a ‘V’ across 21 lineages that entered the freshwater environment, while the terrestrial lineages inhabiting the ancestral habitat type exhibited 2 changes to an ‘I’. Full results are provided in Supplementary Material Ch3_S.

b) Ten sites (Table Ch3_S_5) had high probabilities of positive selection with the transition to aquatic habitat, compared to just one site for the transition to terrestrial habitat. Of these sites, there was a maximum of 3 more changes to the same amino acid in the new environment than was reconstructed in the lineages that were postulated to have stayed in the same environment.

c) The habitat-linked change previously reported in Hydrophiloidea beetle COI sequences (Song et al. 2014)––of changes to ‘A’ amino acids in terrestrial and ‘S’ or ‘G’ amino acids in freshwaters––was not consistently detected when using a broad phylogenetic range of terrestrial

vs. freshwater paired lineages. The target region of COI was available for 31 freshwater-terrestrial clade pairings (excluding the Song et al. 2014 data); among these, 5 differences were observed at this site from their reconstructed ancestral sequences to the freshwater or terrestrial tips. A single change to a ‘G’ (from S) corresponded with one shift to freshwaters, while changes to both ‘A’ and ‘S’ occurred within terrestrial clades, but not in conjunction with shifts to terrestrial habitat (i.e. these changes occurred within a background of terrestrial ancestry). Both freshwater and terrestrial taxa exhibited ‘G’, ‘S’, and/or ‘A’ amino acids at that position.

d) Seven non-overlapping pairs of independent freshwater lineages were tested for greater numbers of convergent or parallel substitutions than would be expected by chance: three pairs within Coleoptera (sister pairs 6 vs. 10, 5 vs. 9, and 7 vs. 11), two pairs within Diptera (27 vs. 30, and 24 vs. 29), one within Lepidoptera (15 vs. 17), and one within Hemiptera (1 vs. 3). Each of these four taxonomic groups had one comparison with greater levels of convergence or parallelism than was expected by chance in the aquatic lineages (all with p<0.01; full details in Supplementary Material Ch3_S Table Ch3_S_6). This included one convergent change in each of Lepidoptera and Coleoptera (pair 5 vs. 9), 3 parallel changes in Diptera (pair 27 vs. 30), and 2 parallel and 2 convergent changes in Hemiptera. However, two of these seven pairs also had significant results for the sister terrestrial lineages.

DISCUSSION

Based on multiple phylogenetic contrasts, neither freshwater nor terrestrial insects had higher rates of molecular evolution significantly more often across all comparisons. However, terrestrial lineages had significantly higher dN/dS ratios than freshwater lineages (Wilcoxon test) when semi-aquatic and aquatic-associated comparisons were removed.


Possible reasons for lack of trends

Some methodological factors could lead to lack of observable trends in relative rates. Low sampling of species compared to true diversity—as is common in insect taxa—could potentially cause habitat shifts to be reconstructed as more ancient than they actually are. Similarly, a portion of the postulated habitat history of a lineage of interest may belong to the other habitat category, if the true closest sister of a contrasting habitat state was not available for analysis or if paraphyletic lineage pairs were used. These occurrences could lead to the underestimation of rate differences between lineages belonging to separate habitat categories. Finally, any errors in phylogenetic reconstruction, as compared to actual lineage history, could introduce noise into the dataset (bootstrap values for nodes of interest given in Supplementary Material Ch3_S); indeed, multiple phylogenetic hypotheses were often available. However, we do not think measured rates would be biased toward any particular habitat category unless there was strong sequence convergence or rate acceleration based on habitat causing systematic, incorrect phylogenetic reconstruction. Overall, we do not expect the methodology to cause a directional bias in rates, while we acknowledge that rate differences between categories may be underestimated here.

Fit of relative molecular evolutionary rates with original hypotheses

Hypothesis 1: effective population size

The more often higher dN/dS ratios in terrestrial lineages do not support generally smaller effective population size (Ne) in freshwater-living or freshwater-associated lineages. Even considering possible counter-acting influences on mitochondrial dN/dS ratios in freshwaters (Hypothesis 2, below), the nuclear genes still more often exhibit higher ratios in terrestrial environments. The results suggest that consistent differences in Ne may not exist, or if so, that terrestrial organisms more commonly have smaller Ne. Insects from primarily terrestrial clades are hypothesized to have more specialised herbivorous diet breadth than insects from primarily freshwater clades, and this has been supported empirically (e.g. Cronin et al. 1998). Thus, the results could reflect narrower food source range in terrestrial than aquatic realms, despite more extensive overall biome distribution. We did not directly measure any biological or

ecological parameters of organisms in these environments. Large barcoding datasets, complemented by other markers, could be used to estimate relative Ne among terrestrial and freshwater insects and to help tease apart influences from selection on COI specifically vs. Ne effects operating genome-wide.

Hypothesis 2: metabolic efficiency and oxygen use

The genetic data included in this study were a small and biased portion of the genome, in that the genes were obtained from source studies that conducted phylogenetic tree reconstruction. These data were obtained with the intention of addressing trends in molecular evolutionary rates genome-wide (e.g. effective population size), rather than for detecting specific genes or gene types that are under positive selection linked to habitat. However, mitochondrial genes are not strictly neutrally evolving (Ballard and Kreitman 1995) and may specifically be under selection relating to occupancy of freshwater habitats, and thus are further considered here.

We expected that mitochondrial oxidative phosphorylation (OXPHOS) genes might possess lower dN/dS ratios in aquatic lineages––especially for those organisms living below water––due to tighter constraints in freshwaters relating to oxygen availability and hence purifying selection to maintain metabolic efficiency. However, the support for this hypothesis in our data is at most weak: although the habitat association was stronger for overall dN/dS ratios and also dN with the inclusion of only fully aquatic species, the mitochondrial dN/dS ratios were not consistently lower in freshwater lineages. This contrasts with apparent molecular trends in mitochondrial genes between organisms with more consistently and presumably highly different energy requirements, such as flightless insects vs. flying insects (Mitterboeck and Adamowicz 2013) or more sedentary vs. locomotive fishes (Strohm et al. 2015). However, our intention was to look for trends based on broad habitat category itself; the use of empirically measured low- vs. high-oxygen environments would test this hypothesis more directly.


Molecular convergence in COI based on habitat

Molecular convergence at sites in COI may be expected if aquatic environments were consistently associated with a difference in oxygen levels or metabolic activities as compared with terrestrial organisms. However, this is not reflected in the molecular rates results (discussed above) or results of convergence tests. Specific amino acid changes associated with high energy expenditure in lower-oxygen conditions have been observed in other genes of the cytochrome c oxidase complex, such as in the COIII gene in high-altitude bar-headed geese (Scott et al. 2011). While some degree of greater convergence in aquatic than terrestrial lineages was observed, particular amino acid changes in COI were not prevalent across most comparisons here (5 of 21+ comparisons at most). Thus, the observed changes are likely not key to the successful habitat transition, as might be expected in the case of genes that are directly and necessarily linked with a specific organismal trait (e.g. Besnard et al. 2009). Furthermore, the most extreme difference in terms of counts of changes observed (between ‘I’ and ‘V’) was between amino acids in the same physico-chemical category and at a site with low amino acid diversity across the lineages, and thus likely reflects exchangeability rather than adaptation. Using genomic data, more biological information on the genetic basis of habitat shifts could firstly be obtained through consideration of convergence and similarities in gene types under differential selection pressures between habitats (e.g. Foote et al. 2015).

We secondly did not observe support for the proposed habitat-convergent amino acid site in water scavenger beetles when examining a range of insect taxa. Although the particular amino acid site is in a COI region expected to be slower evolving (transmembrane region of the protein) (Tourasse and Li 2000, Morrill et al. 2014), the three amino acids observed at that position (S, G, and A) possess similar physico-chemical properties (small, non-polar amino acids) (Bromham 2016), and the codons are separated by single nucleotide changes. These were the only three amino acids present at that position in the alignment of over 200 phylogenetically diverse insect species. The lack of the same observation in other taxa here suggests the original observation may not be due to adaptation relating to habitat shifts, although there could be multiple molecular changes that would produce the same functional result. Statistically incorporating baseline substitution frequencies into future convergence analyses would improve our ability to detect functionally relevant convergence based on habitat. In addition, functional studies would

provide an independent perspective on the biological significance of any postulated amino acid changes linked to habitat.

Lotic vs. lentic freshwater habitats considered

Within the freshwater realm, molecular evolution may vary in predictable ways with the type of environment (lotic vs. lentic). Lotic insects, i.e. those inhabiting running water, have higher habitat stability and evolutionary longevity (Dijkstra et al. 2014), smaller geographic range size on average (Ribera and Vogler 2000), and less-dispersive flying adult forms than lentic insects, i.e. those living in standing water (Ribera et al. 2001, Dijkstra et al. 2014). These differences in dispersal tendencies have been observed to lead to greater population differentiation in running water than standing water (Marten et al. 2006). Indeed, higher lineage-level molecular rates have been observed in lotic than lentic species of water beetles, thought to be due to smaller Ne in the lotic species (Fujisawa et al. 2015). Most of the freshwater clades included in our analysis were of mixed habitat designations but predominantly lotic. Thus, it is possible that trends in dN/dS ratios observed here, if due to relative differences in Ne, may be stronger for lentic than lotic freshwater insects as compared with their terrestrial relatives.

Parallels between aquatic habitat and other biological and ecological traits

Another area for further investigation would be to analyze the correlation between relative rates and additional biological parameters that display parallels with the transition to an aquatic lifestyle. For example, endoparasitoids could also be considered as ‘aquatic’, and some have similar methods of obtaining oxygen as aquatic larvae––such as the use of tunnels to the outside of the host organism (e.g. Tachinidae fly endoparasitoid larvae), akin to breathing tubes in some aquatic larvae (e.g. fly Eristaline larvae) (Marshall 2012). Although other parameters common to endoparasites are likely not common to freshwater insects generally, comparing the two groups may be helpful to discern common genetic patterns associated with ‘aquatic’ lifestyle separate from freshwater aquatic habitat.


Moreover, there is undoubtedly much variation in terms of biology, ecology, and molecular evolution among lineages within both the freshwater and terrestrial habitat categories, such as different feeding styles (herbivorous, predatory, and parasitoid) and dispersal tendencies. Future studies may incorporate multivariate analysis (e.g. Bromham et al. 2015) to reduce noise in trends of molecular evolution linked to individual characteristics of lineages.

Implications

The gene with the highest availability across the sister comparisons was COI (in 38 sister pairs below the order rank), regions of which are often employed in identifying specimens to the species level as well as delineating sequences into species-like units with the use of DNA barcoding (Hebert et al. 2003, Ratnasingham and Hebert 2013). Although some interesting trends in dN/dS ratios and dN were detected, overall nucleotide models (for which we observed no habitat-linked pattern) are more commonly used for molecular operational taxonomic unit (MOTU)-based delineation. Thus, no adjustment to insect MOTU delineation on the basis of habitat occupancy alone needs to be recommended based upon our results presented here.

Conclusions

This study provides an overview of molecular rate differences and molecular patterns relating to freshwater vs. terrestrial habitat across a wide range of insect taxa. The subtle habitat-linked trends (higher dN/dS ratios and dN in terrestrial lineages) observed here using a limited number of genes may be indicative of stronger trends when considering molecular rates and patterns genome-wide.

ACKNOWLEDGEMENTS

We thank Stephen Marshall for input on the biology and systematics of the taxa; Patrick Schmitz and Daniel Rubinoff for providing information on their study; Tzitziki Loeza-Quintana and Robert Young for input on an earlier draft of the manuscript; two anonymous reviewers for

their helpful comments; and all the researchers who work to produce and make their genetic data publicly available. This work was supported by the University of Guelph (Integrative Biology PhD Award and Dean’s Tri-council Scholarship to T.F.M.), the Government of Ontario (Ontario Graduate Fellowship to T.F.M.), and by the Natural Sciences and Engineering Research Council of Canada (Alexander Graham Bell Canada Graduate Scholarship to T.F.M., Discovery Grants 400479 to J.F. and 386591-2010 to S.J.A.).

LIST OF SUPPLEMENTARY MATERIAL

SM Ch3_S. (MS Excel). Summary of molecular evolutionary rates and COI convergence analysis

Table Ch3_S_1. Source study and habitat information on the 42 sister pairs analysed for relative rates, all rates for each sister clade and gene, and binomial and Wilcoxon signed-rank testing

Table Ch3_S_2. GenBank accession numbers for sequences not obtained from source studies

Table Ch3_S_3. List of species used in COI convergence tests and reconstructed ancestral states

Table Ch3_S_4. Postulated amino acid changes in aquatic and terrestrial lineages and Fisher’s exact tests for convergence at each site

Table Ch3_S_5. Positive selection sites summary

Table Ch3_S_6. Aquatic lineage pairs used in CONVERG2 analysis


Figures and Tables for Chapter 3



Figure 3.1. Composite phylogeny including aquatic and terrestrial lineages used in analysis. Blue, or lighter-toned, lineages represent species living in or that are associated with aquatic environments at the larval life stage. Where indicated in the order names with ‘*’, the taxa generally live in aquatic habitats at the adult life stage as well, with some variation in larval habitat category existing within Coleoptera. Labels after species names, for both aquatic (AQ) and terrestrial (TER) lineages, indicate the sister comparison number as matching with Fig. 3.2. Unlabeled branches are either terrestrial or mixed in habitat state and were included as outgroups. The habitat state of internal blue/lighter or black/darker lineages are the postulated ancestral states from the source studies or are hypothesized here via the parsimony criterion. Lineages representing terrestrial to aquatic habitat transitions are labeled with a blue circle on the

lineage, while aquatic to terrestrial habitat transitions are labeled with a brown square. Not all possible habitat transitions are shown, only those included in at least one of the analyses conducted in this study. Taxa labels: Neur. = Neuroptera, Trichopt. = Trichoptera, Mc = Mecoptera. Source studies used for phylogenetic topologies: Insecta backbone and Mecoptera: Misof et al. 2014; Coleoptera: Bernhard et al. 2006, Hunt et al. 2007, Song et al. 2014; Diptera: Brammer and von Dohlen 2007, Wiegmann et al. 2011, Curler and Moulton 2012, Chapman et al. 2012; Trichoptera: Hayashi et al. 2008; Lepidoptera: Regier et al. 2012, Rubinoff and Schmitz 2010; Hemiptera: Li et al. 2012; Neuroptera: Aspöck et al. 2012. Complete information on the source studies used for topologies and habitat information, and sources of molecular data, is provided in Supplementary Material Ch3_S.


Figure 3.2. Relative aquatic:terrestrial (AQ:TER) dN/dS ratios across 42 sister comparisons. Relative dN/dS ratios are displayed as 1 minus the smaller dN/dS ratio over the larger dN/dS ratio, signed based on which habitat had the larger ratio (AQ>TER is positive, TER>AQ is negative). The overall dN/dS ratios (bars) were higher in the terrestrial clade in 26 comparisons and higher in the aquatic clade in 16 comparisons (pbinomial=0.16, pWilcoxon=0.056). Aquatic type designations (colours of bars): Blue/plain = below water, when at least one life stage can live/enter below water; Green/diagonal stripes = semi-aquatic, when the term ‘semi-aquatic’ or ‘wet’ was used to describe the taxa or habitat, respectively; Purple/horizontal stripes = aquatic-associated taxa, when the habitat was described as marginal (e.g. river banks) or organisms are surface dwelling. A general classification of aquatic type was given (below, semi, or associated), but variation in aquatic type existed within the aquatic clades; the comparison was labeled based on the specific taxa used in analysis and the most detailed habitat information obtained, with greater weight in assigning the category given to those taxa with the greatest

amount of genetic data available (details provided in Supplementary Material Ch3_S). Taxa labels: Hemipt. = Hemiptera, Nr = Neuroptera, Tr = Trichoptera, Mc = Mecoptera.
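The signed relative ratio described in the caption of Figure 3.2 can be written out explicitly. This is a minimal Python sketch of that verbal definition; the function name is hypothetical, and the handling of equal or zero ratios is an assumption, since the caption does not address those cases.

```python
def signed_relative_ratio(aq, ter):
    """1 - smaller/larger, positive when AQ > TER and negative when TER > AQ."""
    if aq < 0 or ter < 0:
        raise ValueError("dN/dS ratios must be non-negative")
    if aq == ter:
        # Assumption: identical ratios (including both zero) plot as 0.
        return 0.0
    larger, smaller = max(aq, ter), min(aq, ter)
    magnitude = 1.0 - smaller / larger
    return magnitude if aq > ter else -magnitude
```

The value is bounded between -1 and 1, so comparisons with very different absolute ratios can share one axis.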


Chapter 4

Positive and relaxed selection in insect transcriptomes associated with the evolutionary gain and loss of flight

T. Fatima Mitterboeck*1,2, Shanlin Liu*3,4, Rui Zhang3, Wenhui Song3, Karen Meusemann5, Jinzhong Fu2, Sarah J. Adamowicz1,2, Xin Zhou3,6,7

*planned shared first authorship and S.L. placed first for publication.
1 Biodiversity Institute of Ontario, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
2 Department of Integrative Biology, University of Guelph, 50 Stone Road East, Guelph, Ontario N1G 2W1, Canada
3 China National GeneBank-Shenzhen, BGI-Shenzhen, Shenzhen, Guangdong Province, China
4 Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, Øster Voldgade 5–7, 1350 Copenhagen, Denmark
5 National Research Collections, Australian National Insect Collection, Canberra, ACT, Australia
6 Current address: Beijing Advanced Innovation Center for Food Nutrition and Human Health, China Agricultural University, Beijing 100193, China
7 Current address: College of Food Science and Nutritional Engineering, China Agricultural University, Beijing 100083, China


ABSTRACT

The evolution of powered flight is a major innovation that has facilitated the success of insects. We explore whether specific gene categories exhibit positive selection in the lineage leading to the flying insects, and test for positive and relaxed selection in lineages that have lost flight, in a data set of over 1000 nuclear and mitochondrial protein-coding genes obtained from transcriptomes. Previously, studies of birds, bats, and insects have detected molecular signatures of differing selection regimes in energy-related genes associated with flight evolution and/or loss. Here, with the inclusion of genes not directly related to energy production, we further test what genes are under differing selection pressures associated with the evolutionary gain and loss of flight. We detected a number of categories of nuclear genes more often under positive selection in the lineage leading to the flying insects, related to catabolic processes such as peptidase activity and hydrolysis, as well as related to RNA splicing. These categories did not overlap with genes under positive selection in those branches where flight was lost, involving various metabolic processes. Oxidative phosphorylation genes were most often detected as being under positive selection in the holometabolous insects as compared with other insect lineages. Consistent with previous studies of flight loss, we observed higher dN/dS ratios indicative of relaxed selection in energy-related genes in flightless lineages. Overall, this study supports some convergence in gene-specific selection pressures in insects, as in other flying and flightless taxa, while the exploratory analysis provided some new insights into gene categories potentially associated with the gain and loss of flight. This study is a step toward elucidating the molecular evolution correlated with an important and unique trait that has played a major role in shaping the diversity of life.

Keywords: insects, flight, flight loss, nuclear, mitochondrial, positive selection, transcriptomes, 1KITE project

INTRODUCTION

The evolution of powered flight in insects has likely positively impacted the species diversity of this group (Mayhew 2007). Flight, having arisen multiple times in animals, arose earliest in insects approximately 406 million years ago and characterizes the clade Pterygota


(Misof et al. 2014). By increasing dispersal ability, flight facilitates food and mate finding as well as the avoidance of unfavourable habitats or predators (Grimaldi and Engel 2005). While the apterygote insects have no major metamorphosis stages after hatching, pterygotes additionally have at least basic metamorphosis, which involves egg, nymph, and adult stages. The origin of metamorphosis may be linked directly to the evolution of flight due to selection against less-adaptive winged intermediate stages (Marshall 2006). These transitions paved the way for later innovations, such as wing folding and holometabolism (i.e. egg, larval, pupal, and adult stages), additionally implicated in the evolutionary success of insects (Mayhew 2007). Despite the advantages associated with flight, flight has been lost an estimated thousands of times in pterygotes (Whiting et al. 2003), such as in lineages representing fleas, snowflies, and stick insects (Roff 1990).

The evolution of key traits at the origin of Pterygota is not well understood. Three main hypotheses have been proposed for the origin of the structures forming wings, involving modification of gills, extensions of the body wall, or both (Averof and Cohen 1997, Clark-Hatchel et al. 2013). Genes important for the physical development of wings have been identified, including the protein-coding genes wingless, apterous, vestigial, nubbin, nub (reviewed in Brook et al. 1996), and vein (Paul et al. 2013). Genes differentially expressed in flying and non-flying morphs within certain insect species have also been identified. Genes more highly expressed in flying morphs include 1) those involved in energy production, e.g. genes that function in the mitochondria (Yang et al. 2014a, Brisson et al. 2007) and the nuclear gene Isocitrate dehydrogenase (IDH), which is important in the citric acid cycle (Brisson et al. 2007); 2) those involved with lipid metabolism (Yang et al. 2014a); and 3) the flightin gene (Brisson et al. 2007, Xue et al. 2013, Yang et al. 2014a), which is important for indirect flight muscle function (Vigoreaux 1998). Genes more highly expressed in flightless morphs include those related to sugar metabolism (Yang et al. 2014a) such as trehalase (involved in conversion of trehalose to glucose) (Brisson et al. 2007), and seryl-tRNA synthetase (involved in tRNA metabolic process) (Yang et al. 2014a).

In order to further understand the evolution of flight in insects, it would be useful to pinpoint genes that experienced differing selective pressures during the time when flight originated (a time span of approximately 14 million years; Misof et al. 2014), as well as genes associated with the evolutionary loss of flight. Differential selective pressures may be 1) directly related to the evolution or development of wings or to the function of

flying itself, 2) may occur as a consequence of flight, or 3) may involve some pre-adaptation that allowed the evolution of flight. While flight evolved once in insects and in the distant past, making it difficult to detect signatures of selection, the numerous and more recent losses of flight may provide further information on genes relevant to the evolution and maintenance of flight.

Flight is a highly energetically costly activity in animals; flying insects use up to 50 (Roff 1991) or 100 times (Krogh et al. 1951) more energy when flying than at rest. The oxidative phosphorylation (OXPHOS) pathway in the mitochondrion provides 95% of the energy required for eukaryotic cells (Erecinska and Wilson 1982). Therefore, the 13 mitochondrial protein-coding OXPHOS genes, the 78 nuclear OXPHOS genes (number present in Drosophila, Tripoli et al. 2005), and the 1000+ additional nuclear-encoded genes that function in the mitochondria are likely important in the evolution of traits that require large amounts of energy (Shen et al. 2010), such as large brain:body size ratios (Ai et al. 2010). Genes involved in energy production were observed to bear signatures of positive selection with the evolution of flight in animals, or conversely of relaxed selection with flight loss (Shen et al. 2009, 2010, Mitterboeck and Adamowicz 2013, Yang et al. 2014b). However, the broad question of what gene types are associated with flight evolution in insects has not been investigated, with previous studies focused on mitochondrial energy-related genes a priori (Yang et al. 2014b).

We conduct an exploratory analysis to identify the types of protein-coding genes that have evolved under positive selection in the lineage that gave rise to the pterygote clade (Branch ‘P’ in Figure 4.1), and those under positive and relaxed selection in flightless lineages. We use sequences from up to 1476 orthologous nuclear and 13 mitochondrial protein-coding genes obtained from transcriptomes.
To further test the relationship between energy-related genes and flight, we test for positive selection in available nuclear OXPHOS and mitochondrial OXPHOS genes throughout the major lineages of hexapods.

METHODS

Source of genetic data

The nuclear genetic data used in this study consist of transcriptome-derived DNA sequences obtained as part of the 1000 Insect Transcriptome Evolution (1KITE) project (http://www.1kite.org) from Misof et al. (2014). We used the updated (e3) assembly version 2 of

this alignment (NCBI accession PRJNA183205) of 1476 single-copy nuclear orthologous genes and 113 species; this data set involved the removal or replacement of problematic sequences detected since the Misof et al. (2014) analysis and followed the same pipeline. The pipeline comprised the following steps: 1) 1476 single-copy orthologous genes were inferred for each species using Orthograph (unpublished, program available on GitHub: https://github.com/mptrsen/Orthograph), 2) multiple sequence alignment was conducted with MAFFT v7.017 (Katoh and Standley 2013) using the L-INS-i algorithm for amino acid sequences translated from original nucleotide transcripts, 3) the multiple sequence alignment of each orthologous group was refined by removal of outliers and re-alignment, and 4) Pal2Nal (Suyama et al. 2006) was applied to obtain the corresponding nucleotide multiple sequence alignments. Sequences for the 13 mitochondrial protein-coding genes were obtained from the associated mitochondrial transcriptome sequencing project of the Beijing Genomics Institute, with some substitution of sequences from mitochondrial genomes published on NCBI (species and sources of data provided in Supplementary Material [SM] Ch4_S2). These sequences were aligned with EMBL-EBI Clustal Omega (Sievers et al. 2011) and Pal2Nal (Suyama et al. 2006), and Guidance (Penn et al. 2010) was applied to mask sequence regions that were unreliably aligned. Phylogenetic tree topology for selection tests was obtained from Misof et al. (2014).

1. Exploratory test of positive selection in lineage leading to Pterygota

Twenty-eight species of hexapods were selected to maximize the number of shared nuclear genes available for analysis and the phylogenetic representation of pterygotes and non-pterygote hexapods. Not all genes were available for all species in the candidate alignment, and thus species were selected with the trade-off of number of species versus obtaining the largest gene set. We excluded flightless species or orders here. Species selection was performed in a phylogenetically stratified way, with the final list of 28 species being those that gave the maximum gene count: 1) all 5 apterygote orders were included, with a maximum of three species per order, but allowing up to 1 missing sequence per gene for this set; 2) one species from Odonata and one species from Ephemeroptera were included, with no missing sequences allowed; 3) 1 species per each of 5 orders of Polyneoptera was included, allowing one missing sequence per gene for this set; 4) 1 species from each of 10 orders in the clade including Thysanoptera to Diptera (as ordered in Figure 4.1) was included, allowing up to 3 missing

88 sequences per gene (species selected shown in Figure 4.1). This resulted in 954 genes. Similarly, 27 species representing the hexapod orders were selected for the 13 mitochondrial protein-coding genes, with no missing sequences allowed. We tested for evidence of positive selection (PS) in these nuclear and mitochondrial genes in the lineage leading to the Pterygota (Fig. 4.1 branch ‘P’). The branch-site method of detecting positive selection (Zhang et al. 2005) in the program PAML codeml (Yang 2007) was used, with the fit of models A1 (omega [dN/dS ratio] fixed at 1) vs. A (omega free to vary) (each with four classes of sites, each class allowing a certain combination of dN/dS ratios representing positive selection, purifying selection, or neutral evolution) compared for each gene separately through likelihood ratio tests. For this and subsequent analyses, we corrected for false discovery (with rate controlled to 5%) due to multiple genes being tested by using the Benjamini-Hochberg correction (Benjamini and Hochberg 1995) for each gene within a set. We repeat the tests on two additional lineages to serve as a null hypothesis to compare to the results for the lineage ‘P’. Branch ‘U’ (upstream) and ‘D’ (downstream) (Fig. 4.1) were tested, each representing a time span of an estimated 20 million years (Misof et al. 2014). Using these results, we separated out genes that were uniquely detected as being under positive selection in the lineage leading to the pterygotes. All positively selected genes and these unique genes were each subjected to Gene Ontology (GO) analysis, described in section 4 below.
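The per-gene testing procedure described above (a likelihood ratio test between models A1 and A, followed by Benjamini-Hochberg correction within a gene set) can be sketched as follows. This is a minimal illustration and not the script used in this study: the log-likelihood values and gene names are hypothetical, and the chi-square reference distribution with one degree of freedom is an assumption, as the degrees of freedom used are not stated in the text.

```python
import math

def lrt_pvalue(lnl_null, lnl_alt):
    """P value for a likelihood ratio test, assuming 1 degree of freedom."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))

def benjamini_hochberg(pvalues, fdr=0.05):
    """Return booleans marking tests that are significant at the given FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant

# Hypothetical log-likelihoods (null model A1, alternative model A) per gene.
example_lnl = {
    "geneA": (-5230.4, -5221.1),
    "geneB": (-1840.2, -1840.0),
    "geneC": (-977.6, -973.2),
    "geneD": (-3310.9, -3310.9),
}
genes = list(example_lnl)
pvals = [lrt_pvalue(*example_lnl[g]) for g in genes]
flags = benjamini_hochberg(pvals)
for gene, p, sig in zip(genes, pvals, flags):
    print(gene, round(p, 4), sig)
```

In this toy input, only the two genes with a large improvement in likelihood under model A remain significant after correction.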

2. Exploring genes under positive selection with flight loss

Eleven cases of flight ‘loss’ were identified on the tree (Figure 4.1) to be analyzed for nuclear gene data. Not all of these evolutionary losses were accurately mapped to the correct node here, given the available species sampled. For example, a loss may have occurred in the common ancestor of a family, but only species representing superfamily-level divergences were available for analysis here. In the case of phasmids, flight loss is known to have occurred multiple times within the order (Whiting et al. 2003, Stone and French 2003). However, all available species were flightless, and thus the losses could not be represented accurately on the phylogeny; we tested the branch leading to the phasmid clade to approximate the timing of the early flight losses in that order as accurately as possible. Due to incomplete phylogenetic mapping of the flight losses, the branches tested here likely represent some flying lineage history in addition to flightless lineage history, which may cause underestimation of molecular signal due to flight loss. A qualitative assessment is provided to indicate the likely degree of accuracy in the mapping of each case of flight loss, considering the density of taxonomic sampling in that group and how frequently flight is thought to have been lost in those groups (Figure 4.1 and SM Ch4_S1 Table Ch4_S1_2). Sub-trees including the lineage of interest, sister lineage(s), and 3 successively branching outgroups were used to test for signatures of positive selection associated with each flight loss separately in order to maximize gene coverage; no missing gene data were allowed for the species. Each sub-tree contained 14 to 19 species, with 584 to 1174 genes available for all species in that analysis (provided in SM Ch4_S2 Table Ch4_S2_4). A total of 1284 genes was included considering all 11 sub-trees. A test for positive selection was performed on each of the 11 branches of interest for each tree and gene separately. Those genes with significant p values (at 0.05 level after Benjamini-Hochberg correction) within a tree were further considered. We identified overlapping genes among those evolving under positive selection in the 11 cases of flight loss tested. Genes detected as under positive selection in at least 3 cases of flight loss (referred to as ‘candidate genes’) were included in GO analysis (Part 4 below). Furthermore, GO analysis was performed with datasets modified to reduce confounding variables: 1) re-analysis with only full flight losses, by excluding 3 cases of female flight loss; and 2) removing genes from the original candidate gene list if the positive selection was detected in 2 or all 3 of the 3 comparisons involving a parasitic lineage and if the positive selection count in that candidate gene would be reduced below 3 without those counts.
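The selection of ‘candidate genes’ (those detected under positive selection in at least 3 of the cases of flight loss) amounts to counting, for each gene, the number of sub-trees in which it was significant. A minimal sketch is given below; the case labels and gene identifiers are hypothetical placeholders, not the actual results.

```python
from collections import Counter

def candidate_genes(significant_by_case, min_cases=3):
    """Genes significant in at least `min_cases` independent flight-loss cases."""
    counts = Counter()
    for genes in significant_by_case.values():
        counts.update(set(genes))  # count each gene at most once per case
    return sorted(g for g, n in counts.items() if n >= min_cases)

# Hypothetical significant genes per flight-loss case (after FDR correction).
example_cases = {
    "loss_1": {"g1", "g2", "g3"},
    "loss_2": {"g1", "g3"},
    "loss_3": {"g1", "g4"},
    "loss_4": {"g3", "g4"},
}
candidates = candidate_genes(example_cases)
print(candidates)
```

The same function could be re-run on a reduced dictionary (for example, with the female-flightless cases dropped) to mirror the modified datasets described above.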

3. Exploring genes under relaxed and/or positive selection with flight loss

Mitochondrial genes were examined for relaxed and/or positive selection associated with flight loss. A 66-species tree similar to Figure 4.1 was used, shown in Supplementary Material (SM) Ch4_S1 Figure Ch4_S1_1. Branch models in PAML codeml were applied to estimate dN/dS ratios for lineages of interest. While gene-wide dN/dS ratios above 1 most likely indicate positive selection, those between 0 and 1 could indicate positive or relaxed selection. We identified genes with increased dN/dS ratio (between 0 and 1) in flightless as compared to flying lineages. dN/dS ratios within this range may more likely indicate relaxed selection due to the assumption that the majority of non-synonymous changes across a whole gene sequence are selectively neutral or slightly deleterious; by contrast, positive selection is assumed to affect a small minority of sites at which beneficial mutations have occurred (Hughes 2007). These assumptions of the nearly neutral theory of molecular evolution (Ohta 1973, 1992) have been broadly supported by empirical studies of patterns of genetic variability (e.g. Woolfit and Bromham 2003, Shen et al. 2009). All secondarily flightless lineages were coded as one rate, and all sister or related flying lineages of similar tip number and taxonomic rank were coded together as a separate rate; this was performed in order to avoid biasing results based on differential number and depth of nodes in each category, such as due to node density effects (Robinson et al. 1998) or rates being underestimated for deeper nodes in a tree (Ho et al. 2011). In preliminary tests on these mitochondrial genes and in Mitterboeck and Adamowicz (2013), the female-flightless lineages yielded similar results to full-flightless lineages as compared with related flying lineages. Due to this and the small number of flightless lineages, we considered both female- and both-sexes flightless lineages in the flightless category. Likelihood ratio tests between 3-rate trees (secondarily flightless [red and purple branches; Fig. Ch4_S1_1], related flying [blue branches], all other lineages [black branches]) and 2-rate trees (secondarily flightless plus related flying [blue+red+purple], all other lineages [black]) were used to test for significant dN/dS differences between target lineages and sister lineages (with Chi square test at p<0.05). P values were corrected by Benjamini-Hochberg correction across genes. We directly compared the dN/dS ratios in the flightless vs. related flying lineages.

4. Gene Ontology categories

We tested for over-representation in Gene Ontology (GO) categories by the genes exhibiting positive selection (from Methods parts 1 and 2) as compared to each total gene set analyzed (‘background genes’) using the DAVID (Database for Annotation, Visualization and Integrated Discovery) Functional Annotation tool (Huang et al. 2009a, 2009b) and similarly using PANTHER (Protein Analysis Through Evolutionary Relationships) (Mi et al. 2015). These tools were chosen a priori without testing the data using other similar tools that could give differing results. Gene Ontology terms used in DAVID are from a standard list, used to identify ‘themes’; PANTHER may use some of these same terms but additionally uses its own protein family ontologies, extending the vocabulary of GO terms to focus on ‘processes’. The genes, which all have FlyBase IDs (http://flybase.org/), were matched to Drosophila genome functional annotations where available. No additional false discovery rate correction was applied (p values are raw) as correction was applied for selection analysis and the number of genes tested was relatively modest. The background gene sets themselves were not representative of all the genes present in the insect genomes; we provide information on the gene categories over- or under-represented among the total available 1476 background gene set in SM Ch4_S1 Table Ch4_S1_1 (for PANTHER) and SM Ch4_S2 Table Ch4_S2_1 (for DAVID).

5. Positive selection in energy-related genes in Hexapoda

Specific genes were investigated that related to energy production or were a priori hypothesized to be related to flight or flight loss. These included 14 nuclear OXPHOS genes available in the total gene set (1476 genes) identified via their FlyBase IDs, which are a subset of the 78 nuclear OXPHOS genes listed in Tripoli et al. (2005), and 5 other genes of interest identified by name or description in DAVID functional annotation: wingless, IDH, flightless1, myosin binding subunit, and an energy-related gene (Dmel_CG1271). Ten additional genes with full species coverage were pseudo-randomly selected (not considering function, with the selections spread out by FlyBase IDs) and also analyzed to check for phylogenetic biases in the positive selection results. We selected one species per hexapod order and one species from each of two outgroups (outgroups were available for the nuclear genes only; 34 species total), for each set of nuclear and mitochondrial genes. In selecting species, we considered gene completeness, with preference for those species available across the most genes of interest. In a few cases, substitutions of some species were made to improve gene completeness (species lists provided in SM Ch4_S2 Tables Ch4_S2_6 and Ch4_S2_7). Mitochondrial genes, where gene sampling was more complete for species, were additionally tested with more than 1 species per order (up to 6 species) to investigate effects of species sampling on the results; the 66-species tree was the same as that used for relaxed selection analysis of mitochondrial genes (SM Ch4_S1 Figure Ch4_S1_1). Tests for positive selection were conducted on all lineages using the program HyPhy (Pond et al. 2005) and the Branch-site REL (Random Effects Likelihood) model (Pond et al. 2011) implemented on the DataMonkey server (Delport et al. 2010).

RESULTS

1. Positive selection associated with the origin of flight

126 out of 954 nuclear genes (13%) were detected to be under positive selection in lineage ‘P’ (Pterygota); 39 of these were uniquely detected to be under positive selection in branch ‘P’ and not also detected in either branch ‘U’ (upstream) or ‘D’ (downstream). The 126 candidate nuclear genes over-represented 17 Gene Ontology (GO) categories (with p<0.05) (Table 4.1) and 2 Biological process categories (Table 4.2). The 39 unique candidate genes over-represented 8 GO categories and 1 Biological process. The full set of 126 candidate genes contained just one nuclear OXPHOS gene, out of a total of 13 OXPHOS genes available in the background gene set (of 946 genes); this value does not differ from the proportional representation expected by chance (Fisher’s exact test, p = 1.0). None of the 13 mitochondrial genes were detected to be under positive selection in the ‘P’ lineage after Benjamini-Hochberg correction.
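The OXPHOS comparison above can be illustrated with a two-sided Fisher’s exact test written from first principles. This is a sketch only: the layout of the 2x2 table (1 OXPHOS and 125 non-OXPHOS genes among the 126 candidates, versus 12 OXPHOS and 808 non-OXPHOS genes among the remaining 820 background genes) is an assumed reconstruction from the counts reported in the text, and the original analysis may have been set up differently.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# Assumed table: rows = candidate / other background genes,
# columns = OXPHOS / non-OXPHOS.
p_oxphos = fisher_exact_two_sided(1, 125, 12, 808)
print(round(p_oxphos, 3))
```

Under these assumed counts, one OXPHOS gene among 126 candidates is the single most probable outcome, so the two-sided p value is 1.0, in agreement with the value reported above.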

2. Positive selection associated with flight loss

In the 11 lineages exhibiting flight loss that were tested for positive selection in 584-1174 genes each, between 0.8 and 53.7% of genes exhibited positive selection as compared to the background genes, with a median of 2.4%. Fifty-eight genes were detected to be under positive selection in 3 or more of the 11 lineages of interest associated with flight loss. These genes over-represented 3 Gene Ontology and 12 Biological Process categories (Table 4.3). None of the 58 genes were OXPHOS genes, which is not significantly different from the proportion of OXPHOS genes present in the pooled background gene set (13 of 1284; Fisher’s exact test, p = 1.0). The GO and Biological process categories were similar to those in Table 4.3 when considering only cases of full flight loss (8 lineages) and when excluding genes under positive selection mainly in flightless parasitic lineages (SM Ch4_S1 Tables Ch4_S1_5 to Ch4_S1_8).

3. Relaxed selection associated with flight loss

The flightless lineages, which here included both-sexes-flightless and female-flightless lineages, had significantly higher dN/dS ratios than related flying lineages in the mitochondrial genes (Figure 4.2). Eleven out of 13 mitochondrial OXPHOS genes had higher dN/dS ratios in the flightless lineage than in the related flying lineage (binomial test, p = 0.0225). The ratio was higher in the flightless lineage for all 5 of the genes that individually exhibited significant differences in the dN/dS ratio between flight categories (p values for the likelihood ratio tests for each individual gene are given in SM Ch4_S1 Table Ch4_S1_9). Four out of the 5 genes are cytochrome complex genes (COI, COII, COIII, and CytB).
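The binomial result for the 11 of 13 genes can be checked with a short exact calculation. The sketch below assumes a two-sided sign test against an equal chance of either direction (probability 0.5); the sidedness of the test is not stated in the text, but the two-sided value reproduces the reported figure.

```python
from math import comb

def binomial_two_sided(k, n):
    """Two-sided exact binomial (sign) test against a success probability of 0.5."""
    extreme = max(k, n - k)
    # Upper-tail probability of an outcome at least as extreme as observed.
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    # The distribution is symmetric at 0.5, so the two-sided value doubles the tail.
    return min(1.0, 2 * tail)

p_sign = binomial_two_sided(11, 13)
print(round(p_sign, 4))  # 0.0225
```

Here the upper tail is (78 + 13 + 1) / 8192, and doubling it gives 184 / 8192, or approximately 0.0225.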

4. Gene Ontology analysis

Results for Gene Ontology analysis were reported above in sections 1-3.

5. Positive selection in nuclear and mitochondrial OXPHOS genes in hexapod lineages

Six of the 14 nuclear OXPHOS genes available in the total gene set exhibited positive selection in at least one branch, along with 4 of 10 randomly selected nuclear genes, and 3 of the 5 other nuclear genes chosen a priori (all genes listed in SM Ch4_S1 Table Ch4_S1_6). All mitochondrial OXPHOS genes had positive selection detected in at least one branch (Figure 4.3). The apterygote branches, excepting Protura, and lineages Odonata/Ephemeroptera, which have a direct flight mechanism, did not exhibit many instances of detection of positive selection; however, density of sampling may partially impact detection. In the mitochondrial tree including more than one species per order, again no positive selection was detected in apterygotes (excepting Protura); however, some instances of positive selection were detected within the Odonata-Ephemeroptera clade. Occurrence of positive selection, mainly in mitochondrial OXPHOS genes, appears to be denser in the holometabolous insect clade (clade labeled ‘H’ in Figure 4.3) than in the polyneopteran clade (labeled ‘L’ in Figure 4.3); both of those clades contain a similar number of orders and are of similar age (~362 and 387 MY, respectively, Misof et al. 2014).

DISCUSSION

This study tested for trends in the categories of genes evolving under differing selective pressures associated with flight gain and loss in insects. The incorporation of both transition directions (gain and loss of flight) allows a comparison of trends in the genes under adaptive evolution and relaxed selective constraints with the evolution and loss of flight, respectively. We observed the origin of Pterygota to be associated with detection of positive selection in categories of genes tied to catabolic processes and flight loss tied to metabolic processes. Flight loss was also associated with a signal of increased relaxed selection in mitochondrial protein-coding (energy-related) genes as compared with flying lineages. The holometabolous insects exhibited the highest prevalence of positive selection within the hexapods, when examining a subset of genes selected a priori to be potentially relevant to flight.

Gene categories with signature of positive selection at the origin of Pterygota

In this exploratory analysis, we observed the origin of Pterygota to be associated with signatures of positive selection in genes whose categories have a common theme of catabolism, which is the subset of metabolic activities involved in breaking down molecules to release energy and building components. The catabolic categories included peptidase, hydrolase, and protease, and the category RNA splicing was also detected. These results considered the baseline positive selection detected in related lineages (2 control lineages tested) and were consistent across two different analysis methods for detecting gene category over-representation. Some of these categories are similar to those observed to be more highly expressed in flying vs. flightless morphs of aphids, specifically proteasome (protease) and spliceosome (Yang et al. 2014a). Any genes detected under differing selective pressures, even if the methods provided biologically accurate results, could be linked to the many other apomorphies arising at the origin of Pterygota, such as metamorphosis or direct sperm transfer. Previously, Shen et al. (2010) examined positive selection in the lineage representing the origin of bats (flying mammals) using branch-site models, similar to the approach used here for the origin of Pterygota. They observed the origin of bats to be associated with a 4% increase (from a baseline of 1% of all nuclear genes under positive selection) in the number of nuclear OXPHOS genes displaying positive selection and an increase from 1% to 2% representation of mitochondrial-functioning nuclear genes in the positively selected category (Shen et al. 2010). However, when considering the proportion of positive selection in those genes as compared to the lineage leading to (non-flying) rodents, nuclear OXPHOS genes showed only little difference between bats and rodents, and nuclear-related mitochondrial genes showed no difference. Five of the 102 nuclear OXPHOS genes examined were observed to be under positive selection in the lineage leading to bats, as compared to 2 under positive selection in the lineage leading to rodents (Shen et al. 2010). The proportion of mitochondrial-related nuclear genes detected under positive selection in the lineage leading to bats was the same as that in the rodent lineages. One-quarter of mitochondrial OXPHOS genes showed evidence of positive selection in the lineage leading to bats, and none in rodents. Thus, mitochondrial trends were the most apparent related to the origin of bats. Here, no such trend was observed for nuclear or mitochondrial OXPHOS genes; nuclear OXPHOS genes were not specifically detected to be over-represented under positive selection in the Pterygota lineage, and no mitochondrial OXPHOS genes were detected under positive selection in the lineage leading to Pterygota using branch-site methods.

Gene categories with signatures of positive and relaxed selection associated with flight loss

Linked to flight loss, we observed over-representation of positive selection in genes associated with metabolic processes, including primary, RNA, and nitrogen compound metabolic processes. However, these results remain to be confirmed by comparison with flying relatives to eliminate genes that are often under positive selection in any lineage, regardless of flight state. The background gene set also contained genes associated with more highly energy-related categories, such as those involved in oxidative phosphorylation. Given this inclusion of suitable data for testing this pattern, it does not appear that flight loss is significantly tied to positive selection in the most highly energy-linked gene categories. This is in accordance with our prior expectations that we would be more likely to detect relaxed selection in these gene categories associated with flight loss. We expected relaxed selection in energy-related gene categories, arising from reduced selective constraints on energy production with flight loss, and that is what we observed. Here, mitochondrial OXPHOS genes showed evidence of relaxed selection in flightless as compared with flying lineages as demonstrated by significantly higher dN/dS ratios in flightless lineages. This is in accordance with previous observations of proposed relaxed selection (higher dN/dS ratios) in mitochondrial genes associated with flight loss within insect orders (Mitterboeck and Adamowicz 2013) and in birds (Shen et al. 2009). These findings also mirror molecular patterns in sedentary vs. highly locomotive fish (Strohm et al. 2015). Interestingly, 4 out of the 5 significant differences in dN/dS ratios between flightless vs. flying insect lineages were observed in the mitochondrial cytochrome genes (COI, COII, COIII, and CytB), while only 1 significant difference (in ND6) was present for the other genes (ND1, ND2, ND3, ND4, ND4L, ND5, ND6, ATP6, and ATP8). This trend could stem from differences in the function of those genes and in the level of purifying selection acting upon them. Briefly, mitochondrial-encoded proteins in Complex I (ND1 to ND6) function as electron transporters and play structural roles; those in Complex III (CytB) have catalytic activity; those in Complex IV (COI, COII, and COIII) catalyze electron transfer; and within Complex V (ATP6 and ATP8) one functions as a component in a proton channel and another as an assembly regulator (da Fonseca et al. 2008). In mammals, dN/dS ratios of these genes suggest the greatest purifying selection on sequences of COI, COII, COIII, and CytB (Castellana et al. 2011). Similarly, in beetles the lowest rates of substitutions at 1st and 2nd codon positions (where substitutions are mainly non-synonymous) in mitochondrial protein-coding genes were observed in COI, CytB, ND1, COIII, and COII (Pons et al. 2010). Therefore, the observation of greater difference in dN/dS ratios between flightless and flying lineages for the genes COI, COII, COIII, and CytB may be due to greater purifying selection on those genes in combination with relaxed purifying selection in flightless taxa. In the other genes, trends could be masked by lower differences in non-synonymous substitution rates between flightless and flying insects, or due to saturation at 1st and 2nd as well as 3rd codon positions.

Positive selection in the major lineages of hexapods

Previously, mitochondrial OXPHOS genes were examined for positive selection throughout insect lineages, and there were fewer signatures of selection in apterygote lineages (Yang et al. 2014b). However, only the two (of 2) apterygote insect orders and 20 (of 27) pterygote orders were previously tested. Here, we included all 5 apterygote hexapod orders and all pterygote orders and also examined nuclear OXPHOS genes. We similarly observed a lack of positive selection in apterygote lineages and no disproportionate detection of OXPHOS genes specifically associated with the origin of Pterygota; however, there appears to be a concentration of positive selection in both mitochondrial (mainly) and nuclear OXPHOS genes in the holometabolous insects. While the number of nodes included here for holometabolous insects (clade ‘H’ in Figure 4.3) was similar to that for the polyneopteran clade (‘L’), the actual species diversity of holometabolous insects is large and represents 83% of all insect species (Foottit and Adler 2009). Furthermore, Hymenoptera had a high incidence of positive selection detected in mitochondrial genes, which is similar to reports of accelerated evolution in mitochondrial but not nuclear genes in this group (Kaltenpoth et al. 2012). The actual species diversity of Hymenoptera may additionally be much higher than recorded (Veijalainen et al. 2012). Species diversity and molecular evolutionary rates have been observed to correspond (e.g. Eo and DeWoody 2010). Given the radiation of species in holometabolous insects, the detection of selection may in part be linked to the speciation rate of the group. However, this potential mechanism does not fully explain the findings as several highly species-rich groups (such as Lepidoptera) did not exhibit significant positive selection.

It was previously proposed that the asynchronous vs. synchronous flight mechanism may explain trends in the detection of adaptive evolution in flying insects (Yang et al. 2014b). Asynchronous flight, the ability to produce multiple wing beats per nerve impulse, is present in all of Hymenoptera, Coleoptera, Strepsiptera, Diptera, and Thysanoptera (Resh and Carde 2009). The pattern of positive selection here does not mirror the occurrence of asynchronous or synchronous flight; no pattern may be expected, as the cost of flight is determined by aerodynamics and resulting wingbeat frequencies; although synchronous flight may cost more metabolically per stroke, asynchronous fliers often achieve higher stroke frequencies (Evans and Wigglesworth 1988, Conley and Lindstedt 2002). Some positive selection was associated with the origin of Pterygota, but not more so than in downstream lineages. It is possible that the origin of flight set the stage for downstream selection pressures within some lineages related to metabolic efficiency. It is also possible that we were not able to accurately detect signatures of selection on longer timeframes, due to saturation and/or subsequent periods of positive and purifying selection.

Note on the exploratory analysis of gene categories

It is likely that certain genes or categories of genes are more likely to be detected under positive selection in all lineages, regardless of flight condition. A gene related to zinc-ion binding was detected in most cases of flight loss, as with the origin of Pterygota; however, that category was no longer present when only unique candidate genes were considered with Pterygota. In addition, this gene category was detected previously under positive selection in ants (Roux et al. 2014) and under rapid evolution in Drosophila (Hahn et al. 2007). The situation was similar for the biological process category “regulation of transcription from RNA polymerase II promoter (GO:0006357)”, which was detected in association with flight loss but was not among the unique genes detected in the lineage leading to Pterygota. Furthermore, the genes detected under positive selection for the single case of the evolution of flight should be examined with caution due to lacking phylogenetically independent replication; additionally, the evolution of other apomorphies occurred on this lineage, confounding the investigation of a particular characteristic of interest. The agreement of results with biological expectations does not verify the validity of any analysis (Pavlidis 2012). It is also possible that different gene categories would be detected under positive selection with varying species choice and associated change in background genes available. Finally, the analysis of positive selection in multiple genes can be vulnerable to false positives, the rate of which varies with many factors including quality of sequence alignment (Mallick et al. 2009) and choice of alignment software (Markova-Raina and Petrov 2011). The rate of false positives in branch-site tests also increases slightly with some synonymous site saturation; however, the rate of false negatives (decreased power) increases with high synonymous site saturation (Gharib and Robinson-Rechavi 2013).

Next steps

Next steps include 1) testing the robustness of the genes under positive selection in the Pterygota lineage with varying species choice, 2) obtaining a baseline point of comparison for positive selection associated with flight loss by similarly testing flying lineages of similar taxonomic rank, and 3) performing relaxed selection analysis for the nuclear genes, as was performed for the mitochondrial protein-coding genes. Since it is difficult to directly link any results to flight itself given co-occurring variables and lack of replication, the results from relaxed selection analysis of flight loss could theoretically provide some opportunity to assess, in reverse, the potential positive selection associated with flight evolution.

While multiple cases of flight loss were considered via positive selection analysis, future insect phylogenomic work with increased taxonomic sampling would increase the number of cases of flight loss available and improve the accuracy of their phylogenetic mapping. This would be especially useful for more recent cases of flight loss that were not available for inclusion here. For example, the order Phasmatodea may contain at least 13 losses of flight (Whiting et al. 2003, Stone and French 2003), most of these appearing to be located at fairly high taxonomic levels (e.g. families), while Diptera contains at least 24 losses within various families (Wiegmann et al. 2011). With more instances of flight loss, the effects of co-occurring confounding factors, such as parasitism, could be separated. As well, the difference in molecular evolutionary trends between various flight loss types (e.g. female-flight loss vs. full flight loss) would be interesting to examine. Finally, the number of orthologous genes available between more closely related species, if smaller taxonomic breadth were examined in each single analysis, would be higher.

Methodological comparisons

The percentage of genes detected to be under positive selection associated with Pterygota (13%) is higher than in other genome-wide scan studies, while the percentage for flight loss (median 2.4%) is similar to other studies. For example, 3% of genes were detected as positively selected in the lineage represented by dolphins as compared to 2% in cows (Sun et al. 2012); 1% were detected in the lineage leading to bats (flying) (Shen et al. 2010); and 1% in humans and 6% in chimpanzees using the same branch-site test of positive selection as used here (Arbiza et al. 2006). These studies included a higher number of genes than were examined here: over 12,000 in Sun et al. (2012), over 7,100 genes in Shen et al. (2010), and over 13,100 in Arbiza et al. (2006). Transcriptome-derived single-copy orthologous genes represent a small portion of the genome; the 1476 total genes analyzed here are single-copy orthologs across available arthropod transcriptomes as compared to ~16,000 total genes present in Drosophila (Hahn et al. 2007). Thus, the single-copy orthologs do not present a complete picture of gene evolution; however, the transcriptome data provided phylogenetic breadth of lineages, which is important when distantly-related taxa are considered together.

This study involved detecting positive and relaxed selection along longer timespans as compared with previous work; sister insect orders are separated by about 150 to 350 million years since their common ancestors (from Misof et al. 2014). It is likely that much amino acid change is represented along any single branch analyzed, as compared with more closely-related taxa typically examined in genome-wide scan studies; for example, cows and dolphins are separated by approximately 60 million years (Meredith et al. 2011). Thus, it is likely that true positive or relaxed selection was more difficult to detect in the lineages analyzed here, especially in the deeper ‘P’ lineage, due to multiple periods of positive and purifying selection leading to noise. However, the trends observed for dN/dS ratios in flightless lineages as compared to flying lineages on the long timeframes investigated here mirrored trends observed for cases of flight loss within insect orders (Mitterboeck and Adamowicz 2013), suggesting that at least some of the same signal for relaxed selection is present on shorter as well as on longer timeframes. The orthologous genes included here represent those more essential for life as they are present and transcribed across a range of arthropod species, and many serve basic cellular functions (Misof et al. 2014). As such they may be more highly expressed, and thus slower evolving due to tighter selective constraints (Drummond et al. 2005). If so, the use of this subset of genes could have improved the detection of signals of selection as compared to genes that are more rapidly evolving as a result of greater effects of neutral processes.

Caveats

The accuracy of the data appears comparable to other genome studies as shown by the accuracy in identification of transcripts to official gene sets of reference arthropod taxa (Misof et al. 2014). However, the genes analyzed here represent a portion of all protein-coding genes in the insect genomes and thus restrict the total pool of possible gene categories that could be detected under differing selection pressures. Protein-coding genes themselves are only part of the functioning genome; there may be important changes in regulatory regions, which govern expression levels and the specific tissues in which expression occurs, associated with flight and loss. Thus, future comparative genomics analysis could further investigate both protein-coding and non-coding loci. Additionally, the use of transcriptomes for sequence data did not allow an assessment of presence or absence of genes to assess gene gains or losses, and thus the evolution of gene families could not be examined here. Further work on this point would likely prove interesting, given that other studies have provided evidence for trends in adaptation based on gene presence and absence or gene family evolution such as diversification among paralogous genes (Hahn et al. 2007, De Grassi et al. 2008).

Conclusions

This study presents an exploratory examination of the genes under positive and relaxed selection associated with the evolution and loss of flight in insects. Together, similarities in selective pressures on gene types across flying and flightless animal groups suggest convergent trends in molecular evolution paralleling convergent functional evolution. The results here contribute insight into the evolution of this successful clade of life.

ACKNOWLEDGEMENTS

We thank Lili Zhou for her contribution during the early stages of this project. We thank Stephen Marshall and Daniel Ashlock for input on the ideas in earlier versions of this manuscript. This work was supported by the University of Guelph (Integrative Biology PhD Award and Dean’s Tri-council Scholarship to T.F.M.), the Government of Ontario (Ontario Graduate Fellowship to T.F.M.), and the Natural Sciences and Engineering Research Council of Canada (Alexander Graham Bell Canada Graduate Scholarship to T.F.M., Discovery Grants 386591-2010 to S.J.A. and 400479 to J.F.).

LIST OF SUPPLEMENTARY MATERIAL

SM Ch4_S1. Contains input data information such as species lists and trees, full results tables from DAVID and PANTHER analyses

SM Ch4_S2. (MS Excel). Contains input and output information such as gene lists, p values for selection tests, and newick trees

Figures and Tables for Chapter 4

Figure 4.1. Tree topology and species used in analysis of nuclear genes. Species names followed by a star indicate those species used in positive selection analysis associated with the origin of Pterygota (branch ‘P’). Starbursts indicate each of the 11 branches that were used in positive selection analysis of flight loss.

Figure 4.2. dN/dS ratios in flightless vs. related flying lineages for 13 mitochondrial protein-coding genes. In 11 of 13 genes, the flightless dN/dS ratio is higher than the flying dN/dS ratio. Genes with significant difference in rates (after Benjamini-Hochberg correction) are given with ‘*’; in all 5 cases, the flightless dN/dS ratio is higher than the flying dN/dS ratio. Dashed lines signify the mean dN/dS values; flightless: 0.031, and flying: 0.021. The tree with lineages tested is given in SM Ch4_S1 Figure Ch4_S1_1.
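
The Benjamini-Hochberg correction named in the caption of Figure 4.2 can be illustrated with a minimal sketch. The function name `benjamini_hochberg` and the example p values are hypothetical and are not the values from this study; the sketch only shows the standard step-up procedure for controlling the false discovery rate across a set of per-gene tests.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the test is significant
    under the Benjamini-Hochberg step-up procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the tests ordered from smallest to largest p value
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # All tests ranked at or below max_rank are declared significant
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Hypothetical p values for four genes (illustration only)
example_p = [0.02, 0.02, 0.03, 0.5]
print(benjamini_hochberg(example_p))
```

Because the procedure is step-up, a test whose p value misses its own rank threshold can still be declared significant when a higher-ranked test passes, as happens for the smallest p value in the hypothetical example.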

Figure 4.3. Positive selection in hexapod lineages in nuclear and mitochondrial genes of interest. The results shown are for trees involving one species representative per insect order. The lineage marked with ‘P’ represents the lineage leading to the clade Pterygota; ‘L’ = Polyneoptera; ‘H’ = Holometabola (insects with complete metamorphosis).

Table 4.1. Gene Ontology (GO) categories from DAVID analysis of the positively selected genes in the lineage (‘P’) leading to Pterygota; A) GO terms for all positively selected genes, and B) GO terms for those positively selected genes uniquely detected in ‘P’ lineage and not in two control lineages tested. Here, only categories with p<0.05 are shown; full results and list of genes in the categories are given in SM Ch4_S1 Table Ch4_S1_3. 954 background genes were mapped to 914 IDs; 126 candidate genes were mapped to 119 IDs; 39 unique candidate genes were mapped to 37 IDs.

A) 126 candidate genes; 914 total genes; 119 positively selected genes

GO term | # in category | Expected #a | Observed # | P value | Fold enrichmentb
compositionally biased region:Ser-rich | 15 | 2.0 | 8 | 0.00027 | 4.10
helicase activity (GO:0004386) | 16 | 2.1 | 8 | 0.0015 | 3.84
Tetratricopeptide-like helical (IPR011990) | 19 | 2.5 | 8 | 0.0058 | 3.23
transition metal ion binding (GO:0046914) | 90 | 11.7 | 20 | 0.0075 | 1.71
metal ion binding (GO:0046872) | 113 | 14.7 | 23 | 0.011 | 1.56
cation binding (GO:0043169) | 114 | 14.8 | 23 | 0.012 | 1.55
ion binding (GO:0043167) | 114 | 14.8 | 23 | 0.012 | 1.55
zinc ion binding (GO:0008270) | 76 | 9.9 | 17 | 0.015 | 1.72
Zinc finger, C2H2-like (IPR015880) | 9 | 1.2 | 5 | 0.018 | 4.27
ZnF_C2H2 (SM00355) | 9 | 1.2 | 5 | 0.023 | 4.27
repeat:TPR 2 | 3 | 0.4 | 3 | 0.030 | 7.68
repeat:TPR 1 | 3 | 0.4 | 3 | 0.030 | 7.68
purine NTP-dependent helicase activity (GO:0070035) | 11 | 1.4 | 5 | 0.036 | 3.49
ATP-dependent helicase activity (GO:0008026) | 11 | 1.4 | 5 | 0.036 | 3.49
Tetratricopeptide region (IPR013026) | 16 | 2.1 | 6 | 0.041 | 2.88
DNA helicase activity (GO:0003678) | 7 | 0.9 | 4 | 0.044 | 4.39
zinc | 62 | 8.1 | 14 | 0.044 | 1.73

B) 39 unique candidate genes; 914 total genes; 37 positively selected genes

GO term | # in category | Expected #c | Observed # | P value | Fold enrichmentb
peptidase activity, acting on L-amino acid peptides (GO:0070011) | 25 | 1.0 | 5 | 0.010 | 4.94
peptidase activity (GO:0008233) | 27 | 1.1 | 5 | 0.013 | 4.57
Protease | 17 | 0.7 | 4 | 0.026 | 5.81
hydrolase | 96 | 3.9 | 9 | 0.027 | 2.32
Nucleotide-binding, alpha-beta plait (IPR012677) | 20 | 0.8 | 4 | 0.039 | 4.94
nuclear mRNA splicing, via spliceosome (GO:0000398) | 25 | 1.0 | 4 | 0.049 | 3.95
RNA splicing, via transesterification reactions with bulged adenosine as nucleophile (GO:0000377) | 25 | 1.0 | 4 | 0.049 | 3.95
RNA splicing, via transesterification reactions (GO:0000375) | 25 | 1.0 | 4 | 0.049 | 3.95

a Expected number of genes in the GO term was calculated by (‘# in category’/914)*119.
b Fold enrichment was calculated by Observed #/Expected # of genes in that GO term.
c Expected number of genes in the GO term was calculated by (‘# in category’/914)*37.
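
The expected counts and fold enrichments in Table 4.1 can be checked with a minimal sketch. The function names below are hypothetical, and the sketch assumes that fold enrichment is the observed count divided by the expected count, which is the ratio that reproduces the tabulated values; the input numbers are taken from the ‘helicase activity (GO:0004386)’ row of Table 4.1A.

```python
def expected_count(n_in_category, n_background, n_candidates):
    """Expected number of candidate genes in a category if candidates
    were drawn from the background without any enrichment."""
    return n_in_category / n_background * n_candidates


def fold_enrichment(observed, expected):
    """Observed number of candidate genes relative to the expected number."""
    return observed / expected


# Table 4.1A, 'helicase activity (GO:0004386)':
# 16 background genes in the category, 914 background IDs,
# 119 positively selected IDs, 8 observed in the category.
exp_helicase = expected_count(16, 914, 119)
fold_helicase = fold_enrichment(8, exp_helicase)
print(round(exp_helicase, 1), round(fold_helicase, 2))
```

The same two functions apply to part B of the table by replacing 119 with 37 candidate IDs.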

Table 4.2. PANTHER Biological Process categories from analysis of positively selected genes in the lineage (‘P’) leading to Pterygota; A) all positively selected genes, and B) positively selected genes uniquely detected in ‘P’ lineage and not in the two control lineages tested. Here, only categories with p<0.05 are shown; 954 background genes were mapped to 930 IDs; 126 candidate genes were mapped to 124 IDs; 39 unique candidate genes were mapped to 39 IDs.

A) 126 candidate genes; 930 total genes; 124 positively selected genes

PANTHER GO-Slim Biological Process term | # in category | Expected # | Observed # | P value | Fold enrichment
regulation of transcription from RNA polymerase II promoter (GO:0006357) | 8 | 3.7 | 8 | 0.034 | 2.14
RNA catabolic process (GO:0006401) | 6 | 0.8 | 3 | 0.047 | 3.75

B) 39 unique candidate genes; 930 total genes; 39 positively selected genes

PANTHER GO-Slim Biological Process term | # in category | Expected # | Observed # | P value | Fold enrichment
RNA catabolic process (GO:0006401) | 6 | 0.25 | 2 | 0.026 | 7.95

Table 4.3. Gene Ontology or Pathway categories for genes detected to be under positive selection in 3 or more of the 11 lineages with flight loss tested. A) Gene Ontology (GO) categories from DAVID analysis. B) PANTHER Biological Process categories; child (sub-categorical) processes are indented below parent processes. Here, only categories with p<0.05 are shown; full results for DAVID analysis including lists of genes in each category are given in SM Ch4_S1 Table Ch4_S1_4. 1284 total background genes were mapped to 1232 IDs (A) or 1254 IDs (B); 58 candidate genes were mapped to 55 IDs (A) or 56 IDs (B).

A) DAVID Gene Ontology analysis; 1232 total genes; 55 positively selected genes

GO term | # in category | Expected #a | Observed # | P value | Fold enrichmentb
mRNA binding (GO:0003729) | 37 | 1.7 | 6 | 0.014 | 3.63
compositionally biased region:Ser-rich | 21 | 0.9 | 4 | 0.015 | 4.27
Nucleotide-binding, alpha-beta plait (IPR012677) | 27 | 1.2 | 5 | 0.026 | 4.15

B) PANTHER Biological Process analysis; 1254 total genes; 56 positively selected genes

PANTHER GO-Slim Biological Process term | # in category | Expected # | Observed # | P value | Fold enrichment
biological regulation (GO:0065007) | 168 | 7.5 | 14 | 0.014 | 1.87
regulation of biological process (GO:0050789) | 117 | 5.2 | 11 | 0.013 | 2.11
regulation of transcription from RNA polymerase II promoter (GO:0006357) | 41 | 1.8 | 5 | 0.036 | 2.73
primary metabolic process (GO:0044238) | 550 | 24.6 | 33 | 0.017 | 1.34
nucleobase-containing compound metabolic process (GO:0006139) | 262 | 11.7 | 19 | 0.016 | 1.62
DNA repair (GO:0006281) | 17 | 0.8 | 3 | 0.041 | 3.95
nitrogen compound metabolic process (GO:0006807) | 106 | 4.7 | 10 | 0.018 | 2.11
RNA metabolic process (GO:0016070) | 170 | 7.6 | 13 | 0.034 | 1.71
rRNA metabolic process (GO:0016072) | 28 | 1.3 | 4 | 0.037 | 3.2
organelle organization (GO:0006996) | 56 | 2.5 | 6 | 0.038 | 2.4
cellular component organization or biogenesis (GO:0071840) | 122 | 5.5 | 10 | 0.042 | 1.84
cellular component biogenesis (GO:0044085) | 49 | 2.2 | 6 | 0.022 | 2.74

a Expected number of genes in the GO term was calculated by (‘# in category’/1232)*55. This calculation is only necessary for DAVID analysis, as the expected number is provided in PANTHER.
b Fold enrichment was calculated by Observed #/Expected # of genes in that GO term.

Chapter 5

Thesis integration and conclusions

Summary of key findings

These studies have revealed that major transitions in evolution have shaped molecular evolutionary patterns across a wide range of life.

In Chapter 2 I observed overall relatively even rates of molecular evolution between freshwater and saline taxa. However, freshwater taxa showed significant trends of higher molecular rates in protein-coding genes, as compared with marine or saline lake taxa. This was contrary to previous findings in the literature, based on small sample sizes, for inland saline and marine invertebrate taxa. Possible explanations for higher molecular rates in freshwaters include smaller effective population size, higher mutation rate, and/or some effect linked to a novel ecological niche.

In Chapter 3 I observed relatively even molecular rates between terrestrial and freshwater taxa in insects; however, a significant trend of higher dN/dS ratios was observed in terrestrial taxa, especially when compared with the fully aquatic taxa. The results may suggest reduced effective population size in terrestrial lineages, which could be due to greater specialization, contrary to expectations due to patchier landscape provided by freshwater habitats. No strong molecular convergence was detected on a wider phylogenetic scale in the COI gene tied to freshwater or terrestrial habitat transitions, contrary to a previous observation in a beetle clade.

In Chapter 4 I explored which categories of genes may be under differing selection pressures with the evolution and loss of flight in insects. Gene categories commonly under positive selection in association with flight evolution included catabolic processes; these did not overlap with the categories detected for flight loss, which included metabolic processes. Relaxed selection acting upon oxidative phosphorylation genes in flightless lineages, as compared with flying insect lineages, was similar to that observed in my previous work within insect orders (Mitterboeck and Adamowicz 2013) and reflects other observations of energy-related genes linked to flight gain and loss in bats and birds. Overall, the trends observed for flight gain and loss to some degree reflect those in other animal groups, with some differences.

Synthesis of findings

Together, the conclusions of these thesis studies support a degree of predictability in molecular evolution that can be informed by observable organismal characteristics. Furthermore, from the analysis of bi-directional transitions, the results together suggest that overall there are some trait-specific trends in molecular evolution.

The effect of transition direction was explored using one method described in the introduction: examining the difference in trends for both forward and reverse transition directions. While the sample size for each direction of transition was influenced by asymmetry in trait evolution (some shifts occur more frequently than the reverse), inferences can still be drawn from the results.

In Chapter 2, the direction of the habitat transition and molecular rates corresponded, i.e. freshwater was the new environment as well as the environment with higher rates in certain genes. Freshwater environment was also proposed to be linked to reduced effective population size, which would be a confounding factor if novel environment also led to reduced effective population size itself. However, when reverse directional transitions were examined (freshwater to inland saline or marine) these results were not the opposite of what was otherwise observed, but were weaker. Thus, there may be both an effect of trait state as well as transition direction on the molecular rates here.

In Chapter 3, the terrestrial habitat more frequently had higher dN/dS ratios than the freshwater habitat; however, this trend was not stronger when examining only freshwater to terrestrial transitions. Thus, it appears that transition direction did not influence the relative dN/dS ratios, suggesting some trait-specific trends.

The results of Chapter 4 on insects, in combination with prior literature, provide an opportunity to assess trends in molecular evolution associated with flight across insects, birds, and bats. Unsurprisingly, a consistent pattern emerges with flight: energy-related genes are implicated across all these taxonomic groups. This supports convergence in genetic trends mirroring the convergence in function (powered flight). From these studies, trait-specific patterns of molecular evolution are observed, in opposite ways for flight gain and loss. However, in the Chapter 4 study of insects, there were novel findings relative to the literature in the categories of genes under differing selective pressures for both flight gain and loss. Differences in the gene categories associated with flight in the various taxa (birds, bats, insects) would be expected, as the evolutionary trajectories, contingency, and context of those transitions differed. As well, the timeframes involved in the evolution of these traits in the various taxa likely impact our ability to hypothesize and assess molecular changes in ancestral lineages.

Contributions to the field of molecular evolution and to addressing knowledge gaps

These studies fill knowledge gaps in understanding how organismal traits and habitats are related to trends in molecular evolution, by examining three major transitions that have shaped life but that have not previously been explored in a phylogenetically broad context. Chapters 2 and 3 contribute to the growing body of literature on molecular rate correlates summarised in Table 1.1 of the introduction, while Chapter 4 fills in some gaps left in the genomic literature (Table 1.2) regarding flight. First, the taxa investigated, including diverse eukaryote lineages and insects as a whole, are broadly distributed phylogenetically and as groups represent a large proportion of the multicellular diversity of life on earth. Additionally, a large number of phylogenetically independent contrasts was used in each study. Thus, the results are likely relatively robust and applicable to large blocks of taxa; moreover, owing to the sample size, I may have been able to detect more subtle trends that might otherwise have been overlooked. Secondly, my research on these transitions starts to fill in gaps in the literature regarding the evaluation of the effect of transition direction vs. trait state on molecular evolutionary rates. Finally, the discovery of trends is a crucial first step in the scientific process, and these studies have revealed some patterns that provide a basis for further investigation. Basic applications of these studies include informing methodology in evolutionary work, such as molecular clock analysis and species delineation through molecular markers.

Implications

The findings have implications beyond the specific transitions examined. Different types of molecular changes may be correlated, as is demonstrated by relationships between molecular evolutionary rates and genome size (Bromham et al. 2015) or large-scale genetic rearrangements (Shao et al. 2003). This could be due to common causes, such as relaxed selection in the genome, or effects of one on the other. Thus, understanding one source of genetic variation can help us to understand and predict other sources of genetic variation.

Furthermore, microevolutionary trends relate to macroevolutionary trends. An increased pace of microevolutionary processes is expected to translate into, or be associated with, to some extent, an increased pace of macroevolution. Studies have provided evidence toward a link between molecular rates and diversification in various taxa, and particularly frequently in plants (Eo and deWoody 2010, Duchene and Bromham 2013). Possible explanations include speciation leading to reduced population size and increased molecular evolutionary rates, or conversely, reduced population size allowing greater adaptive evolution (discussed below) (Venditti and Pagel 2010). Associations can also be due to biological characteristics that have an impact on both evolutionary levels: for example, small effective population size is expected to reduce within-species genetic diversity and increase between-lineage rates of molecular evolution (Fujisawa et al. 2015), but also to increase extinction risk of lineages over evolutionary time (Davies et al. 2000).

Patterns in genetic changes associated with certain environments or biological traits could have implications for interpretation of macroevolutionary patterns. The rate of morphological evolution may correspond to some extent to microevolutionary pace (Omland et al. 1997, but see Bromham et al. 2002). Additionally, different types of macroevolutionary change outside of diversification could correspond with molecular evolutionary rates. Relaxed selective constraints, for example due to reduced effective population size, may be associated with increased phenotypic plasticity, which could lead to greater evolutionary novelty (Hunt et al. 2011). Similarly, reduced population size may also influence genome complexity through nonadaptive processes, providing new substrates for natural selection (Lynch and Conery 2003). Thus, on top of ecological opportunity upon entering new niche spaces, there could be a genetic basis for increased evolutionary opportunity associated with trait shifts.

While the debate between Gould (1989) and Conway Morris (1998) on contingency vs. inevitability addressed the functional or phenotypic level of organisms, a similar opposition of contingency vs. convergence applies to molecular evolution. There is no doubt that both chance and repeatability govern macro- and microevolutionary worlds, as numerous observations have supported (Storz et al. 2016). Predictability is, to some extent, necessitated by shared ancestry of all life on earth. Discovering associations between molecular evolution and observable organismal traits was a primary focus of this thesis.

Limitations

One limitation of this work is that I have relied upon available genetic data. This in itself has allowed the inclusion of a large taxonomic breadth of lineages as well as the incorporation of expertise from various researchers working on those taxa; however, gene availability necessarily limits the types of questions that can be investigated, and the use of genetic and phylogenetic data from many authors could have introduced noise into the results (Chapters 2 and 3). While transcriptome data (Chapter 4) are useful for obtaining taxonomic breadth at this stage of sequencing technology, the data do not represent the entire genome. Furthermore, although the transcriptome data were collected together in a consistent manner, there is the possibility that the data contain remaining contamination or other errors, such as sequences actually belonging to non-target species, which may slightly impact results.

A second limitation is the type and number of tests conducted. I generally performed gene-wide evaluations of molecular rates (Chapters 2 and 3) since the research questions involved mainly genome-wide expectations on molecular evolution, with a focus on relaxed selection. However, it is possible for strong positive selection to give a similar signal to relaxed selection when evaluating rates gene-wide, and thus I could not pinpoint instances of positive selection in these results. Specific tests of positive or relaxed selection could be performed; however, the greatest benefit of these particular tests would be obtained with a larger availability of loci. For example, the investigation of positive selection associated with marine to freshwater habitat shifts using genomes would provide information on adaptation associated with those shifts, as was performed for terrestrial to marine shifts in mammals (Foote et al. 2015). In Chapter 4 there is some potential difficulty in accurately detecting positive and relaxed selection due to the ancient divergence of the lineages investigated and potential false positives or negatives.

Lastly, the causes of the trends in molecular evolutionary rates observed in Chapters 2 and 3 can only be hypothesized since I did not directly measure parameters, such as effective population size or metabolic rate, in each environment. Furthermore, I cannot explain all of the trends observed across gene types. Thus, this thesis focused upon detecting patterns, and potential mechanisms were discussed in each chapter, but testing these mechanisms would require further research, for example, through measurement of key parameters, specific tests of selection, and inclusion of a greater number of genes (further discussed below).

Future work and recommendations

Future study could involve targeted data collection on select taxa to circumvent limitations involved with data availability. As well, measurement of key habitat and biological parameters in select taxa, such as effective population size and specialization, would allow more precise identification of the causes of the trends I observed. Furthermore, future studies could gain biological information associated with the shifts (Chapters 2 and 3) by examining genes and gene types under positive and relaxed selection, as was performed in Chapter 4. Particular convergent sites located through these tests could furthermore be evaluated through laboratory tests of functional importance.

Next, the approach I employed to investigate the potential influence of transition directionality on molecular evolutionary trends was not the only approach to examine such a question. Where time-stamps are available for lineage divergences, I recommend further work on this question with the use of molecular clocks for comparing earlier and later rates within and between clades. As well, further work could consider whether more-distantly-ancestral habitat might affect interpretation of trends in molecular evolution, for example, in the case of possible freshwater ancestral habitat of marine fish lineages (Vega and Wiens 2012) that have undergone marine to freshwater transitions.

Lastly, in the introduction I began to synthesize the relative strength of various biological or ecological traits or underlying mechanisms as they relate to molecular evolution. However, more work quantifying the relative effects is needed. Few studies have looked at multiple factors at the same time (Davies et al. 2004, Welch et al. 2008, Bromham et al. 2015, Fujisawa et al. 2015). Multivariate analysis on a large phylogeny including variation in multiple biological, ecological, and genetic characteristics would allow the relative quantification of the effects in explaining the total rate variability present.

Final conclusions

This thesis work was novel and significant in these regards: 1) it included work on topics previously not explored but of broad interest to evolutionary biologists, such as habitat transitions applicable to many life groups and the enigmatic evolution of flight in insects, and contributed to basic knowledge of what biological or ecological traits are associated with trends in molecular evolution; 2) it focused on taxonomic groups understudied but representing large blocks of life, including insects, while most molecular work has been performed on mammals and plants; 3) it included the discovery of new trends, with some observations contrary to previous expectations; and 4) it employed some differing approaches to previous literature through the examination of reverse direction transitions and in exploratory contexts. Each of these thesis studies presents its own story and conclusions, and together they support a degree of predictability in molecular evolution tied to the biology and ecology of organisms.

Literature Cited

Adamowicz, S. J., and A. Purvis. 2006. From more to fewer? Testing an allegedly pervasive trend in the evolution of morphological structure. Evolution 60:1402–16.

Adamowicz, S. J., and V. Sacherová. 2006. Testing the directionality of evolution: the case of chydorid crustaceans. J. Evol. Biol. 19:1517–30.

Adamowicz, S. J., A. Petrusek, J. K. Colbourne, P. D. N. Hebert, and J. D. S. Witt. 2009. The scale of divergence: a phylogenetic appraisal of intercontinental allopatric speciation in a passively dispersed freshwater zooplankton genus. Mol. Phylogenet. Evol. 50:423–36.

Adamowicz, S. J., S. Menu-Marque, S. A. Halse, J. C. Topan, T. S. Zemlak, P. D. N. Hebert, and J. D. S. Witt. 2010. The evolutionary diversification of the Centropagidae (Crustacea, ): A history of habitat shifts. Mol. Phylogenet. Evol. 55:418–30.

Ai, W.-M., S.-B. Chen, X. Chen, X.-J. Shen, and Y.-Y. Shen. 2014. Parallel evolution of IDH2 gene in cetaceans, primates and bats. FEBS Lett. 588:450–4.

Alverson, A. J., R. K. Jansen, and E. C. Theriot. 2007. Bridging the rubicon: phylogenetic analysis reveals repeated colonizations of marine and fresh waters by thalassiosiroid diatoms. Mol. Phylogenet. Evol. 45:193–210.

Arbiza, L., J. Dopazo, and H. Dopazo. 2006. Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome. PLoS Comput. Biol. 2:e38.

Arts, M. T., R. D. Robarts, M. J. Waiser, V. P. Tumber, A. J. Plante, and H. J. de Lange. 2000. The attenuation of ultraviolet radiation in high dissolved organic carbon waters of wetlands and lakes on the northern Great Plains. Limnol. Ocean. 45:292–9.

Ashkenazy, H., O. Penn, A. Doron-Faigenboim, O. Cohen, G. Cannarozzi, O. Zomer, and T. Pupko. 2012. FastML: a web server for probabilistic reconstruction of ancestral sequences. Nucleic Acids Res. 40:W580–4.

Aspöck, U., E. Haring, and H. Aspöck. 2012. The phylogeny of the Neuropterida: long lasting and current controversies and challenges (Insecta: Endopterygota) The monophyly of Neuropterida. Arthropod Syst. Phylogeny 70:119–29.

Audzijonyté, A., J. Damgaard, S.-L. Varvio, J. K. Vainio, and R. Väinölä. 2005. Phylogeny of Mysis (Crustacea, Mysida): history of continental invasions inferred from molecular and morphological data. Cladistics 21:575–96.

Averof, M., and S. M. Cohen. 1997. Evolutionary origin of insect wings from ancestral gills. Nature 385:627–30.

Ballard, J. W. O., and M. Kreitman. 1995. Is mitochondrial DNA a strictly neutral marker? Trends Ecol. Evol. 10:485–8.

Barraclough, T. G., S. Nee, and P. H. Harvey. 1998. Sister-group analysis in identifying correlates of diversification. Evol. Ecol. 12:751–4.

Barraclough, T. G., and V. Savolainen. 2001. Evolutionary rates and species diversity in flowering plants. Evolution 55:677–83.

Bass, D., N. Brown, J. Mackenzie-Dodds, P. Dyal, S. A. Nierzwicki-Bauer, A. A. Vepritskiy, and T. A. Richards. 2009. A molecular perspective on ecological differentiation and biogeography of cyclotrichiid ciliates. J. Eukaryot. Microbiol. 56:559–67.

Benjamini, Y., and Y. Hochberg. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57:289–300.

Bennett, A. M. R. 2008. Global diversity of hymenopterans (Hymenoptera; Insecta) in freshwater. Hydrobiologia 595:529–34.

Bernhard, D., C. Schmidt, A. Korte, G. Fritzsch, and R. G. Beutel. 2006. From terrestrial to aquatic habitats and back again? Molecular insights into the evolution and phylogeny of Hydrophiloidea (Coleoptera) using multigene analyses. Zool. Scr. 35:597–606.

Besnard, G., A. M. Muasya, F. Russier, E. H. Roalson, N. Salamin, and P.-A. Christin. 2009. Phylogenomics of C4 photosynthesis in sedges (Cyperaceae): multiple appearances and genetic convergence. Mol. Biol. Evol. 26:1909–19.

Betancur-R, R. 2010. Molecular phylogenetics supports multiple evolutionary transitions from marine to freshwater habitats in ariid catfishes. Mol. Phylogenet. Evol. 55:249–58.

Betancur-R, R., G. Ortí, A. M. Stein, A. P. Marceniuk, and R. A. Pyron. 2012. Apparent signal of competition limiting diversification after ecological transitions from marine to freshwater habitats. Ecol. Lett. 15:822–30.

Bieler, A. R., P. M. Mikkelsen, T. M. Collins, E. A. Glover, V. L. González, D. L. Graf, E. M. Harper, J. Healy, G. Y. Kawauchi, P. P. Sharma, S. Staubach, E. E. Strong, J. D. Taylor, I. Tëmkin, J. D. Zardus, S. Clark, A. Guzmán, E. McIntyre, P. Sharp, and G. Giribet. 2014. Investigating the bivalve tree of life – an exemplar-based approach combining molecular and novel morphological characters. Invertebr. Syst. 28:32–115.

Bloom, D. D., and N. R. Lovejoy. 2012. Molecular phylogenetics reveals a pattern of biome conservatism in New World anchovies (family Engraulidae). J. Evol. Biol. 25:701–15.

Bloom, D. D., J. T. Weir, K. R. Piller, and N. R. Lovejoy. 2013. Do freshwater fishes diversify faster than marine fishes? A test using state-dependent diversification analyses and molecular phylogenetics of new world silversides (Atherinopsidae). Evolution 67:2040–57.

Botello, A., and F. Alvarez. 2013. Phylogenetic relationships among the freshwater genera of palaemonid shrimps (Crustacea: Decapoda) from Mexico: evidence of multiple invasions? Lat. Am. J. Aquat. Res. 41:773–80.

Bowman, H. H. M. 1956. Salinity data on marine and inland waters and plant distribution. Ohio J. Sci. 56:101–6.

Brammer, C. A., and C. D. von Dohlen. 2007. Evolutionary history of Stratiomyidae (Insecta: Diptera): the molecular phylogeny of a diverse family of flies. Mol. Phylogenet. Evol. 43:660–73.

Bråte, J., D. Klaveness, T. Rygh, K. S. Jakobsen, and K. Shalchian-Tabrizi. 2010. Telonemia-specific environmental 18S rDNA PCR reveals unknown diversity and multiple marine-freshwater colonizations. BMC Microbiol. 10:168.

Brisson, J. A., G. K. Davis, and D. L. Stern. 2007. Common genome-wide patterns of transcript accumulation underlying the wing polyphenism and polymorphism in the pea aphid (Acyrthosiphon pisum). Evol. Dev. 9:338–46.

Bromham, L. 2016. An introduction to molecular evolution and phylogenetics. Second edition. Oxford University Press, Oxford, UK.

Bromham, L., A. Rambaut, and P. H. Harvey. 1996. Determinants of rate variation in mammalian DNA sequence evolution. J. Mol. Evol. 43:610–21.

Bromham, L., M. Woolfit, M. S. Y. Lee, and A. Rambaut. 2002. Testing the relationship between morphological and molecular rates of change along phylogenies. Evolution 56:1921–30.

Bromham, L., and M. Cardillo. 2003. Testing the link between the latitudinal gradient in species richness and rates of molecular evolution. J. Evol. Biol. 16:200–7.

Bromham, L., and R. Leys. 2005. Sociality and the rate of molecular evolution. Mol. Biol. Evol. 22:1393–402.

Bromham, L., P. F. Cowman, and R. Lanfear. 2013. Parasitic plants have increased rates of molecular evolution across all three genomes. BMC Evol. Biol. 13:126.

Bromham, L., X. Hua, R. Lanfear, and P. F. Cowman. 2015. Exploring the relationships between mutation rates, life history, genome size, environment, and species richness in flowering plants. Am. Nat. 185:507–24.

Brook, W. J., F. J. Diaz-Benjumea, and S. M. Cohen. 1996. Organizing spatial pattern in limb development. Annu. Rev. Cell Dev. Biol. 12:161–80.

Carr, M., B. S. C. Leadbeater, R. Hassan, M. Nelson, and S. L. Baldauf. 2008. Molecular phylogeny of choanoflagellates, the sister group to Metazoa. Proc. Natl. Acad. Sci. U. S. A. 105:16641–6.

Carver, M., G. F. Gross, and T. E. Woodward. 1991. Hemiptera (Bugs, leafhoppers, cicadas, aphids, scale insects, etc.). in Division of Entomology, CSIRO, The Insects of Australia. Cornell University Press, Ithaca, USA.

Castellana, S., S. Vicario, and C. Saccone. 2011. Evolutionary patterns of the mitochondrial genome in Metazoa: exploring the role of mutation and selection in mitochondrial protein-coding genes. Genome Biol. Evol. 3:1067–79.

Castresana, J. 2000. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol. 17:540–52.

Cavalier-Smith, T., and S. von der Heyden. 2007. Molecular phylogeny, scale evolution and of heliozoa. Mol. Phylogenet. Evol. 44:1186–203.

Chapman, E. G., A. A. Przhiboro, J. D. Harwood, B. A. Foote, and W. R. Hoeh. 2012. Widespread and persistent invasions of terrestrial habitats coincident with larval feeding behavior transitions during snail-killing fly evolution (Diptera: Sciomyzidae). BMC Evol. Biol. 12:175.

Clark-Hachtel, C. M., D. M. Linz, and Y. Tomoyasu. 2013. Insights into insect wing origin provided by functional analysis of vestigial in the red flour beetle, Tribolium castaneum. Proc. Natl. Acad. Sci. U. S. A. 110:16951–6.

Colbourne, J. K., C. C. Wilson, and P. D. N. Hebert. 2006. The systematics of Australian Daphnia and Daphniopsis (Crustacea: Cladocera): a shared phylogenetic history transformed by habitat-specific rates of evolution. Biol. J. Linn. Soc. 89:469–88.

Conley, K. E., and S. L. Lindstedt. 2002. Energy-saving mechanisms in muscle: the minimization strategy. J. Exp. Biol. 205:2175–81.

Conway Morris, S. 1998. The crucible of creation: the Burgess shale and the rise of animals. Oxford University Press, Oxford, England.

Cronin, G., K. D. Wissing, and D. M. Lodge. 1998. Comparative feeding selectivity of herbivorous insects on water lilies: Aquatic vs. semi-terrestrial insects and submersed vs. floating leaves. Freshw. Biol. 39:243–57.

Curler, G. R., and J. K. Moulton. 2012. Phylogeny of psychodid subfamilies (Diptera: Psychodidae) inferred from nuclear DNA sequences with a review of morphological evidence. Syst. Entomol. 37:603–16.

da Fonseca, R. R., W. E. Johnson, S. J. O’Brien, M. J. Ramos, and A. Antunes. 2008. The adaptive evolution of the mammalian mitochondrial genome. BMC Genomics 22:119.

Davies, K. F., C. R. Margules, and J. F. Lawrence. 2000. Which traits of species predict population declines in experimental forest fragments? Ecology 81:1450–61.

Davies, T. J., V. Savolainen, M. W. Chase, J. Moat, and T. G. Barraclough. 2004. Environmental energy and evolutionary rates in flowering plants. Proc. R. Soc. B. 271:2195–200.

Davis, A. M., P. J. Unmack, B. J. Pusey, J. B. Johnson, and R. G. Pearson. 2012. Marine-freshwater transitions are associated with the evolution of dietary diversification in terapontid grunters (Teleostei: Terapontidae). J. Evol. Biol. 25:1163–79.

De Grassi, A., C. Lanave, and C. Saccone. 2008. Genome duplication and gene-family evolution: The case of three OXPHOS gene families. Gene 421:1–6.

Delport, W., A. F. Y. Poon, S. D. W. Frost, and S. L. Kosakovsky Pond. 2010. Datamonkey 2010: A suite of phylogenetic analysis tools for evolutionary biology. Bioinformatics 26:2455–7.

DeWoody, J. A., and J. C. Avise. 2000. Microsatellite variation in marine, freshwater and anadromous fishes compared with other animals. J. Fish Biol. 56:461–73.

Dijkstra, K.-D. B., M. T. Monaghan, and S. U. Pauls. 2014. Freshwater biodiversity and aquatic insect diversification. Annu. Rev. Entomol. 59:143–63.

Dollo, L. 1893. Les Lois de l’evolution. Bull. Soc. Geol. Pal. Hydro. 7:164–6.

Drummond, D. A., J. D. Bloom, C. Adami, C. O. Wilke, and F. H. Arnold. 2005. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. U. S. A. 102:14338–43.

Duchene, D., and L. Bromham. 2013. Rates of molecular evolution and diversification in plants: chloroplast substitution rates correlated with species-richness in the Proteaceae. BMC Evol. Biol. 13:65.

Eo, S. H., and J. A. DeWoody. 2010. Evolutionary rates of mitochondrial genomes correspond to diversification rates and to contemporary species richness in birds and reptiles. Proc. R. Soc. B. 277:3587–92.

Erecinska, M., and D. F. Wilson. 1982. Regulation of cellular energy metabolism. J. Membr. Biol. 70:1–14.

Evans, D. H. 2008. Osmotic and ionic regulation: cells and animals. CRC Press, Boca Raton, FL.

Evans, P. D., and V. B. Wigglesworth (eds). 1988. Advances in insect physiology. Academic Press Inc., London, UK.

Favre, N., and W. Rudin. 1996. Salt-dependent performance variation of DNA polymerases in co-amplification PCR. Biotechniques 21:28–30.

Felsenstein, J. 1985. Phylogenies and the comparative method. Am. Nat. 125:1–15.

Figueroa, R. I., and K. Rengefors. 2006. Life cycle and sexuality of the freshwater raphidophyte Gonyostomum semen (Raphidophyceae). J. Phycol. 42:859–71.

Fiol, D. F., and D. Kültz. 2007. Osmotic stress sensing and signaling in fishes. FEBS J. 274:5790–8.

Floyd, R., E. Abebe, A. Papert, and M. Blaxter. 2002. Molecular barcodes for soil nematode identification. Mol. Ecol. 11:839–50.

Foltz, D. W. 2003. Invertebrate species with nonpelagic larvae have elevated levels of nonsynonymous substitutions and reduced nucleotide diversities. J. Mol. Evol. 57:607–12.

Foote, A. D., Y. Liu, G. W. C. Thomas, T. Vinař, J. Alföldi, J. Deng, S. Dugan, C. E. van Elk, M. E. Hunter, V. Joshi, Z. Khan, C. Kovar, S. L. Lee, K. Lindblad-Toh, A. Mancia, R. Nielsen, X. Qin, J. Qu, B. J. Raney, N. Vijay, J. B. W. Wolf, M. W. Hahn, D. M. Muzny, K. C. Worley, M. T. P. Gilbert, and R. A. Gibbs. 2015. Convergent evolution of the genomes of marine mammals. Nat. Genet. 47:272–5.

Foottit, R. J., and P. H. Adler. 2009. Insect Biodiversity: Science and Society. John Wiley & Sons, Chichester, UK.

Frey, D. G. 1993. The penetration of cladocerans into saline waters. Hydrobiologia 267:233–48.

Fujisawa, T., A. P. Vogler, and T. G. Barraclough. 2015. Ecology has contrasting effects on genetic variation within species versus rates of molecular evolution across species in water beetles. Proc. R. Soc. B. 282:20142476.

Galtier, N., R. W. Jobson, B. Nabholz, S. Glemin, and P. U. Blier. 2009. Mitochondrial whims: metabolic rate, longevity and the rate of molecular evolution. Biol. Lett. 5:413–6.

Garland, T., P. H. Harvey, and A. R. Ives. 1992. Procedure for the analysis of comparative data using phylogenetically independent contrasts. Syst. Biol. 41:18–32.

Gehring, W. J., and K. Ikeo. 1999. Pax 6: Mastering eye morphogenesis and eye evolution. Trends Genet. 15:371–7.

Gharib, W. H., and M. Robinson-Rechavi. 2013. The branch-site test of positive selection is surprisingly robust but lacks power under synonymous substitution saturation and variation in GC. Mol. Biol. Evol. 30:1675–86.

Gillman, L. N., D. J. Keeling, H. A. Ross, and S. D. Wright. 2009. Latitude, elevation and the tempo of molecular evolution in mammals. Proc. R. Soc. B. 276:3353–9.

Gillman, L. N., D. J. Keeling, R. C. Gardner, and S. D. Wright. 2010. Faster evolution of highly conserved DNA in tropical plants. J. Evol. Biol. 23:1327–30.

Gillooly, J. F., J. H. Brown, G. B. West, V. M. Savage, and E. L. Charnov. 2001. Effects of size and temperature on metabolic rate. Science. 293:2248–51.

Gould, S. J. 1989. Wonderful life: the Burgess shale and the nature of history. W. W. Norton & Co., New York.

Graf, D. L. 2013. Patterns of freshwater bivalve global diversity and the state of phylogenetic studies on the Unionoida, Sphaeriidae, and Cyrenidae. Amer. Malac. Bull. 31:135–53.

Graur, D., and W.-H. Li. 2000. Fundamentals of Molecular Evolution. Second edi. Sinauer Associates, Sunderland MA.

Grimaldi, D., and M. S. Engel. 2005. Evolution of the Insects. Cambridge University Press, New York.

Grosberg, R. K., G. J. Vermeij, and P. C. Wainwright. 2012. Biodiversity in water and on land. Curr. Biol. 22:R900–3.

Haase, M. 2005. Rapid and convergent evolution of parental care in hydrobiid gastropods from New Zealand. J. Evol. Biol. 18:1076–86.

Hahn, M. W., M. V. Han, and S.-G. Han. 2007. Gene family evolution across 12 Drosophila genomes. PLoS Genet. 3:e197.

Hall, B. G. 2005. Comparison of the accuracies of several phylogenetic methods using protein and DNA sequences. Mol. Biol. Evol. 22:792–802.

Hamilton, H., S. Caballero, A. G. Collins, and R. L. Brownell. 2001. Evolution of river dolphins. Proc. R. Soc. B. 268:549–56.

Hayashi, F., Y. Kamimura, and T. Nozaki. 2008. Origin of the transition from aquatic to terrestrial habits in Nothopsyche caddisflies (Trichoptera: Limnephilidae) based on molecular phylogeny. Zoolog. Sci. 25:255–60.

Hebert, P. D. N., E. A. Remigio, J. K. Colbourne, D. J. Taylor, and C. C. Wilson. 2002. Accelerated molecular evolution in halophilic crustaceans. Evolution 56:909–26.

Hebert, P. D. N., S. Ratnasingham, and J. R. deWaard. 2003. Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc. R. Soc. B. 270 Suppl:S96–9.

Hebert, P. D. N., E. H. Penton, J. M. Burns, D. H. Janzen, and W. Hallwachs. 2004. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proc. Natl. Acad. Sci. U. S. A. 101:14812–7.

Heger, T. J., E. A. D. Mitchell, M. Todorov, V. Golemansky, E. Lara, B. S. Leander, and J. Pawlowski. 2010. Molecular phylogeny of euglyphid testate amoebae (Cercozoa: Euglyphida) suggests transitions between marine supralittoral and freshwater/terrestrial environments are infrequent. Mol. Phylogenet. Evol. 55:113–22.

Henley, W. J., J. L. Hironaka, L. Guillou, and M. A. Buchheim. 2004. Phylogenetic analysis of the “Nannochloris-like” algae and diagnoses of oklahomensis gen. et sp. nov. (Trebouxiophyceae, Chlorophyta). Phycologia 43:641–52.

Herbst, D. B. 2001. Gradients of salinity stress, environmental stability and water chemistry as a templet for defining habitat types and physiological strategies in inland salt waters. Hydrobiologia 466:209–19.

Ho, S. Y. W., R. Lanfear, L. Bromham, M. J. Phillips, J. Soubrier, A. G. Rodrigo, and A. Cooper. 2011. Time-dependent rates of molecular evolution. Mol. Ecol. 20:3087–101.

Hoef-Emden, K. 2008. Molecular phylogeny of phycocyanin-containing cryptophytes: Evolution of biliproteins and geographical distribution. J. Phycol. 44:985–93.

Hollister, J. D., S. Greiner, W. Wang, J. Wang, Y. Zhang, G. K.-S. Wong, S. I. Wright, and M. T. J. Johnson. 2014. Recurrent loss of sex is associated with accumulation of deleterious mutations in Oenothera. Mol. Biol. Evol. 32:896–905.

Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 6:65–70.

Horn, M. H. 1972. The amount of space available for marine and freshwater fishes. Fish. Bull. 70:1295–7.

Hou, Z., B. Sket, C. Fišer, and S. Li. 2011. Eocene habitat shift from saline to freshwater promoted Tethyan amphipod diversification. Proc. Natl. Acad. Sci. U. S. A. 108:14533–8.

Houde, E. D. 1994. Differences between marine and freshwater fish larvae: implications for recruitment. ICES J. Mar. Sci. 51:91–7.

Huang, D. W., B. T. Sherman, and R. A. Lempicki. 2009a. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37:1–13.

Huang, D. W., B. T. Sherman, and R. A. Lempicki. 2009b. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4:44–57.

Huang, Y., N. D. Temperley, L. Ren, J. Smith, N. Li, and D. W. Burt. 2011. Molecular evolution of the vertebrate TLR1 gene family - a complex history of gene duplication, gene conversion, positive selection and co-evolution. BMC Evol. Biol. 11:149.

Hugall, A. F., and M. S. Y. Lee. 2007. The likelihood node density effect and consequences for evolutionary studies of molecular rates. Evolution 61:2293–307.

Hughes, A. L. 2007. Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level. Heredity. 99:364–73.

Hunt, B. G., L. Ometto, Y. Wurm, D. Shoemaker, S. V. Yi, L. Keller, and M. A. D. Goodisman. 2011. Relaxed selection is a precursor to the evolution of phenotypic plasticity. Proc. Natl. Acad. Sci. U. S. A. 108:15936–41.

Hunt, T., J. Bergsten, Z. Levkanicova, A. Papadopoulou, O. S. John, R. Wild, P. M. Hammond, D. Ahrens, M. Balke, M. S. Caterino, J. Gómez-Zurita, I. Ribera, T. G. Barraclough, M. Bocakova, L. Bocak, and A. P. Vogler. 2007. A comprehensive phylogeny of beetles reveals the evolutionary origins of a superradiation. Science. 318:1913–6.

Hunter, R. L., M. S. Webb, T. M. Iliffe, and J. R. A. Bremer. 2008. Phylogeny and historical biogeography of the cave-adapted genus Typhlatya (Atyidae) in the Caribbean Sea and western Atlantic. J. Biogeogr. 35:65–75.

Igic, B., L. Bohs, and J. R. Kohn. 2006. Ancient polymorphism reveals unidirectional breeding system shifts. Proc. Natl. Acad. Sci. U. S. A. 103:1359–63.

Jobson, R., and V. A. Albert. 2002. Molecular rates parallel diversification contrasts between carnivorous plant sister lineages. Cladistics 18:127–36.

Jones, F. C., M. G. Grabherr, Y. F. Chan, P. Russell, E. Mauceli, J. Johnson, R. Swofford, M. Pirun, M. C. Zody, S. White, E. Birney, S. Searle, J. Schmutz, J. Grimwood, M. C. Dickson, R. M. Myers, C. T. Miller, B. R. Summers, A. K. Knecht, S. D. Brady, H. Zhang, A. A. Pollen, T. Howes, C. Amemiya, J. Baldwin, T. Bloom, D. B. Jaffe, R. Nicol, J. Wilkinson, E. S. Lander, F. Di Palma, K. Lindblad-Toh, and D. M. Kingsley. 2012. The genomic basis of adaptive evolution in threespine sticklebacks. Nature 484:55–61.

Jones, J. M., and M. Gellert. 2004. The taming of a transposon: V(D)J recombination and the immune system. Immunol. Rev. 200:233–48.

Kaltenpoth, M., P. Showers Corneli, D. M. Dunn, R. B. Weiss, E. Strohm, and J. Seger. 2012. Accelerated evolution of mitochondrial but not nuclear genomes of Hymenoptera: new evidence from crabronid wasps. PLoS One 7:e32826.

Kapoor, B. G., and B. Khanna. 2004. Ichthyology handbook. Springer, NY, USA.

Katoh, K., and D. M. Standley. 2013. MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Mol. Biol. Evol. 30:772–80.

Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, UK.

Kits, J. H., S. A. Marshall, and J. H. Skevington. 2013. Phylogeny of the Archiborborinae (Diptera: Sphaeroceridae) based on combined morphological and molecular analysis. PLoS One 8:e51190.

Korall, P., E. Schuettpelz, and K. M. Pryer. 2010. Abrupt deceleration of molecular evolution linked to the origin of arborescence in ferns. Evolution 64:2786–92.

Krogh, A., and T. Weis-Fogh. 1951. The respiratory exchange of the desert locust (Schistocerca gregaria) before, during and after flight. J. Exp. Biol. 28:344–57.

Kupriyanova, E. K., H. A. ten Hove, B. Sket, V. Zakšek, P. Trontelj, and G. W. Rouse. 2009. Evolution of the unique freshwater cave-dwelling tube worm Marifugia cavatica (Annelida: Serpulidae). Syst. Biodivers. 7:389–401.

Kumar, S. 2005. Molecular clocks: four decades of evolution. Nat. Rev. Genet. 6:654–62.

Lancaster, J., and B. J. Downes. 2013. Aquatic Entomology. Oxford University Press, Oxford, UK.

Lanfear, R., J. A. Thomas, J. J. Welch, T. Brey, and L. Bromham. 2007. Metabolic rate does not calibrate the molecular clock. Proc. Natl. Acad. Sci. U. S. A. 104:15388–93.

Lanfear, R., J. J. Welch, and L. Bromham. 2010a. Watching the clock: Studying variation in rates of molecular evolution between species. Trends Ecol. Evol. 25:495–503.

Lanfear, R., S. Y. W. Ho, D. Love, and L. Bromham. 2010b. Mutation rate is linked to diversification in birds. Proc. Natl. Acad. Sci. U. S. A. 107:20423–8.

Lavoué, S., M. Miya, P. Musikasinthorn, W.-J. Chen, and M. Nishida. 2013. Mitogenomic evidence for an Indo-West Pacific origin of the Clupeoidei (Teleostei: ). PLoS One 8:e56485.

Lee, C. E., and M. A. Bell. 1999. Causes and consequences of recent freshwater invasions by saltwater animals. Trends Ecol. Evol. 14:284–8.

Lee, C. E., M. Posavi, and G. Charmantier. 2012. Rapid evolution of body fluid regulation following independent invasions into freshwater habitats. J. Evol. Biol. 25:625–33.

Li, M., Y. Tian, Y. Zhao, and W. Bu. 2012. Higher level phylogeny and the first divergence time estimation of Heteroptera (Insecta: Hemiptera) based on multiple genes. PLoS One 7:e32152.

Little, C. 1990. The terrestrial invasion: an ecophysiological approach to the origins of land animals. Cambridge University Press, Cambridge, UK.

Logares, R., K. Rengefors, and K. Shalchian-Tabrizi. 2007. Extensive phylogenies indicate infrequent marine–freshwater transitions. Mol. Phylogenet. Evol. 45:887–903.

Logares, R., J. Brate, S. Bertilsson, J. L. Clasen, K. Shalchian-Tabrizi, and K. Rengefors. 2009. Infrequent marine–freshwater transitions in the microbial world. Trends Microbiol. 17:414–22.

Logares, R., J. Bråte, F. Heinrich, K. Shalchian-Tabrizi, and S. Bertilsson. 2010. Infrequent transitions between saline and fresh waters in one of the most abundant microbial lineages (SAR11). Mol. Biol. Evol. 27:347–57.

Lovejoy, N. R., and B. B. Collette. 2001. Phylogenetic relationships of New World needlefishes (Teleostei: Belonidae) and the biogeography of transitions between marine and freshwater habitats. Copeia 2001:324–38.

Lutzoni, F., and M. Pagel. 1997. Accelerated evolution as a consequence of transitions to mutualism. Proc. Natl. Acad. Sci. U. S. A. 94:11422–7.

Lynch, M., and J. S. Conery. 2003. The origins of genome complexity. Science. 302:1401–5.

Mallick, S., S. Gnerre, P. Muller, and D. Reich. 2009. The difficulty of avoiding false positives in genome scans for natural selection. Genome Res. 19:922–33.

Margoliash, E. 1963. Primary structure and evolution of cytochrome c. Proc. Natl. Acad. Sci. U. S. A. 50:672–9.

Markova-Raina, P., and D. Petrov. 2011. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Res. 21:863–74.

Marshall, S. A. 2006. Insects: Their Natural History and Diversity. Firefly Books Ltd., Richmond Hill, ON.

Marshall, S. A. 2012. Flies: the Natural History and Diversity of Diptera. Firefly Books Ltd., Richmond Hill, ON.

Marten, A., M. Brändle, and R. Brandl. 2006. Habitat type predicts genetic population differentiation in freshwater invertebrates. Mol. Ecol. 15:2643–51.

Martin, A. P., and S. R. Palumbi. 1993. Body size, metabolic rate, generation time, and the molecular clock. Proc. Natl. Acad. Sci. U. S. A. 90:4087–91.

Mayhew, P. J. 2007. Why are there so many insect species? Perspectives from fossils and phylogenies. Biol. Rev. 82:425–54.

McDowall, R. M. 1997. The evolution of diadromy in fishes (revisited) and its place in phylogenetic analysis. Rev. Fish Biol. Fish. 7:443–62.

McMahon, D. P., A. Hayward, and J. Kathirithamby. 2011. The first molecular phylogeny of Strepsiptera (Insecta) reveals an early burst of molecular evolution correlated with the transition to endoparasitism. PLoS One 6:e21206.

Meredith, R. W., J. E. Janec, J. Gatesy, O. A. Ryder, C. A. Fisher, E. C. Teeling, A. Goodbla, E. Eizirik, T. L. L. Simão, T. Stadler, D. L. Rabosky, R. L. Honeycutt, J. J. Flynn, C. M. Ingram, C. Steiner, T. L. Williams, T. J. Robinson, A. Burk-Herrick, and M. Westerman. 2011. Impacts of the Cretaceous terrestrial revolution and KPg extinction on mammal diversification. Science. 334:521–4.

Merritt, R. W., and K. J. Cummins (eds). An introduction to the aquatic insects of North America. Third edi. Kendall/Hunt Publishing Co., Dubuque, IA.

Mi, H., S. Poudel, A. Muruganujan, J. T. Casagrande, and P. D. Thomas. 2016. PANTHER version 10: expanded protein families and functions, and analysis tools. Nucleic Acids Res. 44:D336–42.

Misof, B., S. Liu, K. Meusemann, R. S. Peters, A. Donath, C. Mayer, P. B. Frandsen, J. Ware, T. Flouri, R. G. Beutel, O. Niehuis, M. Petersen, F. Izquierdo-Carrasco, T. Wappler, J. Rust, A. J. Aberer, U. Aspöck, H. Aspöck, D. Bartel, A. Blanke, S. Berger, A. Bohm, T. R. Buckley, B. Calcott, J. Chen, F. Friedrich, M. Fukui, M. Fujita, C. Greve, P. Grobe, S. Gu, Y. Huang, L. S. Jermiin, A. Y. Kawahara, L. Krogmann, M. Kubiak, R. Lanfear, H. Letsch, Y. Li, Z. Li, J. Li, H. Lu, R. Machida, Y. Mashimo, P. Kapli, D. D. McKenna, G. Meng, Y. Nakagaki, J. L. Navarrete-Heredia, M. Ott, Y. Ou, G. Pass, L. Podsiadlowski, H. Pohl, B. M. von Reumont, K. Schutte, K. Sekiya, S. Shimizu, A. Slipinski, A. Stamatakis, W. Song, X. Su, N. U. Szucsich, M. Tan, X. Tan, M. Tang, J. Tang, G. Timelthaler, S. Tomizuka, M. Trautwein, X. Tong, T. Uchifune, M. G. Walzl, B. M. Wiegmann, J. Wilbrandt, B. Wipfler, T. K. F. Wong, Q. Wu, G. Wu, Y. Xie, S. Yang, Q. Yang, D. K. Yeates, K. Yoshizawa, Q. Zhang, R. Zhang, W. Zhang, Y. Zhang, J. Zhao, C. Zhou, L. Zhou, T. Ziesmann, S. Zou, X. Xu, H. Yang, J. Wang, K. M. Kjer, and X. Zhou. 2014. Phylogenomics resolves the timing and pattern of insect evolution. Science. 346:763–7.

Mitterboeck, T. F., and S. J. Adamowicz. 2013. Flight loss linked to faster molecular evolution in insects. Proc. R. Soc. B. 280:20131128.

Morrill, G. A., A. B. Kostellow, and R. K. Gupta. 2014. The pore-lining regions in cytochrome c oxidases: A computational analysis of caveolin, cholesterol and transmembrane helix contributions to proton movement. Biochim. Biophys. Acta 1838:2838–51.

Neilson, M. E., and C. A. Stepien. 2009. Evolution and phylogeography of the tubenose goby genus Proterorhinus (Gobiidae: Teleostei): evidence for new cryptic species. Biol. J. Linn. Soc. 96:664–84.

Neiman, M., and D. R. Taylor. 2009. The causes of mutation accumulation in mitochondrial genomes. Proc. R. Soc. B. 276:1201–9.

Nosil, P. 2002. Transition rates between specialization and generalization in phytophagous insects. Evolution 56:1701–6.

Ogden, T. H., and M. S. Rosenberg. 2006. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. 55:314–28.

Ohta, T. 1972. Population size and rate of evolution. J. Mol. Evol. 1:305–14.

Ohta, T. 1973. Slightly deleterious mutant substitutions in evolution. Nature 246:96–8.

Ohta, T. 1992. The nearly neutral theory of molecular evolution. Annu. Rev. Ecol. Syst. 23:263–86.

Olden, J. D., Z. S. Hogan, and M. J. Vander Zanden. 2007. Small fish, big fish, red fish, blue fish: size-biased extinction risk of the world’s freshwater and marine fishes. Glob. Ecol. Biogeogr. 16:694–701.

Omland, K. E. 1997. Correlated rates of molecular and morphological evolution. Evolution 51:1381–93.

Ord, T. J., and T. C. Summers. 2015. Repeated evolution and the impact of evolutionary history on adaptation. BMC Evol. Biol. 15:137.

Pánek, T., J. D. Silberman, N. Yubuki, B. S. Leander, and I. Cepicka. 2012. Diversity, evolution and molecular systematics of the Psalteriomonadidae, the main lineage of anaerobic/microaerophilic heteroloboseans (Excavata: Discoba). Protist 163:807–31.

Park, J. S., and A. G. B. Simpson. 2010. Characterization of halotolerant Bicosoecida and Placididea (Stramenopila) that are distinct from marine forms, and the phylogenetic pattern of salinity preference in heterotrophic stramenopiles. Environ. Microbiol. 12:1173–84.

Parker, J., G. Tsagkogeorga, J. A. Cotton, Y. Liu, P. Provero, E. Stupka, and S. J. Rossiter. 2013. Genome-wide signatures of convergent evolution in echolocating mammals. Nature 502:228–31.

Paul, L., S.-H. Wang, S. N. Manivannan, L. Bonanno, S. Lewis, C. L. Austin, and A. Simcox. 2013. Dpp-induced Egfr signaling triggers postembryonic wing development in Drosophila. Proc. Natl. Acad. Sci. U. S. A. 110:5058–63.

Pavlidis, P., J. D. Jensen, W. Stephan, and A. Stamatakis. 2012. A critical assessment of storytelling: Gene ontology categories and the importance of validating genomic scans. Mol. Biol. Evol. 29:3237–48.

Penn, O., E. Privman, H. Ashkenazy, G. Landan, D. Graur, and T. Pupko. 2010. GUIDANCE: A web server for assessing alignment confidence scores. Nucleic Acids Res. 38:23–28.

Pond, S. L. K., B. Murrell, M. Fourment, S. D. W. Frost, W. Delport, and K. Scheffler. 2011. A random effects branch-site model for detecting episodic diversifying selection. Mol. Biol. Evol. 24:1–13.

Pons, J., I. Ribera, J. Bertranpetit, and M. Balke. 2010. Nucleotide substitution rates for the full set of mitochondrial protein-coding genes in Coleoptera. Mol. Phylogenet. Evol. 56:796–807.

Porter, M. L., and K. A. Crandall. 2003. Lost along the way: The significance of evolution in reverse. Trends Ecol. Evol. 18:541–7.

R Development Core Team. 2010. R: a language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing.

Ratnasingham, S., and P. D. N. Hebert. 2007. BOLD: The Barcode of Life Data System (www.barcodinglife.org). Mol. Ecol. Notes 7:355–64.

Ratnasingham, S., and P. D. N. Hebert. 2013. A DNA-based registry for all animal species: the barcode index number (BIN) system. PLoS One 8:e66213.

Raupach, M. J., C. Held, and J.-W. Wägele. 2004. Multiple colonization of the deep sea by the Asellota (Crustacea: Peracarida: Isopoda). Deep. Res. II 51:1787–95.

Regier, J. C., A. Zwick, M. P. Cummings, A. Y. Kawahara, S. Cho, S. Weller, A. Roe, J. Baixeras, J. W. Brown, C. Parr, D. R. Davis, M. Epstein, W. Hallwachs, A. Hausmann, D. H. Janzen, I. J. Kitching, M. A. Solis, S.-H. Yen, A. L. Bazinet, and C. Mitter. 2009. Toward reconstructing the evolution of advanced moths and butterflies (Lepidoptera: Ditrysia): an initial molecular study. BMC Evol. Biol. 9:280.

Resh, V. H., and R. T. Carde (eds). 2009. Encyclopedia of Insects, 2nd edi. Elsevier, Burlington, MA.

Ribera, I., and A. P. Vogler. 2000. Habitat type as a determinant of species range sizes: the example of lotic-lentic differences in aquatic Coleoptera. Biol. J. Linn. Soc. 71:33–52.

Ribera, I., T. G. Barraclough, and A. P. Vogler. 2001. The effect of habitat type on speciation rates and range movements in aquatic beetles: inferences from species-level phylogenies. Mol. Ecol. 10:721–35.

Rieseberg, L. H., and B. K. Blackman. 2010. Speciation genes in plants. Ann. Bot. 106:439–55.

Robinson, M., M. Gouy, C. Gautier, and D. Mouchiroud. 1998. Sensitivity of the relative-rate test to taxonomic sampling. Mol. Biol. Evol. 15:1091–8.

Roff, D. A. 1990. The evolution of flightlessness in insects. Ecol. Monogr. 60:389–421.

Roff, D. A. 1991. Life history consequences of bioenergetic and biomechanical constraints on migration. Am. Zool. 31:205–15.

Rolland, J., O. Loiseau, J. Romiguier, and N. Salamin. 2016. Molecular evolutionary rates are not correlated with temperature and latitude in Squamata: an exception to the metabolic theory of ecology? BMC Evol. Biol. 1–6.

Rousset, V., L. Plaisance, C. Erséus, M. E. Siddall, and G. W. Rouse. 2008. Evolution of habitat preference in Clitellata (Annelida). Biol. J. Linn. Soc. 95:447–64.

Roux, J., E. Privman, S. Moretti, J. T. Daub, M. Robinson-Rechavi, and L. Keller. 2014. Patterns of positive selection in seven ant genomes. Mol. Biol. Evol. 31:1661–85.

Rubinoff, D., and P. Schmitz. 2010. Multiple aquatic invasions by an endemic, terrestrial Hawaiian moth radiation. Proc. Natl. Acad. Sci. U. S. A. 107:5903–6.

Santini, F., M. T. T. Nguyen, L. Sorenson, T. B. Waltzek, J. W. Lynch Alfaro, J. M. Eastman, and M. E. Alfaro. 2013. Do habitat shifts drive diversification in fishes? An example from the pufferfishes (). J. Evol. Biol. 26:1003–18.

Schmitz, J., and R. F. A. Moritz. 1998. Sociality and the rate of rDNA sequence evolution in wasps (Vespidae) and honeybees (Apis). J. Mol. Evol. 47:606–12.

Scott, G. R., P. M. Schulte, S. Egginton, A. L. M. Scott, J. G. Richards, and W. K. Milsom. 2011. Molecular evolution of cytochrome c oxidase underlies high-altitude adaptation in the bar-headed goose. Mol. Biol. Evol. 28:351–63.

Sezaki, K., R. A. Begum, P. Wongrat, M. P. Srivastava, S. SriKantha, K. Kikuchi, H. Ishihara, S. Tanaka, T. Taniuchi, and S. Watabe. 1999. Molecular phylogeny of Asian freshwater and marine stingrays based on the DNA nucleotide and deduced amino acid sequences of the cytochrome b gene. Fish. Sci. 65:563–70.

Shalchian-Tabrizi, K., J. Bråte, R. Logares, D. Klaveness, C. Berney, and K. S. Jakobsen. 2008. Diversification of unicellular eukaryotes: colonizations of marine and fresh waters inferred from revised 18S rRNA phylogeny. Environ. Microbiol. 10:2635–44.

Shalchian-Tabrizi, K., K. Reier-Røberg, D. K. Ree, D. Klaveness, and J. Bråte. 2011. Marine-freshwater colonizations of inferred from phylogeny of environmental 18S rDNA sequences. J. Eukaryot. Microbiol. 58:315–8.

Shao, R., M. Dowton, A. Murrell, and S. C. Barker. 2003. Rates of gene rearrangement and nucleotide substitution are correlated in the mitochondrial genomes of insects. Mol. Biol. Evol. 20:1612–9.

Shen, Y.-Y., P. Shi, Y.-B. Sun, and Y.-P. Zhang. 2009. Relaxation of selective constraints on avian mitochondrial DNA following the degeneration of flight ability. Genome Res. 19:1760–5.

Shen, Y.-Y., L. Liang, Z.-H. Zhu, W.-P. Zhou, D. M. Irwin, and Y.-P. Zhang. 2010. Adaptive evolution of energy metabolism genes and the origin of flight in bats. Proc. Natl. Acad. Sci. U. S. A. 107:8666–71.

Shen, Y.-Y., W.-P. Zhou, T.-C. Zhou, Y.-N. Zeng, G.-M. Li, D. M. Irwin, and Y.-P. Zhang. 2012. Genome-wide scan for bats and dolphin to detect their genetic basis for new locomotive styles. PLoS One 7:e46455.

Sievers, F., A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, H. McWilliam, M. Remmert, J. Söding, J. D. Thompson, and D. G. Higgins. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7:1–6.

Simon, M., P. López-García, D. Moreira, and L. Jardillier. 2013. New lineages and multiple independent colonizations of freshwater ecosystems. Environ. Microbiol. Rep. 5:322–32.

Smirnov, A. V., E. S. Nassonova, E. Chao, and T. Cavalier-Smith. 2007. Phylogeny, evolution, and taxonomy of vannellid amoebae. Protist 158:295–324.

Smith, A. B., B. Lafay, and R. Christen. 1992. Comparative variation of morphological and molecular evolution through geologic time: 28S ribosomal RNA versus morphology in echinoids. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 338:365–82.

Smith, M. A., D. M. Wood, D. H. Janzen, W. Hallwachs, and P. D. N. Hebert. 2007. DNA barcodes affirm that 16 species of apparently generalist tropical parasitoid flies (Diptera: Tachinidae) are not all generalists. Proc. Natl. Acad. Sci. U. S. A. 104:4967–72.

Smith, M. A., J. J. Rodriguez, J. B. Whitfield, A. R. Deans, D. H. Janzen, W. Hallwachs, and P. D. N. Hebert. 2008. Extreme diversity of tropical parasitoid wasps exposed by iterative integration of natural history, DNA barcoding, morphology, and collections. Proc. Natl. Acad. Sci. U. S. A. 105:12359–64.

Smith, S. A., and M. J. Donoghue. 2008. Rates of molecular evolution are linked to life history in flowering plants. Science. 322:86–9.

Song, K., H. Xue, R. G. Beutel, M. Bai, D. Bian, J. Liu, Y. Ruan, W. Li, F. Jia, and X. Yang. 2014. Habitat-dependent diversification and parallel molecular evolution: Water scavenger beetles as a case study. Curr. Zool. 60:561–70.

Stone, G., and V. French. 2003. Evolution: Have wings come, gone and come again? Curr. Biol. 13:R436–R438.

Storz, J. F. 2016. Causes of molecular convergence and parallelism in protein evolution. Nat. Rev. Genet. 17:239–50.

Strohm, J. H. T., R. A. Gwiazdowski, and R. Hanner. 2015. Fast fish face fewer mitochondrial mutations: Patterns of dN/dS across fish mitogenomes. Gene 572:27–34.

Strong, E. E., D. J. Colgan, J. M. Healy, C. Lydeard, W. F. Ponder, and M. Glaubrecht. 2011. Phylogeny of the gastropod superfamily Cerithioidea using morphology and molecules. Zool. J. Linn. Soc. 162:43–89.

Sun, Y.-B., W.-P. Zhou, H.-Q. Liu, D. M. Irwin, Y.-Y. Shen, and Y.-P. Zhang. 2013. Genome-wide scans for candidate genes involved in the aquatic adaptation of dolphins. Genome Biol. Evol. 5:130–9.

Suyama, M., D. Torrents, and P. Bork. 2006. PAL2NAL: Robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34:609–12.

Tamura, K., D. Peterson, N. Peterson, G. Stecher, M. Nei, and S. Kumar. 2011. MEGA5: Molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28:2731–9.

Tamura, K., G. Stecher, D. Peterson, A. Filipski, and S. Kumar. 2013. MEGA6: Molecular Evolutionary Genetics Analysis Version 6.0. Mol. Biol. Evol. 30:2725–9.

Tedetti, M., and R. Sempere. 2006. Penetration of ultraviolet radiation in the marine environment: A review. Photochem. Photobiol. 82:389–97.

Thomas, J. A., J. J. Welch, M. Woolfit, and L. Bromham. 2006. There is no universal molecular clock for invertebrates, but rate variation does not scale with body size. Proc. Natl. Acad. Sci. U. S. A. 103:7366–71.

Thomas, J. A., J. J. Welch, R. Lanfear, and L. Bromham. 2010. A generation time effect on the rate of molecular evolution in invertebrates. Mol. Biol. Evol. 27:1173–80.

Tourasse, N. J., and W. H. Li. 2000. Selective constraints, amino acid composition, and the rate of protein evolution. Mol. Biol. Evol. 17:656–64.

Tripoli, G., D. D’Elia, P. Barsanti, and C. Caggese. 2005. Comparison of the oxidative phosphorylation (OXPHOS) nuclear genes in the genomes of Drosophila melanogaster, Drosophila pseudoobscura and Anopheles gambiae. Genome Biol. 6:R11.1–11.17.

Ujvari, B., N. R. Casewell, K. Sunagar, K. Arbuckle, W. Wüster, N. Lo, D. O’Meally, C. Beckmann, G. F. King, E. Deplazes, T. Madsen, and D. M. Hillis. 2015. Widespread convergence in toxin resistance by predictable molecular evolution. Proc. Natl. Acad. Sci. U. S. A. 112:11911–6.

Underwood, J. N., M. J. Travers, and J. P. Gilmour. 2012. Subtle genetic structure reveals restricted connectivity among populations of a coral reef fish inhabiting remote atolls. Ecol. Evol. 2:666–79.

Väinölä, R., J. K. Vainio, and J. U. Palo. 2001. Phylogeography of “glacial relict” Gammaracanthus (Crustacea, Amphipoda) from boreal lakes and the Caspian and White seas. Can. J. Fish. Aquat. Sci. 58:2247–57.

Vega, G. C., and J. J. Wiens. 2012. Why are there so few fish in the sea? Proc. R. Soc. B. 279:2323–9.

Veijalainen, A., N. Wahlberg, G. R. Broad, T. L. Erwin, J. T. Longino, and I. E. Saaksjarvi. 2012. Unprecedented ichneumonid parasitoid wasp diversity in tropical forests. Proc. R. Soc. B. 279:4694–8.

Vélez-Zuazo, X., and I. Agnarsson. 2011. Shark tales: a molecular species-level phylogeny of sharks (Selachimorpha, ). Mol. Phylogenet. Evol. 58:207–17.

Venditti, C., and M. Pagel. 2010. Speciation as an active force in promoting genetic evolution. Trends Ecol. Evol. 25:14–20.

Vermeij, G. J., and R. Dudley. 2000. Why are there so few evolutionary transitions between aquatic and terrestrial ecosystems? Biol. J. Linn. Soc. 70:541–54.

Vigoreaux, J. O., C. Hernandez, J. Moore, G. Ayer, and D. Maughan. 1998. A genetic deficiency that spans the flightin gene of Drosophila melanogaster affects the ultrastructure and function of the flight muscles. J. Exp. Biol. 201:2033–44.

Volkenstein, M. V. 1994. Physical approaches to biological evolution. Springer, Berlin, Germany.

Wägele, J.-W., B. Holland, H. Dreyer, and B. Hackethal. 2003. Searching factors causing implausible non-monophyly: ssu rDNA phylogeny of Isopoda Asellota (Crustacea: Peracarida) and faster evolution in marine than in freshwater habitats. Mol. Phylogenet. Evol. 28:536–51.

Wang, Z., T. Ma, J. Ma, J. Han, L. Ding, and Q. Qiu. 2015. Convergent evolution of SOCS4 between yak and Tibetan antelope in response to high-altitude stress. Gene 572:298–302.

Ward, R. D., M. Woodwark, and D. O. F. Skibinski. 1994. A comparison of genetic diversity levels in marine, freshwater, and anadromous fishes. J. Fish Biol. 44:213–32.

Welch, J. J., and D. Waxman. 2008. Calculating independent contrasts for the comparative study of substitution rates. J. Theor. Biol. 251:667–78.

Welch, J. J., O. R. P. Bininda-Emonds, and L. Bromham. 2008. Correlates of substitution rate variation in mammalian protein-coding sequences. BMC Evol. Biol. 8:53.

Wertheim, J. O., B. Murrell, M. D. Smith, S. L. Kosakovsky Pond, and K. Scheffler. 2014. RELAX: detecting relaxed selection in a phylogenetic framework. Mol. Biol. Evol. 32:1–13.

Whitehead, A. 2010. The evolutionary radiation of diverse osmotolerant physiologies in killifish (Fundulus sp.). Evolution 64:2070–85.

Whiting, M. F., S. Bradler, and T. Maxwell. 2003. Loss and recovery of wings in stick insects. Nature 421:264–7.

Whitlock, M. C., and N. H. Barton. 1997. The effective size of a subdivided population. Genetics 146:427–41.

Wiegmann, B. M., M. D. Trautwein, I. S. Winkler, N. B. Barr, J.-W. Kim, C. Lambkin, M. A. Bertone, B. K. Cassel, K. M. Bayless, A. M. Heimberg, B. M. Wheeler, K. J. Peterson, T. Pape, B. J. Sinclair, J. H. Skevington, V. Blagoderov, J. Caravas, S. N. Kutty, U. Schmidt-Ott, G. E. Kampmeier, F. C. Thompson, D. A. Grimaldi, A. T. Beckenbach, G. W. Courtney, M. Friedrich, R. Meier, and D. K. Yeates. 2011. Episodic radiations in the fly tree of life. Proc. Natl. Acad. Sci. U. S. A. 108:5690–5.

Woolfit, M., and L. Bromham. 2003. Increased rates of sequence evolution in endosymbiotic bacteria and fungi with small effective population sizes. Mol. Biol. Evol. 20:1545–55.

Woolfit, M., and L. Bromham. 2005. Population size and molecular evolution on islands. Proc. R. Soc. B. 272:2277–82.

Woolfit, M. 2009. Effective population size and the rate and pattern of nucleotide substitutions. Biol. Lett. 5:417–20.

Wright, S., J. Keeling, and L. Gillman. 2006. The road from Santa Rosalia: a faster tempo of evolution in tropical climates. Proc. Natl. Acad. Sci. U. S. A. 103:7718–22.

Xue, J., Y.-Y. Bao, B.-L. Li, Y.-B. Cheng, Z.-Y. Peng, H. Liu, H.-J. Xu, Z.-R. Zhu, Y.-G. Lou, J.-A. Cheng, and C.-X. Zhang. 2010. Transcriptome analysis of the brown planthopper Nilaparvata lugens. PLoS One 5:e14233.

Xue, J., X. Q. Zhang, H. J. Xu, H. W. Fan, H. J. Huang, X. F. Ma, C. Y. Wang, J. G. Chen, J. A. Cheng, and C. X. Zhang. 2013. Molecular characterization of the flightin gene in the wing-dimorphic planthopper, Nilaparvata lugens, and its evolution in Pancrustacea. Insect Biochem. Mol. Biol. 43:433–43.

Yamanoue, Y., M. Miya, H. Doi, K. Mabuchi, H. Sakai, and M. Nishida. 2011. Multiple invasions into freshwater by pufferfishes (Teleostei: Tetraodontidae): a mitogenomic perspective. PLoS One 6:e17410.

Yang, X., X. Liu, X. Xu, Z. Li, Y. Li, D. Song, T. Yu, F. Zhu, Q. Zhang, and X. Zhou. 2014a. Gene expression profiling in winged and wingless cotton aphids, Aphis gossypii (Hemiptera: Aphididae). Int. J. Biol. Sci. 10:257–67.

Yang, Y., S. Xu, J. Xu, Y. Guo, and G. Yang. 2014b. Adaptive evolution of mitochondrial energy metabolism genes associated with increased energy demand in flying insects. PLoS One 9:e99120.

Yang, Z., W. S. W. Wong, and R. Nielsen. 2005. Bayes Empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22:1107–18.

Yang, Z. 2007. PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24:1586–91.

Yi, S., and J. T. Streelman. 2005. Genome size is negatively correlated with effective population size in ray-finned fish. Trends Genet. 21:639–43.

Yokoyama, R., and A. Goto. 2005. Evolutionary history of freshwater sculpins, genus (Teleostei: Cottidae) and related taxa, as inferred from mitochondrial DNA phylogeny. Mol. Phylogenet. Evol. 36:654–68.

Zhang, J., and S. Kumar. 1997. Detection of convergent and parallel evolution at the amino acid sequence level. Mol. Biol. Evol. 14:527–36.

Zhang, J., R. Nielsen, and Z. Yang. 2005. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22:2472–9.

Zou, Z., and J. Zhang. 2015. No genome-wide protein sequence convergence for echolocation. Mol. Biol. Evol. 32:1237–41.

Zuckerkandl, E., and L. Pauling. 1962. Molecular disease, evolution, and genic heterogeneity. Pp. 189–225 in Horizons in Biochemistry. Academic Press, New York.

Zuckerkandl, E., and L. Pauling. 1965. Evolutionary divergence and convergence in proteins. Pp. 97–166 in V. Bryson and H. J. Vogel, eds. Evolving Genes and Proteins. Academic Press, New York.

Appendix

Supplementary Material Ch2_S1

Data collection methods and additional tables and figures summarizing the results

A) Table of species analyzed, including habitats occupied and the list of published studies used as data sources

Table Ch2_S1_1. 150 comparisons with genetic data analyzed. Note that two of the listed sister comparisons (indicated by ‘NA’ in the ‘C#’ column) were considered but excluded from the final analysis after the inclusion/exclusion criteria were applied. Table Ch2_S1_1 gives the source study name, the topology source used, the sources of habitat information (if different from the phylogenetic source study), the species in each clade that were included in the analysis together with their general habitat categorization, and the loci available. Loci listed beside the first sister comparison within a source study represent the loci available for all transitions within that study. Protein-coding loci are bolded. SM2 (‘All_Data_rates’ tab) contains more detailed habitat categorization, postulated ancestral states, all rate results, and more detailed information on which specific genes were available for each sister comparison (after applying the inclusion criteria). The ‘odd’ and ‘even’ numbers given for freshwater and saline taxa, respectively, refer to the coding in the PAML analysis input tree files available through the Dryad Digital Repository. In the PAML tree files, the lineages are coded as follows: #1 (freshwater clade) is paired with #2 (saline clade) for the first sister comparison for that study, #3 (freshwater) is paired with #4 (saline) for the second sister comparison for that study, etc. Each sister clade is assigned its own number from the point of divergence onward, while all other lineages are assigned to the same “background” rate. Short non-italicized names beside the species names are abbreviations of the species names used in our PAML input files (since species headers have a maximum number of characters allowed). Where needed to clarify which individuals were analyzed, specimen codes from the source studies are also included below.
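The odd/even coding described in the caption can be sketched as a small lookup. This is only an illustrative sketch: the function names `labels_for_comparison` and `describe_label` are hypothetical and are not taken from the thesis or from the PAML input files. It assumes only what the caption states, namely that within one source study the k-th sister comparison uses label #(2k-1) for the freshwater clade and #(2k) for the saline clade.

```python
def labels_for_comparison(k: int) -> tuple[str, str]:
    """Return (freshwater_label, saline_label) for the k-th sister
    comparison (1-based) within a single source study.

    Assumption taken from the table caption: comparison 1 uses #1/#2,
    comparison 2 uses #3/#4, and so on.
    """
    if k < 1:
        raise ValueError("comparison index must be >= 1")
    return f"#{2 * k - 1}", f"#{2 * k}"


def describe_label(n: int) -> tuple[int, str]:
    """Map a label number back to (comparison index, habitat).

    Odd numbers denote freshwater clades, even numbers saline clades.
    """
    if n < 1:
        raise ValueError("label number must be >= 1")
    habitat = "freshwater" if n % 2 == 1 else "saline"
    return (n + 1) // 2, habitat


if __name__ == "__main__":
    # Labels for the first three comparisons of a hypothetical study.
    for k in range(1, 4):
        print(k, labels_for_comparison(k))
```

For instance, under this assumed scheme a lineage labeled #4 in a tree file would belong to the saline clade of the second sister comparison of that study.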

#P = the number of comparisons (sister pairs) used from the source study

C# = the unique number assigned to that sister comparison, as labeled in manuscript Table 2.1 and Figure 2.2. Within each source study, the comparisons are listed in order from 1 (e.g. 1-3), corresponding with the labeling that was used in the PAML input files.

#S = the maximum number of species/tips from each sister lineage used

Columns: Source study; Taxa; #P; C#; #S; Freshwater taxa (odd #); Marine/saline taxa (even #); Loci and info. Species or individual names are given as in the source study.

Bieler et al. Bivalves 3 68 1 Urionoidea superfamily: Neotrigonia lamarckii COI 2014 Unio pictorum H3 With habitat 69 1 Sphaeriidae family: Mya arenaria 16S information Sphaerium sp. 28S from Graf 70 1 Cyrenidae family: Cyrenoida floridana 18S 2013 Corbicula fluminea Strong et al. Gastropod 3 71 2 Hemisinus cubanianus Terebralia palustris 16S 2011 superfamily Melanoides tuberculata Telescopium telescopium (plus 3 Figure 10 Cerithiodea Thiara amarula anticipata trna (combined 72 3 Melanopsis praemorsa Modulus modulus genes morphological Holandriana holandri Finella sp. included and molecular) 73 2 Faunus ater Maoricolpus roseus ) 28S Pachychilus sp. australis Haase 2005 Gastropod 3 74 1 Potamopyrgus oppidanus Potamopyrgus kaitunuparaoa COI family 16S Hydrobiidae 75 1 Sororipyrgus kutukutu Potamopyrgus estuarinus 76 2 Obtusopyrgus alpinus Halopyrgus pagodulus Tongapyrgus kohitatea Hadopyrgus ngataana Botello and Crustaceans, 4 63 1 Palaemonetes suttkusi Palaemon longirostris 16S Alverez 2013 Shrimp, Fig. 2 Decopoda, 64 4 Pseudopalaemon chryseus Macrobrachium acanthurus family Macrobrachium Macrobrachium tenellum , tuxtlaense Macrobrachium rosenbergii subfamily Cryphiops luscus Australia Palaemoninae Cryphiops villalobosi Macrobrachium rosenbergii 65 1 Troglomexicanus Pontonia manningi tamaulipasensis 66 2 Creaseria morleyi Gnathophylloides mineri Neopalaemon nahuatlus Periclimenaeus wilsoni Hamilton et al. Cetaceans, 3 1 1 Inia geoffrensis - Brazil Pontoporia blainsvillei CytB 2001 dolphins (a marine ‘river dolphin’) 12S 2 1 Lipotes vexillifer Delphinapterus leucas 16S 3 2 Platanista gangetica Mesoplodon bidens Platanista minor Mesoplodon europaeus Velez-Zuazo Elasmobranch, 2 4 1 Gylphis garricki Lamiopsis temminckii COI and Agnarsson sharks ND2 2011 + Genbank. Glyphis genus FW or Brackish Sezaki et al. Elasmobranch, 3 5 1 Dasyatis laosensis (restricted to Dasyatis sp. 
CytB 1999 family FW, rivers) Figure 5 (ML Dasyatidae 6 1 Himantura signifier Himantura imbricate tree) “Whiptail stingrays” 7 1 Himantura chaophraya (India) Dasyatis akajei

Rousset et al. Clitellata 1 77 1 CapAus DinoGyro 18S 2008 (class) 0 Capilloventer australis Dinophilus gyrociliatus “earthworms, 78 1 BothVejd HeroGrav leeches, and Bothrioneurum vejdovskyanum Heronidrilus gravidus allies” 79 1 MonoRubr PrisLong Monopylephorus rubroniveus Pristina longiseta 80 1 DeroDigi SlavApp Dero digitata Slavina appendiculata 81 3 AinuLutu OlavVacu Ainudrilus lutulentus Olavius vacuus ChaeDias HeteJami Chaetogaster diastrophus Heterodrilus jamiesoni UnciUnci TectBori Uncinais uncinata Tectidrilus bori 82 3 LimnUdek TubiAmpl

Limnodrilus udekemianus Tubificoides amplivasatus TubiIgno TubiParv Tubifex ignotus Tubificoides parviductus TubiTubi TubiBerm Tubifex tubifex Tubificoides bermudae 83 1 IlyoTemp HeteCost Ilyodrilus templetoni Heterochaeta costata NA 1 SpirFero ClitAren Spirosperma ferox Clitellio arenarius 84 5 PlacCost OzobMarg Placobdella costata Ozobranchus margoi HaemGhil PontMuri Haementeria ghilianii Pontobdella muricata HaemMole StibMacro Haementeria molesta Stibarobdella macrothela DinaLine MyzoLugu Dina lineata Myzobdella lugubris ErpoOcho BranTorp Erpobdella ochoterenai Branchellion torpedinis 85 1 PiscGeom CallVivi Piscicola geometra Calliobdella vivida Adamowicz et Centropagidae 3 54 1 LJohans CFurcatus 16S al. 2010 ( Limnocalanus johanseni Centropages furcatus 28S family) 55 1 CTasSubB CClitella Figure 4 Calamoecia tasmanica Calamoecia clitellata subattenuata B 56 1 BMeteor BPoop Boeckella meteoris Boeckella poopoensis Audzijonytė et Mysis 1 57 4 MDiluviana MAmblyops COI al. 2005 Mysis diluviana M. amblyops CytB MRelicta MMicrop1 ITS2 M. relicta M. microphthalma 1 16S MSalemaai MCaspia 18S M. salemaai M. caspia MSeger MMicrop2 M. segerstralei M. microphthalma2 Hunter et al. Typhlatya 1 67 2 Typhlatya mitchelli Typhlatya garciai COI 2008 Typhlatya pearsei Typhlatya iliffei CytB For COI the individuals: 16S TmitSJ3, TpeaSM6 TgarBH4, TiliCP1; Outgroup: AlauMI1) For CytB the individuals: TgarBH2, AlauMI2 For 16S: AlauMI2 Hou et al. 2011 Gammarus 3 58 3 Gammarus pseudolimnaeus G. setosus1 COI Fig.1 G. troglophilus G. locusta3 18S G. bousfieldi G. duebeni 28S 59 1 G. pecos G. tigrinus 60 6 Up to 6 of: Up to 6 of: G. acalceolatus G. mucronatus G. montanus1 G. annulatus G. albimanus G. sp6 G. istrianus G. aequicauda3 G. kesslerianus2 G. aequicauda2 G. triacanthus6 G. crinicornis G. qiani G. locusta4 G. clarus2 G. aequicauda1 G. gregoryi G. chevreuxi Väinölä et al. Gammaracanth 2 61 2 Gammaracanthus lacustris Gammaracanthus caspius COI 2001 us Fig. 
2b) 62 1 Eulimnogammarus cyaneus Monoporeia affinis Betancur-R. et Ariidae 1 8 2 A081 Notarius bonillai A077 Notarius aff. kessleri CytB al. 2012 “Sea catfishes" 0 A082 Notarius cookei A088 Notarius neogranatensis ATP8 Figure with 9 1 A032 Cathorops cf. festae A039 Cathorops multiradiatus ATP6 species names Rag1_1 found in 10 1 A031 Cathorops aguadulce A033 Cathorops cf. higuchi Rag1_2 supplementary Rag2 11 3 A107 Potamarius nelsoni A005 Ariopsis seemanni file ele1802- A106 Potamarius izabalensis A006 Ariopsis sp. MYH6

sup-0003- A108 Potamarius usumacintae A003 Ariopsis felis 12S FigureS1.ai ; 16S 12 1 A062 Hemiarius stormii A122 Sciades sagor Rag1_In Figure with t habitats Fig. 1a 13 1 A045 Cephalocassis borneensis A052 Cryptarius truncatus

14 6 A053 Doiichthys novaeguineae A103 Plicofollis layardi Two PAML A064 Nedystoma dayi A105 Plicofollis tonggol files were A095 Pachyula crassilabris A097 Plicofollis aff argyropleuron used: in the A051 Cochlefelis spatula sp 2 first PAMl file A069 augustus A099 Plicofollis aff. containing A110 Potamosilurus latirostris polystaphylodon comparisons A102 Plicofollis dussumieri ‘1-8’ (i.e. 8-15 A104 Plicofollis nella in Figure 2.2), ignore 15 1 A072 Neoarius midgleyi A002 Amissidens hainesi comparison ‘7’ 16 1 A070 Neoarius berneyi A024 Brustiarius proximus (i.e. 14 in Fig. 2.2). 17 1 A112 Potamosilurus velutinus A118 Sciades mastersi Comparisons ‘7,9,10’ (i.e. 14, 16,17) were run in the second PAML file

Bloom and Engraulidae 7 18 1 Pterengraulis atherinoides Anchoviella lepidentostole CytB Lovejoy 2012 “Anchovies” Rag1 Fig. 4 19 1 Lycengraulis batesii Lycengraulis poeyi Rag2 16S 20 1 Anchoviella alleni (Napo?) Anchoviella brevirostris

21 7 Jurengraulis juruensis Anchoa cubana (W. Atlantic) Anchovia surinamensis Anchoa walkeri (E. Pacific) Anchoviella alleni (Nanay) Anchoa colonensis (W. Atlantic) Amazonsprattus scintilla Anchoa nasus (E. Pacific) (Casiquiare) Engraulis eurystole (W. Atlantic) Anchoviella n. sp1. (Marowijne) Engraulis mordax (E. Pacific) Anchoviella cf. guianensis Anchoa filifera (W. Atlantic) (Rupununi) Anchoviella n. sp3. (Rupununi) 22 1 Lycothrissa crocodilus Setipinna cf. tenuifilis 23 2 Dorosoma cepedianum Alosa sapidissima Pellonula leonensis Brevoortia tyrannus 24 1 Pellona flavipinnis Clupea harengus Davis et al. Terapontidae 2 25 9 Bidyanus bidyanus Terapon theraps CytB 2012 “Terapontid Pingalla lorentzi Helotes octolineatus Rag1_1 Fig. 4 grunters” Hephaestus epirrhinos Helotes sexlineatus Rag1_2 Scortum parviceps Pelates sexlineatus Rag1_3 Scortum ogillbyi Pelates quadrilineatus Rag2 Hephaestus tulliensis Rhyncopelates oxyrhyncus Rag1_In Leiopotherapon unicolor Mesopristes argenteus t1 Amniataba percoides Mesopristes cancellatus Rag1_In Hephaestus habbernai Pelsartia humeralis t2 [5 marine, 4 euryhaline] 26 1 Hannia greenwayi Amniataba caudavittatus Lavoue et al. Clupeoidei 4 27 7 Potamothrissa acutirostris Tenualosa ilisha COI 2013 (excl. Potamothrissa obtusiostris Tenualosa thibaudeavi CytB Fig. on page 6 Engraulidae) Microthrissa royauxi Tenualosa toil ATP6 “Sardines, Microthrissa congica Clupanodon thryssa ATP8 herrings & Odaxothrissa losera Konosirus punctatus 12S relatives” Pellonula leonensis Nematalosa japonica 16S Pellonula vortex chacunda [3 marine, 4 euryhaline] 28 8 Clupeichthys gognioganthus Sardinella maderensis Clupeichthys aerarnensis Sardinella albella Clupeichthys perakensis thoracata Clupeoides borneensis Harengula jaguana Clupeoides sp. “Chao Phraya” Hilsa kelee

Sundasalanx mekongensis Brevoortia tyrannus Sundasalanx praecox Sardina pilchardus Sundasalanx sp. “Chao Phraya” Sardinops melanostictus [all marine] 29 1 Potamalosa richmondia Hyperlophus vittatus 30 1 Pellona flavipinnis Ilisha elongate [euyryhaline] Lovejoy and Belonidae 5 31 1 Pseudotylosurus angusticeps Strongylura timucu CytB Collette 2001 “” Rag2 Fig. 2 32 1 Strongylura hubbsi Strongylura exilis Tmo 33 1 Strongylura fluviatilis Strongylura scapularis 16S 34 1 Belonion apodion Platybelone argalus 35 1 Xenentodon cancila Strongylura incisa Neilson and Proterorhinus 1 36 1 Proterorhinus cf semipellucidus Proterorhinus marmoratus AMR1 COI Stepien 2009 (tubenose ALT1 CytB Figure 3 goby) Rag1 Whitehead Fundulus 4 37 2 Fundulus nottii Adinia xenica CytB 2010 Fundulus notatus Fundulus luciae Gylt 38 1 Lucania goodei Lucania parva Rag1 Physiological types generally 39 3 Fundulus rathbuni Fundulus heteroclitus ME match habitat Fundulus catenatus Fundulus grandis type, except F. Fundulus stellifer Fundulus pulvereus diaphanous 40 1 Fundulus seminolis Fundulus similis (brackish phys., mainly FW habitat); F. zebrinus and F. leansae are highly salt tolerant but confined to inland waters. Whitehead 2010. These conflicts were excluded. Yamanoue et Teraodontidae 3 41 1 asellus ATP6 al. 2011 “Pufferfish” ATP8 Fig. 5 42 6 Carinotetraodon salivator Tetraodon biocellatus COI Carinotetraodon lorteti Pelagocephalus marki COIII *Overlapping Tetraodon palembangensis Canthigaster valentini CytB study Santini Tetraodon cochinchinensis Omegophora armilla ND1 et al. (2013) Tetraodon cutcutia Arothron hispidus ND2 not included in Auriglobus modestus Arothron manilensis ND3 analysis ND4 43 2 Tetraodon mbu Chelonodon pleurospilus ND4L Tetraodon miurus Chelonodon patoca ND5 12S 16S Yokoyama and Cottidae 1 44 1 Cottus nozawae Yamagata Leptocottus armatus 12S Goto 2005 “Sculpins” CR (Mit Control Fig 2 Region)

Bloom et al. Menidiinae 9 45 1 Basilichthys semotilus Leuresthes tenuis CytB 2013 (subfam) ND2 (Teleostei, 46 1 Odontesthes mauleanum Odontesthes smitti Rag1 Fig. 1 Atherinopsidae) 47 1 Atherinella colombiensis Atherinella starksi Tmo “New World Rag 1 and Tmo silversides” 48 1 Atherinella ammophila Membras martinica No Tmo gene: species gene for differ slightly 49 2 Atherinella crystallina Atherinella serrivomer Compar. from this list. Atherinella guatamalanensis Membras gilberti 1 and 7 See SM2 for Except 1 vs 1 for Tmo gene. for this species 50 1 Atherinella sallei Atherinella argentea study

144 numbers and 51 1 Labidesthes sicculus Melanorhinus microps (i.e. 45, PAML input 51 in files for exact 52 1 Chirostoma contrerasi Menidia menidia Fig.1) species used. 53 1 Menidia beryllina Menidia colei Bass et al. Cyclotrichiida 1 86 5 SKL_RX_E4 (EU376944) Mesodinium pulex clone MPCR99 18S 2009 (Ciliophora SKL_P_A3 (EU378938) (AY587130) Fig. 2 order) A_BM1_1 (EU378925) SA2_3C3 (EF526958) A_WL16_8 (EU569625) Mesodinium pulex (DQ411865) SKL_RX_12 (EU378945) CCW75 (AY180041) NA1_2H5 (EF526756) Carr et al. Choanoflagellat 1 87 2 Salpingoeca amphoridium Salpingoeca pyxidium 18S 2008 ea (class), Salpingoeca napiformis Salpingoeca urceolata Fig. 2 Sarpingoeca genus Cavalier-Smith Centrohelida 9 88 1 Polyplacocystis aff. coerulea Helio10 18S and von der (Heliozoa Heyden 2007 class) 89 1 Helio11 Marophrys marina Figure 5 90 1 H19.9 H9.6 91 1 Sphaerastrum fockii H20.6 92 2 H5.6 H28.1 H26.4 H8.9 93 1 Helio5 H28.9 94 2 Choanocystis curvata H27.8 Helio1 H8.7 95 3 H6.7 H28.4 H7.10 H15.6 H22.6 H27.10 96 8 Helio9 H16.3 Pterocystis polymorpha H15.7 H22.4 H17.3 H20.9 Helio3 Chlamydaster sternii TCS H26.1 Helio4 H15.3 H12.9 H28.2 Pterocystis cuspidata H15.10 Alverson et al. Thalassiosirales 6 126 1 Thalassiosira gessneri Thalassiosira weissflogii L1296 Psbc 2007 (Heterokontoph Rbcl Fig. 4 yta order) 127 1 Cyclotella distinguenda Thalassiosira pseudonana 28S CCMP1057 18S 128 1 Cyclotella atomus Cyclotella sp. L1844 129 1 Cyclotella cf. cryptica WC03-01 Cyclotella cryptica 130 2 Cyclotella cf. meneghiniana Cyclotella striata LS03-01 Cyclotella stylorum Cyclotella cf. meneghiniana F8 131 3 Discostella cf. pseudostelligera Thalassiosira sp. CCMP353 Cyclostephanos sp. WTC16 Bacterosira sp. CCMP991 Stephanodiscus yellowstonensis Bacterosira bathyomphala Figueroa and Raphidophycea 1 132 2 Vacuolaria virescens Heterosigma akashiwo 18S Rengefors e Gonyostomum semen Heterosigma carterae 2006 (Heterokontoph Fig. 11 yta class) Henley et al. Trebouxiophyc 1 133 11 Chlorella sp. 
Tanaqocha RA1 Nannochloris sp. UTEX 2491 18S 2004 eae Nannochloris bacillaria Nannochloris sp. UTEX 2378 Fig. 15 (Chlorophyta Nannochloris sp. JL 4-6 Nannochloris sp. RCC 011 class) Nannochloris sp. ANR-9 Nannochloris maculates Nannochloris sp. AS 2-10 Nannochloris occulata Koriella sp. MDL5-3 Nannochloris atomus Gloetila contorta Nannochlorum sp. MBIC10208 Gloeotilla sp. JL 11-10 Nannochlorum sp. MBIC10053 Marvania geminate Nannochlorum sp. MBIC10091 Marvania sp. JL 11-11 Nannochlorum sp. MBIC10096 Nannochloris coccoides CCAP Nannochlorum eukaryotum 251/1b Mainz1 Hoef-Emden 3 134 10 C. sp. M0851 Germany H. rufescens CCAP 984/2 English 28S 2008 C. sp. M2291 Germany Channel 18S

145

Fig. 2 Chroomonas C. sp. CCAP 978/3 UK Wales H. rufescens CCAP 440 USA (Cryptophyta C. sp. SAG 980-1 UK Wales Maine Note: We genus) Komma caudate M1074 H. andersenii CCMP 441 Gulf of excluded Germany Mexico ‘Chroomonas’ C. sp. M2067 Germany H. andersenii CCMP 644 Gulf of from our C. sp. M1481 Germany Mexico analysis of C. coerulea UTEX 2780 USA H. andersenii CCMP 1180 Gulf of Shalchian- Colorado Mexico Tabrizi et al. C. sp. M0874 Germany “Chroomonas” sp. UTEX 2000 2008, as we C. sp. M1953 Germany USA Virginia analyzed the H. tepida CCMP 443 USA Texas genus from H. cf. virescens M1635 Sweden Hoef-Emden H. pacifica CCMP 706 USA 2008. Washington H. cryptochromatica CCMP 1181 USA Maine 135 1 C. sp. M1624 Denmark C. placoidea CCAP 978/8 UK 136 2 C. pochmanni UTEX 2779 USA C. sp. M1703 Denmark Colorado C. sp. M1318 France C. sp. M1312 Germany Shalchian- Cryptophyceae 5 NA 1 Uncultured eukaryote Uncultured eukaryote DQ647536 18S Tabrizi et al. (Cryptophyta AY821951 2008 class; 137 11 Uncultured eukaryote Uncultured eukaryote DQ310300 s 1 Chroomonas AY642715 prolonga AF508272 excluded) Uncultured eukaryote Uncultured eukaryote AY665097 AY919723 Uncultured eukaryote AY665099 Uncultured eukaryote Uncultured eukaryote AY426874 AY919779 cryophila U53124 Uncultured eukaryote Geminigera cryophila strain AY919805 MBIC10567 Uncultured eukaryote Uncultured eukaryote DQ120006 AY642713 Uncultured eukaryote DQ120007 Uncultured eukaryote Uncultured eukaryote DQ781323 AY919781 Uncultured eukaryote AF363183 Uncultured eukaryote AY642724 Uncultured eukaryote AY642712 Uncultured eukaryote AY919733 Uncultured eukaryote AY919695 Plagioselmis nannoplanctica 138 1 baltica AB241128 Rhinomonas sp. AF508273 139 9 Crytomonas sp. AJ420697 Teleaulax acuta AF508275 Uncultured eukaryote theta X57162 AY821950 Hanusia phi U53126 Campylomonas reflexa Falcomonas daucoides AF143943 AF508267 Rhodomonas duplex AB240960 ovate strain CCAC Storealula sp. AF508276 0064 AM051193 Rhodomonas sp. 
AJ421148 Cryptomonas ovate AF508270 Rhodomonas abbreviata U53128 Crytomonas borealis SCCAP Proteomonas sulcata AJ007285 K0063 AJ420696 Cryptomonas sp. AJ007280 Cryptomonas paramaecium AM051194 Uncultured eukaryote AY919727 140 6 sp. AY360459 Uncultured eukaryote AJ536451 Goniomonas sp. AY360458 Goniomonas pacifica AF508277 Goniomonas sp. AY705740 Goniomonas sp. AY360456 Goniomonas sp. AY360457 Goniomonas amphinema Goniomonas sp. AY705739 AY705738 Goniomonas truncate U03072 Goniomonas sp. AY360455 Goniomonas sp. AY360454 Heger et al. 4 113 2 Paulinella chromatophora AB275059 Environmental 18S 2010 FJ456918 sequence

Fig. 4 Euglyphida Paulinella chromatophora EF526891 Environmental (Cercozoa X81811 sequence order) 114 4 AY620259 Environmental Cyphoderia compressa, Galata sequence (BG) 1 Cyphoderia trochus ssp. Cyphoderia compressa, Galata Palustris, Marchairuz (CH) 1 (BG) 2 Cyphoderia ampulla, Vitosha Cyphoderia compressa, Galata (BG) 1 (BG) 3 Cyphoderia amphoralis Rila AY620293 Environmental (BG) 1 sequence 115 1 Cyphoderia ampulla, Moiry AY620325 Environmental (CH) 1 sequence 116 8 Assulina seminulum EF456749 AB453002 Environmental Placocista spinosa EF456748 sequence AB275100 EF024433 Environmental Environmental sequence sequence AY620307 Environmental EF024522 Environmental sequence AY620315 sequence Environmental sequence Trinema enchelys AJ418792 AY620326 Environmental Euglypha acanthophora sequence AJ418788 Corythionella minima, Galata Euglypha penardi EF456753 (BG) 1 Euglypha rotunda AJ418782 Cyphoderia littoralis, Galata (BG) 1 Cyphoderia littoralis, Galata (BG) 2

Smirnov et al. Vannellidea 3 117 10 Vannella miroides ATCC 30945 Vannella anglica AF099101 18S 2007 (Amoebozoa AY18388 Vannella sp. AY929905 Fig.1 family) Vannella lata CCAP 1589/12 Vannella sp. AY929907 Vannella (=Platyamoeba) sp. Vannella (=Platyamoeba) sp. AY929923 AY929919 Vannella sp. AY929912 Vannella (=Platyamoeba) sp. Vannella sp. Geneva strain AY929916 Vanella (=Platyamoeba) placida Vannella (=Platyamoeba) sp. CCAP 1565/2 TYPE AY929918 Vannella sp. “P. plurinucleolus” Vannella sp. AY929906 ATCC 50745 AY121849 Vannella aberdonica AY121853 Vannella sp. AY929911 Vannella (=Platyamoeba) sp. Vannella sp. AY929909 AY929920 Vannella sp. AY929910 Vannella sp. AY929904 118 4 Vannella simplex Geneva Vanella danica CCAP 1589/17 Vannella simplex Gurre Lake Variant 1 TYPE Vannella simplex CCAP 1589/3 Vanella danica CCAP 1589/17 Bonn TYPE Variant 2 TYPE Vannella simplex CCAP 1589/3 Vanella danica CCAP 1589/17 clone 1 TYPE clone1 Vanella danica CCAP 1589/17 clone2 119 2 Ripella (=Vannella) sp. Clydonella sp. AY183892 AY929913 Clydonella sp. AY183890 Ripella sp. from CCAP 1555/2 culture clone1 Logares et al. Dinoflagellata 1 97 3 Ceratium hirundinella Ceratium fusus AF022153 18S 2007 (phylum) 3 AY443014 Ceratium furca AJ276699 Fig. 1 Ceratium hirundinella Ceratium tenue AF022192 AY460574 Ceratium sp. DQ487192 98 1 Cystodinium phaseolus Lingulodinium EF058235 99 1 Hemidinium nasutum AY443016 Protoperidium 100 3 Jadwigia applanata EF058240 Pyrocystis Peridinium sp. AY827955 horrida AF022154 Tovellia leopoliensis AY443025 Amphidium 101 2 Glenodiniopsis steinii AF274257 AB092336 Gymnodinium impatiens Dinophyceae AB092337 EF058239 102 5 Peridinium willei JAPAN Gyrodinium instriatum AY421786 EF058249 Gyrodinium dorsum AF274261 147

Peridinium sp. AF022202 Gyrodinium uncatenum AF274263 Peridinium gatunense Akashiwo DQ166208 Peridinium cinctum SWEDEN EF058245 103 1 Gloeodinium montanum Adenoides eludens AF274249 EF058238 104 1 Peridinium polonicum Scrippsiella sweeneyae AF274276 AY443017 105 1 Peridinium wierzejskii Dinophyceae Shepherds Crook AY443018 AY590479 106 4 Peridinium umbonatum Scrippsiella nutricula U52357 AF274271 Durinskia baltica AF231803 Peridiniopsis borgei EF058241 Lessardia elongata AF521100 Peridinium centennial EF058236 Dinophyceae sp. AY25128 Peridinium cf. centennial EF058237 107 2 Gymnodinium sp. AY829527 Gymnodinium fuscum AF022194 Gymnodinium sp. AY840208 Gymnodinium sp. AF022196 108 1 Gyrodinium helveticum Gyrodinium rubrum AB120003 AB120004 109 1 Woloszynskia pascheri Woloszynskia halophile EF058253 Simon et al. Haptophyta 3 141 1 AN2-Pry1-C12 (Prym_12) Pacific Ocean clone Biosope 18S 2013 (division) T84.038 (FJ537355) Fig. S1 142 1 EV10-Pry1-C4 (Prym_5) Chrysochromulina acantha Prymnesiophyt (AJ246278) *note: We only a (class) 143 4 EV3-Pry1-C11 (Prym_4) Chrysotila lamellosa (AM490998) used Simon et EV3-Pry1-C2 (Prym_1) Pseudoisochrysis paradoxa al. 2013 for EV3-Pry1-C22 (Prym_2) (AM490999) these taxa. EV3-Pry1-C13 (Prym_19) SCH1-Pry1-C6 (Prym_24) Shalchian- Isochrysis litoralis (AM490996) Tabrizi et al. [2 out of 4 are saline continental] 2011 (not included) has overlapping group and species. Due to lack of formal species names in these taxa we did not include transitions from both studies in case they represent the same comparisons. Simon et al. Haptophyta 5 144 4 Scandinavian freshwater clone Pacific Ocean clone Biosope 18S 2013 Fig. 
S2 (division) Finsevatn AI9LL T58.080 (FJ537336) Scandinavian freshwater Marine euphotic zone clone Pavlovophyta Finsevatn APB2H EN360CTD001 (HM581631) (class) Suboxic freshwater pond Exanthemachrysis sp AC37 sediment clone CV1_B1_97 (JF714224) (AY821959) Pavlova pinguis (JF714248) Suboxic freshwater pond sediment clone CV1_B2_32 (AY821960) [some clones] 145 1 BG1C3_Pav3 (Pav_2) Diacronema vlkianum (AJ515246) 146 2 Corcontochrysis noctivaga Diacronema vlkianum (AF106056) (DQ207406) Pavlova lutheri (AF102369) Scandinavian freshwater clone Svaersvann 14 147 1 Scandinavian freshwater Pavlovales sp CCMP2436 Finsevatn 8912 (EU247835) 148 1 Pavlova granifera (JF714231) Pavlova pinguis (AB183600)

148

Park and Bicosoecida, 3 110 1 uncultured stramenopile Bicosoeca petiolata 18S Simpson 2010 Placididea & PSH9SP2005 relatives 111 21 1. uncultured freshwater 1. Cafeteria sp. CAFSW0510 (within phylum eukaryote LG14-04 2. uncultured bicosoecid OC4.1 Heterokonta) 2. uncultured freshwater 3. MESS12 eukaryote LG33-04 4. Cafeteria 3. Paramonas globosa roenbergensisHFCC34 4. Adriamonas peritocrescens 5. Cafeteria 5. uncultured freshwater roenbergensisHFCC33 eukaryote LG21-12 6. ME280100 6. uncultured stramenopile 7. Halocafeteria seosinensis PSE8SP2005 8. uncultured eukaryote EHF0502 7. uncultured bicosoecid 9. ME13100 CH1_5A_8 10. Caecitellus paraparvulus 8. uncultured bicosoecid 11. Caecitellus pseudoparvulus CH1_2B_3 12. Caecitellus parvulus 9. uncultured stramenopile 13. MESS21 PSA11SP2005 14. uncultured marine eukaryote 10. Siluania monomastiga M2_18G10 11. uncultured freshwater 15. uncultured marine eukaryote eukaryote LG60-06 M3_18D02 12. uncultured freshwater 16. Bicosoeca vacillans eukaryote LG28-12 17. uncultured marine eukaryote 13. uncultured freshwater M1_18B12 eukaryote LG05-12 18. cultured marine eukaryote 14. uncultured freshwater M1_18B03 eukaryote LG02-05 19. uncultured marine eukaryote 15. uncultured freshwater 451A09 eukaryote LG01-04 20. uncultured marine eukaryote 16. uncultured freshwater UI12C04 eukaryote LG10-05 21. uncultured marine eukaryote 17. uncultured freshwater UI11D07 eukaryote LG29-01 18. uncultured bicosoecid CH1_2A_3 19. Nerada mexicana 20. uncultured freshwater eukaryote LG30-01 21. uncultured eukaryotic picoplankton freshwater P34.6 112 1 uncultured stramenopile uncultured stramenopile BAQA72 PSC4SP2005 Bråte et al. Telonemia 4 122 1 NPK97_25, Svalbard NOR46.Telo.12 Arctic Ocean, 18S 2010 (phylum) Svalbard Fig. 
1 123 13 Svv.Telo.1 Lake Sværsvann, IND31.Telo.10 Indian Ocean Norway hotp1h2 Pacific Ocean, Hawaii DL-2-2 (EU078264) Freshwater IND31.Telo.8 Indian Ocean lake, Svalbard NOR26.35 Arctic Ocean, Svalbard Lut.Telo.2 Lake Lutvann, AD6S.06 Arctic Ocean, Svalbard Norway IND31.Telo.2 Indian Ocean Lut.Telo.21 Lake Lutvann, BL040126.Telo.36 Mediterranean Norway Sea, Spain Pol.Telo.10 Lake Pollen, BL040126.Telo.18 Mediterranean Norway Sea, Spain Svv.Telo.4 Lake Sværsvann, IND31.100 Indian Ocean Norway NOR46.11 Arctic Ocean, Svalbard Pol.Telo.3 Lake Pollen, Norway BL010625.25 Mediterranean Sea, BA4 Lake Bourget, France Spain Lut.Telo.15 Lake Lutvann, BL010625.26 Mediterranean Sea, Norway Spain Lut.Telo.13 Lake Lutvann, ENVP21819.00002 Western North Norway Atlantic Pol.Telo.7 Lake Pollen, Norway Lut.Telo.23 Lake Lutvann, Norway Pol.Telo.1 Lake Pollen, Norway 124 6 Lut.Sed.50.3 Lake Lutvann, M218G12 Mariager Fjord, Norway Denmark Lut.Sed.5.5 Lake Lutvann, SA14A6 Framvaren Fjord, Norway Norway 149

Lut.Sed.20.2 Lake Lutvann, NA21F1 Framvaren Fjord, Norway Norway Lut.Sed.5.2 Lake Lutvann, SA12A10 Framvaren Fjord, Norway Norway Lut.Sed.20.9 Lake Lutvann, XMCF11 Xiamen Islands, China Norway RA010412.17 English Channel, Lut.Sed.20.8 Lake Lutvann, France Norway 125 1 DGGE band 20 Hyperhaline RA000412.136 English Channel, Lake, Chile France Panek et al. Heterlobosea 2 120 2 Sawyeria marylandensis NY0199 18S 2012 (Excavata (AF439351) EVROS2N Fig. 12 class) CIZOV2 121 2 IND8 Monopylocystis visvesvarai VT1 (AF011463)

B) Description of methods for data collation, including literature search terms employed and explanation of sequence inclusion/exclusion criteria

Searches were conducted between June 2010 and July 2013 using search terms relating to habitat transitions (below). Additional searches were conducted in September 2014 for phylogenies and habitat information of specific taxonomic groups in order to broaden the phylogenetic scope of transitions included. We did not include a study if the habitat information was ambiguous or if there were multiple conflicting phylogenetic hypotheses that would influence the sister clade pairings.

Search terms for source studies

The Web of Science (Thomson Reuters) database was searched:

June and September 2010
Topics searched:
marine and freshwater and (“habitat transition” or “molecular evolution”)
marine and freshwater and lake and (“habitat transition” or “molecular evolution”)

September 2010
Topics searched:
freshwater and (saline or hypersaline) and (“molecular evolution” or “habitat transition”)
freshwater and (saline or hypersaline) and lake and (“molecular evolution” or “habitat transition”)

May 2013 to July 2013
Topic: (phylogen* or “molecular systematic*”) and (marine or saline) and (freshwater) and ("habitat shift" or transition)

Additional literature was accessed through the reference lists of the articles recovered using the above searches and by seeking articles that have more recently cited the most relevant papers recovered (using the literature cited tool available in Web of Science).

September 2014
Targeted searches were performed to obtain phylogenies and habitat information for taxonomic groups where transitions have been mentioned in the literature, but where previous searches did not locate papers having habitat transitions mapped in those groups.

Taxon inclusion criteria

We considered only organisms that are exclusively aquatic, excluding those that live on the surface of the water or that can also live on land.

We performed interspecific comparisons; however, to maximize the sample size of organisms per clade, we included more than one sequence within the same morphologically assigned species name if the genetic divergences were typical of interspecific divergences found elsewhere in the tree. In the case of unicellular organisms, species names were often not present, and species boundaries are difficult to define; each tip in the phylogeny as recorded by the study authors was considered a candidate for inclusion.

Exclusion criteria for genetic data in a sister pair

The minimum exclusion criteria for genes for a given sister pair were as follows.

1) For relative overall substitution rates (OSRs), we did not include a given gene if either the FW or SAL rate had low or no variation (output of 0.0001 relative to a background rate set at 1). We also examined the distribution of relative rate results and excluded outlying genes (described below); here, outliers are defined as those genes for which the FW vs. SAL lineages exhibited a 7-fold or greater difference, as the distribution of relative rates was relatively continuous until reaching that magnitude of difference (see Figure Ch2_S1_1 below).

2) For dN/dS ratios, genes were considered as having information if either the FW or SAL comparisons had dN or dS results greater than 0.0001. Extreme dN/dS ratios can be produced when there are estimated non-synonymous substitutions but no synonymous substitutions; however, as information was present in the dN or dS rates and those rates were concatenated across genes, we included all comparisons with any changes in either the dN or dS category.

OSR and dN/dS exclusion criteria were applied in order to avoid genes lacking information and those producing extreme rate results.

Without the availability of absolute branch lengths for OSR results, we could not meet the requirements for the test described by Welch and Waxman (2008) for determining which comparisons to remove due to low information content. Nevertheless, through application of the criteria described above, low-information comparisons were reduced. Furthermore, we used binomial tests in our analyses of overall patterns of these summarized results, which consider direction and not magnitude.

Exclusion criteria for genetic data in a sister pair determined by distribution of relative overall substitution rates (OSRs)

One hundred and fifty independent habitat comparisons were analyzed (Table Ch2_S1_1).

In total, 435 sister pairs of relative rates for single genes were generated through our overall substitution rate (OSR) analysis. We then excluded 11 gene comparisons for which both lineages exhibited zero branch length and 22 containing one lineage with zero branch length (as described above). Of the remaining 402 pairs (Figure Ch2_S1_1, absolute relative rates), we removed 6 pairs (in orange) at the tail end of the distribution with ratios greater than 7 times unity (1:1) in order to avoid more extreme relative rates contributing to the overall summarized result. Two data points (in blue) represent sister comparisons that were not excluded, as the comparisons were represented by other genes of the same category, and their inclusion or exclusion did not change the direction of the summarized relative rate for that gene category or our overall results. In total, 396 pairs of relative rates were included in further analysis. These remaining OSR data points represented 148 independent habitat comparisons.
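The OSR exclusion steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the constants, and the toy rates are hypothetical and are not thesis data, and the fold difference is assumed to be the larger rate divided by the smaller. The sketch does not model the two retained (blue) data points.

```python
# Sketch only: names and toy values are hypothetical, not thesis data.
FLOOR = 0.0001   # rate output treated as "low or no variation"
MAX_FOLD = 7.0   # pairs differing 7-fold or more are treated as outliers


def keep_osr_pair(fw_rate, sal_rate):
    """Return True if a single-gene sister pair passes both OSR criteria."""
    # Criterion 1: both lineages must carry information.
    if fw_rate <= FLOOR or sal_rate <= FLOOR:
        return False
    # Criterion 2 (assumed definition): larger rate divided by smaller rate.
    fold = max(fw_rate, sal_rate) / min(fw_rate, sal_rate)
    return fold < MAX_FOLD


toy_pairs = [(1.2, 0.9), (0.0001, 0.8), (3.5, 0.4), (0.0001, 0.0001)]
kept = [pair for pair in toy_pairs if keep_osr_pair(*pair)]

# Bookkeeping reported in the text: 435 pairs, minus 11 and 22 zero-length
# cases, leaves 402; removing 6 outlying pairs leaves 396.
remaining = 435 - 11 - 22
included = remaining - 6
```

Only the first toy pair survives: the second and fourth contain a floor-level rate, and the third differs by more than 7-fold.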

Originally, 248 sister pairs of individual protein-coding genes were available for dN/dS analysis. However, this number was reduced to 236 pairs after applying the exclusion criteria (excluding those with neither dN nor dS substitutions). These represented 71 evolutionarily independent habitat transitions.

Figure Ch2_S1_1. Distribution of 402 pairs of OSR relative rate ratios (difference from 1:1 ratio between sister lineages shown), for all sister pairs for which both lineages had branch lengths greater than zero. Those data points in orange were excluded from the summarized results.

[Figure Ch2_S1_1 plot. Vertical axis: relative overall substitution rate ratios (absolute value, difference from 1), scaled 0 to 20. Horizontal axis: numbered sister pairs, with each gene analyzed separately, scaled 0 to 400.]


C) Tables containing individual gene relative rates; figure presenting dN/dS ratios and dN and dS relative rates; and additional discussion of results

Table Ch2_S1_2. Summary of individual gene relative rates for the Overall Substitution Rates (OSRs) analysis. In total, 396 pairs of gene relative rate ratios were included across 30 genes and 148 phylogenetically independent habitat transitions. If considering each gene alone, each data point would represent an independent comparison. Those genes represented by six or more sister pairs (yellow cells) were tested for a directional association with habitat by binomial test; no individual gene showed a significant pattern. The greatest differences were observed for the mitochondrial protein-coding gene ND2, with 10 of 13 comparisons freshwater (FW) > saline (SAL) (uncorrected p=0.092), and for the nuclear protein-coding gene Rag2, with 16 of 23 comparisons FW>SAL (uncorrected p=0.093). After correction for multiple testing across genes via sequential Bonferroni, no gene showed a significant directional pattern. The FW rate was more often greater than the paired SAL rate for most genes (19 of the 25 genes without an equal split); however, note that comparisons across genes are not independent, as multiple genes are associated with the same taxa. Note that in this table there is overlap in sequence regions in the case of ‘Rag1whole’ with Rag1 parts 1, 2, and 3. However, within each source study, the gene sequences do not overlap; therefore, the sequence data summarized for each sister pair are non-overlapping. Mit = mitochondrial, Nuc = nuclear, Chlor = chloroplast, PC = protein-coding, NC = non-coding; FW = freshwater, SAL = marine or continental saline.


Gene category | Gene | FW>SAL | SAL>FW | # comparisons | FW more often higher?
Mit PC | COI | 13 | 9 | 22 | TRUE
Mit PC | CytB | 29 | 24 | 53 | TRUE
Mit PC | COIII | 2 | 1 | 3 | TRUE
Mit PC | ATP6 | 8 | 9 | 17 | FALSE
Mit PC | ATP8 | 10 | 6 | 16 | TRUE
Mit PC | ND1 | 3 | 0 | 3 | TRUE
Mit PC | ND2 | 10 | 3 | 13 | TRUE (uncorrected p=0.092)
Mit PC | ND3 | 2 | 1 | 3 | TRUE
Mit PC | ND4 | 2 | 1 | 3 | TRUE
Mit PC | ND4L | 1 | 2 | 3 | FALSE
Mit PC | ND5 | 2 | 1 | 3 | TRUE
Mit NC | 12S | 7 | 11 | 18 | FALSE
Mit NC | 16S | 24 | 22 | 46 | TRUE
Mit NC | MT control | 1 | 0 | 1 | TRUE
Nuc PC | Rag1_1 | 2 | 3 | 5 | FALSE
Nuc PC | Rag1_2 | 3 | 2 | 5 | TRUE
Nuc PC | Rag1_3 | 1 | 1 | 2 | EQUAL
Nuc PC | Rag1(whole) | 11 | 8 | 19 (+ 1 no difference) | TRUE
Nuc PC | Rag2 | 16 | 7 | 23 | TRUE (uncorrected p=0.093)
Nuc PC | Tmo | 5 | 5 | 10 | EQUAL
Nuc PC | MYH6 | 3 | 1 | 4 | TRUE
Nuc PC | Gylt | 2 | 1 | 3 | TRUE
Nuc PC | H3 | 2 | 1 | 3 | TRUE
Nuc NC | 18S | 38 | 38 | 76 | EQUAL
Nuc NC | 28S | 11 | 9 | 20 | TRUE
Nuc NC | Rag1_Int | 3 | 3 | 6 | EQUAL
Nuc NC | Rag1_Int2 | 0 | 2 | 2 | FALSE
Nuc NC | ITS2 | 1 | 0 | 1 | TRUE
Chlor PC | Psbc | 3 | 3 | 6 | EQUAL
Chlor PC | RbcL | 2 | 4 | 6 | FALSE
Total | 30 loci | | | 396 comparisons | 19 TRUE, 6 FALSE (p=0.015)
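The directional p-values in this table can be checked with an exact two-sided binomial (sign) test against an expectation of 0.5. The following Python is a minimal sketch; the function name is hypothetical, and it is assumed (not stated in the thesis) that the reported p-values are exact two-sided values with ties excluded.

```python
from math import comb


def binomial_two_sided(k, n):
    """Exact two-sided binomial (sign) test p-value against p = 0.5."""
    # Probability of a split at least as lopsided as k of n, in one tail.
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail (capped at 1).
    return min(1.0, 2 * tail)


p_nd2 = binomial_two_sided(10, 13)    # ND2: 10 of 13 pairs FW > SAL
p_rag2 = binomial_two_sided(16, 23)   # Rag2: 16 of 23 pairs FW > SAL
p_genes = binomial_two_sided(19, 25)  # 19 of 25 genes more often FW > SAL
```

Under these assumptions the three values round to 0.092, 0.093, and 0.015, matching the figures given in the table and its caption. Because only the direction of each comparison enters the test, the magnitude of the rate difference plays no role.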


Table Ch2_S1_3. Summary of dN/dS ratios for individual genes. In total, 236 pairs of gene dN/dS ratios were included across 21 genes representing 71 phylogenetically independent habitat transitions. Note that some dN/dS ratios were equal between the FW and SAL categories but contained non-zero dN and dS rates; the total numbers of comparisons including these are given in brackets. The Rag2 gene tended to show higher dN/dS ratios in the FW lineages (uncorrected p=0.052), but this trend did not remain after Bonferroni correction; no gene had a significant directional pattern.

Gene category | Gene | FW>SAL | SAL>FW | Total comparisons | FW more often higher?
Mit PC | COI | 10 | 9 | 19 (21) | TRUE
Mit PC | CytB | 25 | 28 | 53 | FALSE
Mit PC | COIII | 3 | 0 | 3 | TRUE
Mit PC | ATP6 | 11 | 6 | 17 | TRUE
Mit PC | ATP8 | 10 | 7 | 17 | TRUE
Mit PC | ND1 | 3 | 0 | 3 | TRUE
Mit PC | ND2 | 9 | 3 | 12 | TRUE (uncorrected p=0.15)
Mit PC | ND3 | 2 | 1 | 3 | TRUE
Mit PC | ND4 | 3 | 0 | 3 | TRUE
Mit PC | ND4L | 2 | 1 | 3 | TRUE
Mit PC | ND5 | 1 | 2 | 3 | FALSE
Nuc PC | Rag1_1 | 1 | 3 | 4 (6) | FALSE
Nuc PC | Rag1_2 | 3 | 4 | 7 (9) | FALSE
Nuc PC | Rag1_3 | 1 | 1 | 2 | EQUAL
Nuc PC | Rag1(whole) | 8 | 9 | 17 (20) | FALSE
Nuc PC | Rag2 | 16 | 6 | 22 (24) | TRUE (uncorrected p=0.052)
Nuc PC | Tmo | 5 | 5 | 10 (12) | EQUAL
Nuc PC | MYH6 | 2 | 4 | 6 (9) | FALSE
Nuc PC | Gylt | 2 | 1 | 3 (4) | TRUE
Chlor PC | Psbc | 2 | 4 | 6 | FALSE
Chlor PC | RbcL | 3 | 2 | 5 (6) | TRUE
Total | 21 loci | | | 236 comparisons | 12 TRUE, 7 FALSE (p=0.36)
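To illustrate why the Rag2 trend does not survive correction, the sketch below applies a Holm-type sequential Bonferroni adjustment to exact binomial p-values computed from the counts in this table. Several points are assumptions rather than the thesis procedure: the function names are hypothetical, and the family of tests is taken to be the genes with six or more untied pairs (borrowing the six-pair threshold stated for the OSR table).

```python
from math import comb


def binomial_two_sided(k, n):
    """Exact two-sided binomial (sign) test p-value against p = 0.5."""
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def holm_adjust(pvalues):
    """Holm (sequential Bonferroni) adjusted p-values, keyed like the input."""
    ordered = sorted(pvalues, key=pvalues.get)
    m = len(ordered)
    adjusted, running_max = {}, 0.0
    for rank, name in enumerate(ordered):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[name])
        running_max = max(running_max, candidate)  # keep values monotone
        adjusted[name] = running_max
    return adjusted


# (FW>SAL, SAL>FW) counts for genes with six or more untied pairs (assumed family).
counts = {"COI": (10, 9), "CytB": (25, 28), "ATP6": (11, 6), "ATP8": (10, 7),
          "ND2": (9, 3), "Rag1_2": (3, 4), "Rag1(whole)": (8, 9),
          "Rag2": (16, 6), "Tmo": (5, 5), "MYH6": (2, 4), "Psbc": (2, 4)}
raw = {g: binomial_two_sided(fw, fw + sal) for g, (fw, sal) in counts.items()}
adjusted = holm_adjust(raw)
significant = [g for g, p in adjusted.items() if p < 0.05]
```

With this assumed family of 11 tests, the smallest raw p-value (Rag2, about 0.052) is multiplied by 11, so no gene remains below 0.05. This agrees with the statement that no gene had a significant directional pattern.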


Figure Ch2_S1_2. Relative dN/dS ratios, dN rates, and dS rates across 71 comparisons, including branch lengths (calculated from the dN and dS rates). The boxes represent the middle two quartiles of relative rates and the tailed lines represent the full range of observed data.

No relationship between ancestral habitat and overall rates

There was no significant trend of either habitat having higher relative rates when the comparisons were divided based on direction of the habitat shift (FW->SAL or SAL->FW directions). Results in parentheses are presented as: number of comparisons with SAL (saline clade) rate greater (positive direction), number of comparisons with FW (freshwater clade) rate greater (negative direction), (number of sister pairs exhibiting equal rates, if present), total N used for binomial test, p-value from binomial test. For OSRs, the FW->SAL direction (5, 11, 16, p=0.21) and SAL->FW direction (49, 55, 104, p=0.62) each did not show habitat-specific trends in relative rates. Similarly, for dN/dS ratios, the FW->SAL direction (4, 3, 7, p=1.0) and SAL->FW direction (26, 31, 57, p=0.60) each did not show habitat-specific trends in relative rates. There was no difference between the proportions of FW>SAL or SAL>FW relative rates between the two habitat shift directions (Fisher’s exact test: OSRs p=0.29; dN/dS rates p=0.70).
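The comparison between shift directions can be illustrated with a two-sided Fisher's exact test on the 2x2 counts quoted above. This Python sketch is not the software used in the thesis; the function name is hypothetical, and the two-sided p-value is assumed to be the sum of the probabilities of all tables no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed table.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, total)


# Rows: shift direction (FW->SAL, SAL->FW); columns: SAL greater, FW greater.
p_osr = fisher_exact_two_sided(5, 11, 49, 55)
p_dnds = fisher_exact_two_sided(4, 3, 26, 31)
```

Under that definition the two results are close to the reported p=0.29 (OSRs) and p=0.70 (dN/dS).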


Some difference in behavior of molecular measures for total substitution rates

212 individual sister clade gene results had both relative overall substitution rates (OSRs) (PAML baseml) and relative 'branch length' information (calculated from dN and dS rates in PAML codeml) (in SM Ch2_S2 Table Ch2_S2_1). Of these, 166 were in the same direction (FW>SAL, or SAL>FW) between the two methods, giving a 78% concordance in direction (reported in main manuscript). 109 individual gene results had the exact same sequence length included in each method, i.e. providing a reduced but more direct comparison between methods. Of these, 80 were in the same direction (FW>SAL, or SAL>FW) between the two methods, giving a 73% concordance in direction. When relative rates of a minimum of 20% rate difference between habitats for either molecular measure were included, the concordance in direction increased to only 80% in the first case, and 79% in the second case.
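Directional concordance as described here is simply the share of gene results for which both measures fall on the same side of a 1:1 ratio. The sketch below shows the calculation; the function name and the toy ratios are hypothetical, and a ratio of exactly 1 is arbitrarily treated as not greater than 1.

```python
def direction_concordance(osr_ratios, branch_ratios):
    """Fraction of paired results whose FW:SAL ratios lie on the same side of 1."""
    pairs = list(zip(osr_ratios, branch_ratios))
    same = sum((a > 1) == (b > 1) for a, b in pairs)
    return same / len(pairs)


# Hypothetical FW:SAL ratios for four gene results (not thesis data).
toy = direction_concordance([1.4, 0.7, 1.1, 0.9], [1.2, 0.8, 0.95, 0.6])

# Counts reported in the text, expressed as whole percentages.
all_results = round(100 * 166 / 212)
same_length = round(100 * 80 / 109)
```

Three of the four toy results agree in direction, and the reported counts give 78% and 73% respectively.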


Supplementary Material Ch4_S1

Details of input data and results obtained through DAVID analysis, PANTHER analysis, and relaxed selection analysis. Lists of genes and p values for positive selection analysis are provided in Supplementary Material Ch4_S2.

Table Ch4_S1_1. Over- and under-representation of PANTHER Biological Process categories in the set of 1476 genes as compared with the background genes from the Drosophila melanogaster genome. The 1476 genes were mapped to 1438 IDs. Child (sub-categorical) processes are indented below parent processes. Only categories with p<0.05 are shown. Corresponding DAVID Gene Ontology results are given in SM Ch4_S2 Table Ch4_S2_1 due to size.

13690 genes (background); 1438 genes (mapped set)

PANTHER GO-Slim Biological Process | # in category | Expected # | Observed # | Over or under represented | P value | Fold enrichment
metabolic process (GO:0008152) | 4996 | 524.8 | 724 | + | 1.23E-24 | 1.38
primary metabolic process (GO:0044238) | 4149 | 435.8 | 618 | + | 4.88E-22 | 1.42
nucleobase-containing compound metabolic process (GO:0006139) | 1694 | 177.9 | 289 | + | 1.68E-14 | 1.62
RNA metabolic process (GO:0016070) | 1039 | 109.1 | 189 | + | 4.53E-11 | 1.73
rRNA metabolic process (GO:0016072) | 92 | 9.7 | 28 | + | 2.04E-04 | 2.9
DNA metabolic process (GO:0006259) | 232 | 24.4 | 45 | + | 1.91E-02 | 1.85
DNA replication (GO:0006260) | 86 | 9.0 | 24 | + | 4.62E-03 | 2.66
RNA splicing (GO:0008380) | 93 | 9.8 | 26 | + | 2.10E-03 | 2.66
RNA splicing, via transesterification reactions (GO:0000375) | 90 | 9.5 | 25 | + | 3.41E-03 | 2.64
localization (GO:0051179) | 1462 | 153.6 | 200 | + | 1.51E-02 | 1.3
protein localization (GO:0008104) | 70 | 7.4 | 19 | + | 4.37E-02 | 2.58
biological regulation (GO:0065007) | 1560 | 163.9 | 214 | + | 6.93E-03 | 1.31
protein metabolic process (GO:0019538) | 1813 | 190.4 | 305 | + | 1.15E-14 | 1.6
translation (GO:0006412) | 387 | 40.7 | 97 | + | 2.88E-12 | 2.39
regulation of translation (GO:0006417) | 148 | 15.6 | 37 | + | 4.34E-04 | 2.38
vesicle-mediated transport (GO:0016192) | 411 | 43.2 | 83 | + | 5.23E-06 | 1.92
exocytosis (GO:0006887) | 97 | 10.2 | 24 | + | 2.84E-02 | 2.36
cellular component organization or biogenesis (GO:0071840) | 672 | 70.6 | 136 | + | 1.29E-10 | 1.93
cellular component biogenesis (GO:0044085) | 215 | 22.6 | 53 | + | 4.85E-06 | 2.35
mRNA processing (GO:0006397) | 189 | 19.9 | 41 | + | 3.55E-03 | 2.07
mRNA splicing, via spliceosome (GO:0000398) | 122 | 12.8 | 28 | + | 2.89E-02 | 2.18
biosynthetic process (GO:0009058) | 333 | 35.0 | 70 | + | 1.51E-05 | 2
cellular component organization (GO:0016043) | 576 | 60.5 | 105 | + | 1.27E-05 | 1.74
organelle organization (GO:0006996) | 302 | 31.7 | 63 | + | 8.61E-05 | 1.99
nitrogen compound metabolic process (GO:0006807) | 557 | 58.5 | 114 | + | 5.55E-09 | 1.95
cellular protein modification process (GO:0006464) | 681 | 71.5 | 128 | + | 6.90E-08 | 1.79
protein phosphorylation (GO:0006468) | 266 | 27.9 | 51 | + | 9.10E-03 | 1.83
protein transport (GO:0015031) | 555 | 58.3 | 109 | + | 1.49E-07 | 1.87
intracellular protein transport (GO:0006886) | 537 | 56.4 | 100 | + | 9.89E-06 | 1.77
cellular process (GO:0009987) | 2878 | 302.3 | 458 | + | 1.32E-19 | 1.52
cell cycle (GO:0007049) | 565 | 59.4 | 102 | + | 3.08E-05 | 1.72
regulation of catalytic activity (GO:0050790) | 478 | 50.2 | 77 | + | 3.92E-02 | 1.53
Unclassified (UNCLASSIFIED) | 7319 | 768.8 | 542 | - | 0.00E+00 | 0.71
multicellular organismal process (GO:0032501) | 488 | 51.3 | 23 | - | 1.13E-03 | 0.45
single-multicellular organism process (GO:0044707) | 488 | 51.3 | 23 | - | 1.13E-03 | 0.45
system process (GO:0003008) | 421 | 44.2 | 20 | - | 5.75E-03 | 0.45
neurological system process (GO:0050877) | 391 | 41.1 | 18 | - | 6.87E-03 | 0.44
reproduction (GO:0000003) | 173 | 18.2 | 4 | - | 1.31E-02 | 0.22
gamete generation (GO:0007276) | 156 | 16.4 | 3 | - | 1.22E-02 | < 0.2
phosphate ion transport (GO:0006817) | 115 | 12.1 | 1 | - | 1.36E-02 | < 0.2
extracellular transport (GO:0006858) | 147 | 15.4 | 1 | - | 5.77E-04 | < 0.2

Location of flight losses for nuclear gene positive selection analysis

Qualification of how well the species sampling represents each flight loss (Table Ch4_S1_2): ‘good’ signifies that the lineage split accurately represents the loss (as in cases of order-level losses); ‘moderate’ (listed as ‘close’ in the table) signifies that the lineage contains a loss but the lineage split does not correspond directly to that loss (e.g. a family-level loss but superfamily-level splits); and ‘approximate’ signifies that a loss does occur but the lineage does not accurately represent it. In the case of Phasmatodea, many losses occur within the order but only flightless species were available; in the case of the Hemiptera Xenophysella lineage, a loss occurs at or within the family but the lineage split is much higher.


Table Ch4_S1_2. Summary of flight losses and data analysed. Newick trees and gene lists are given in Supplementary Material (SM) Ch2_S2 (Excel file). Loss number corresponds to ordering of losses appearing in Figure 4.1 from the top to bottom of the tree.

Loss # | Lineage in Figure 4.1 | Type of loss | # species in tree | # genes | Loss accuracy (1=good, 2=close, 3=approximate)
1 | Orthoptera lineage Prosarthria | Both sexes | 17 | 584 | 2. Close – Family-level loss
2 | Orthoptera lineage Ceuthophilus | Both sexes | 17 | 584 | 2. Close – Family-level loss
3 | Mantophasmatodea and Grylloblattaria | Both sexes | 17 | 584 | 1. Good – Potential above order-level loss
4 | Embioptera | Female only | 17 | 584 | 1. Good – Order-level loss
5 | Phasmatodea | Both sexes (mostly) | 17 | 584 | 3. Not true loss here; many losses within order (Whiting et al. 2003, Stone and French 2003)
6 | Hemiptera lineage Planococcus | Female only | 13 | 697 | 2. Close – Family-level loss
7 | Hemiptera lineage Xenophysella | Both sexes | 13 | 697 | 3. Less accurate – loss within family (all species in family are flightless except 1 species)
8 | Clade within Psocodea | Both sexes* | 14 | 1174 | 2. Unclear where flight loss occurred within Psocodea - 3 species in clade are flightless
9 | Strepsiptera | Female only* | 14 | 997 | 1. Good – Order-level loss
10 | Mecoptera lineage Boreus | Both sexes | 12 | 642 | 1. Good – Order-level loss
11 | Siphonaptera | Both sexes* | 12 | 642 | 1. Good – Order-level loss
*Flight loss co-occurring with parasitism


Table Ch4_S1_3. Gene Ontology (GO) categories from DAVID analysis of positively selected genes in the lineage (‘P’) leading to Pterygota; A) GO terms for all positively selected genes, and B) GO terms for those positively selected genes uniquely detected in ‘P’ lineage and not in control lineages tested.

A) 126 candidate genes with 119 DAVID IDs (914 total genes; 119 positively selected genes)

GO term | # in category | Expected # (a) | Observed # | Genes (Flybase IDs) | P value | Fold enrichment (b)
compositionally biased region:Ser-rich | 15 | 2.0 | 8 | FBGN0000568, FBGN0002183, FBGN0005771, FBGN0003977, FBGN0250786, FBGN0037855, FBGN0037109, FBGN0002431 | 0.00027 | 4.10
GO:0004386~helicase activity | 16 | 2.1 | 8 | FBGN0020633, FBGN0030855, FBGN0036018, FBGN0250786, FBGN0010421, FBGN0032690, FBGN0037232, FBGN0015929 | 0.0015 | 3.84
IPR011990:Tetratricopeptide-like helical | 19 | 2.5 | 8 | FBGN0031020, FBGN0260749, FBGN0028475, FBGN0027496, FBGN0037022, FBGN0032470, FBGN0037855, FBGN0036828 | 0.0058 | 3.23
GO:0046914~transition metal ion binding | 90 | 11.7 | 20 | FBGN0033038, FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0017567, FBGN0039252, FBGN0031060, FBGN0025635, FBGN0002431, FBGN0000568, FBGN0038769, FBGN0005771, FBGN0027951, FBGN0031093, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0051120, FBGN0034728 | 0.0075 | 1.71
GO:0046872~metal ion binding | 113 | 14.7 | 23 | FBGN0033038, FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0017567, FBGN0039252, FBGN0025635, FBGN0031060, FBGN0002431, FBGN0000568, FBGN0005771, FBGN0038769, FBGN0034072, FBGN0027951, FBGN0000709, FBGN0031093, FBGN0031252, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0051120, FBGN0034728 | 0.011 | 1.56
GO:0043169~cation binding | 114 | 14.8 | 23 | FBGN0033038, FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0017567, FBGN0039252, FBGN0025635, FBGN0031060, FBGN0002431, FBGN0000568, FBGN0005771, FBGN0038769, FBGN0034072, FBGN0027951, FBGN0000709, FBGN0031093, FBGN0031252, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0051120, FBGN0034728 | 0.012 | 1.55
GO:0043167~ion binding | 114 | 14.8 | 23 | FBGN0033038, FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0017567, FBGN0039252, FBGN0025635, FBGN0031060, FBGN0002431, FBGN0000568, FBGN0005771, FBGN0038769, FBGN0034072, FBGN0027951, FBGN0000709, FBGN0031093, FBGN0031252, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0051120, FBGN0034728 | 0.012 | 1.55
GO:0008270~zinc ion binding | 76 | 9.9 | 17 | FBGN0033038, FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0039252, FBGN0031060, FBGN0025635, FBGN0002431, FBGN0000568, FBGN0038769, FBGN0005771, FBGN0027951, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0034728 | 0.015 | 1.72
IPR015880:Zinc finger, C2H2-like | 9 | 1.2 | 5 | FBGN0005771, FBGN0027951, FBGN0013717, FBGN0261064, FBGN0025635 | 0.018 | 4.27
SM00355:ZnF_C2H2 | 9 | 1.2 | 5 | FBGN0005771, FBGN0027951, FBGN0013717, FBGN0261064, FBGN0025635 | 0.023 | 4.27
repeat:TPR 2 | 3 | 0.4 | 3 | FBGN0038300, FBGN0032470, FBGN0037855 | 0.030 | 7.68
repeat:TPR 1 | 3 | 0.4 | 3 | FBGN0038300, FBGN0032470, FBGN0037855 | 0.030 | 7.68
GO:0070035~purine NTP-dependent helicase activity | 11 | 1.4 | 5 | FBGN0030855, FBGN0250786, FBGN0010421, FBGN0032690, FBGN0037232 | 0.036 | 3.49
GO:0008026~ATP-dependent helicase activity | 11 | 1.4 | 5 | FBGN0030855, FBGN0250786, FBGN0010421, FBGN0032690, FBGN0037232 | 0.036 | 3.49
IPR013026:Tetratricopeptide region | 16 | 2.1 | 6 | FBGN0031020, FBGN0260749, FBGN0037022, FBGN0032470, FBGN0037855, FBGN0036828 | 0.041 | 2.88
GO:0003678~DNA helicase activity | 7 | 0.9 | 4 | FBGN0020633, FBGN0010421, FBGN0037232, FBGN0015929 | 0.044 | 4.39
zinc | 62 | 8.1 | 14 | FBGN0033038, FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0031060, FBGN0039252, FBGN0002431, FBGN0000568, FBGN0005771, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0034728 | 0.044 | 1.73
IPR001440:Tetratricopeptide TPR-1 | 12 | 1.6 | 5 | FBGN0031020, FBGN0260749, FBGN0037022, FBGN0032470, FBGN0037855 | 0.053 | 3.20
helicase | 12 | 1.6 | 5 | FBGN0030855, FBGN0250786, FBGN0010421, FBGN0032690, FBGN0037232 | 0.057 | 3.20
GO:0016071~mRNA metabolic process | 44 | 5.7 | 10 | FBGN0003742, FBGN0003977, FBGN0035294, FBGN0034923, FBGN0011224, FBGN0022942, FBGN0037371, FBGN0032690, FBGN0035872, FBGN0036828 | 0.067 | 1.75
IPR007087:Zinc finger, C2H2-type | 8 | 1.0 | 4 | FBGN0005771, FBGN0027951, FBGN0261064, FBGN0025635 | 0.068 | 3.84
IPR019734:Tetratricopeptide repeat | 13 | 1.7 | 5 | FBGN0031020, FBGN0260749, FBGN0037022, FBGN0032470, FBGN0037855 | 0.069 | 2.95
GO:0046395~carboxylic acid catabolic process | 4 | 0.5 | 3 | FBGN0028479, FBGN0032287, FBGN0025352 | 0.073 | 5.76
GO:0016054~organic acid catabolic process | 4 | 0.5 | 3 | FBGN0028479, FBGN0032287, FBGN0025352 | 0.073 | 5.76
GO:0016042~lipid catabolic process | 4 | 0.5 | 3 | FBGN0028479, FBGN0025352, FBGN0010591 | 0.073 | 5.76
GO:0044242~cellular lipid catabolic process | 4 | 0.5 | 3 | FBGN0028479, FBGN0025352, FBGN0010591 | 0.073 | 5.76
GO:0006397~mRNA processing | 39 | 5.1 | 9 | FBGN0003742, FBGN0003977, FBGN0035294, FBGN0011224, FBGN0022942, FBGN0037371, FBGN0032690, FBGN0035872, FBGN0036828 | 0.082 | 1.77
SM00028:TPR | 13 | 1.7 | 5 | FBGN0031020, FBGN0260749, FBGN0037022, FBGN0032470, FBGN0037855 | 0.085 | 2.95
GO:0012505~endomembrane system | 35 | 4.6 | 8 | FBGN0027496, FBGN0040087, FBGN0034277, FBGN0014075, FBGN0027868, FBGN0032052, FBGN0010591, FBGN0022213 | 0.086 | 1.76
GO:0007444~imaginal disc development | 33 | 4.3 | 8 | FBGN0028992, FBGN0034072, FBGN0005771, FBGN0250786, FBGN0027594, FBGN0011224, FBGN0011739, FBGN0002431 | 0.087 | 1.86
GO:0051252~regulation of RNA metabolic process | 67 | 8.7 | 13 | FBGN0003742, FBGN0000568, FBGN0005771, FBGN0003977, FBGN0027951, FBGN0013717, FBGN0010520, FBGN0001978, FBGN0030093, FBGN0037109, FBGN0001324, FBGN0035872, FBGN0036828 | 0.089 | 1.49
GO:0000166~nucleotide binding | 137 | 17.8 | 23 | FBGN0003742, FBGN0020633, FBGN0023083, FBGN0034401, FBGN0030855, FBGN0026761, FBGN0036018, FBGN0036888, FBGN0034923, FBGN0011224, FBGN0035872, FBGN0032690, FBGN0032882, FBGN0032350, FBGN0250786, FBGN0260945, FBGN0010421, FBGN0260010, FBGN0028473, FBGN0034728, FBGN0011739, FBGN0037232, FBGN0015929 | 0.090 | 1.29
metal-binding | 69 | 9.0 | 14 | FBGN0034412, FBGN0013717, FBGN0261064, FBGN0038686, FBGN0017567, FBGN0031060, FBGN0002431, FBGN0000568, FBGN0005771, FBGN0031252, FBGN0036398, FBGN0001978, FBGN0029941, FBGN0034728 | 0.093 | 1.56
IPR012677:Nucleotide-binding, alpha-beta plait | 20 | 2.6 | 6 | FBGN0003742, FBGN0036018, FBGN0260010, FBGN0034923, FBGN0011224, FBGN0035872 | 0.096 | 2.30
tpr repeat | 9 | 1.2 | 4 | FBGN0260749, FBGN0038300, FBGN0032470, FBGN0037855 | 0.098 | 3.41

B) 39 unique candidate genes with 37 DAVID IDs (914 total genes; 37 positively selected genes)

GO term | # in category | Expected # (c) | Observed # | Genes (Flybase IDs) | P value | Fold enrichment (b)
GO:0070011~peptidase activity, acting on L-amino acid peptides | 25 | 1.0 | 5 | FBGN0013717, FBGN0005632, FBGN0031093, FBGN0031060, FBGN0039252 | 0.010 | 4.94
GO:0008233~peptidase activity | 27 | 1.1 | 5 | FBGN0013717, FBGN0005632, FBGN0031093, FBGN0031060, FBGN0039252 | 0.013 | 4.57
Protease | 17 | 0.7 | 4 | FBGN0013717, FBGN0005632, FBGN0031060, FBGN0039252 | 0.026 | 5.81
hydrolase | 96 | 3.9 | 9 | FBGN0032882, FBGN0020633, FBGN0013717, FBGN0005632, FBGN0031093, FBGN0031252, FBGN0032690, FBGN0031060, FBGN0039252 | 0.027 | 2.32
IPR012677:Nucleotide-binding, alpha-beta plait | 20 | 0.8 | 4 | FBGN0003742, FBGN0260010, FBGN0034923, FBGN0011224 | 0.039 | 4.94
GO:0000398~nuclear mRNA splicing, via spliceosome | 25 | 1.0 | 4 | FBGN0003742, FBGN0035294, FBGN0011224, FBGN0032690 | 0.049 | 3.95
GO:0000377~RNA splicing, via transesterification reactions with bulged adenosine as nucleophile | 25 | 1.0 | 4 | FBGN0003742, FBGN0035294, FBGN0011224, FBGN0032690 | 0.049 | 3.95
GO:0000375~RNA splicing, via transesterification reactions | 25 | 1.0 | 4 | FBGN0003742, FBGN0035294, FBGN0011224, FBGN0032690 | 0.049 | 3.95
GO:0016071~mRNA metabolic process | 44 | 1.8 | 5 | FBGN0003742, FBGN0035294, FBGN0034923, FBGN0011224, FBGN0032690 | 0.057 | 2.81
GO:0008237~metallopeptidase activity | 11 | 0.4 | 3 | FBGN0031093, FBGN0031060, FBGN0039252 | 0.057 | 6.74
GO:0006508~proteolysis | 48 | 1.9 | 5 | FBGN0013717, FBGN0005632, FBGN0031093, FBGN0031060, FBGN0039252 | 0.074 | 2.57
GO:0008380~RNA splicing | 30 | 1.2 | 4 | FBGN0003742, FBGN0035294, FBGN0011224, FBGN0032690 | 0.078 | 3.29
GO:0046914~transition metal ion binding | 90 | 3.6 | 7 | FBGN0038769, FBGN0013717, FBGN0031093, FBGN0017567, FBGN0031060, FBGN0039252, FBGN0025635 | 0.093 | 1.92
GO:0046872~metal ion binding | 113 | 4.6 | 8 | FBGN0038769, FBGN0013717, FBGN0031093, FBGN0031252, FBGN0017567, FBGN0031060, FBGN0039252, FBGN0025635 | 0.097 | 1.75
GO:0008038~neuron recognition | 3 | 0.1 | 2 | FBGN0023083, FBGN0250753 | 0.099 | 16.47
GO:0008037~cell recognition | 3 | 0.1 | 2 | FBGN0023083, FBGN0250753 | 0.099 | 16.47

(a) Expected number of genes in the GO term was calculated by (‘# in category’/914)*119
(b) Fold enrichment was calculated by Observed #/Expected # of genes in that GO term
(c) Expected number of genes in the GO term was calculated by (‘# in category’/914)*37
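As a worked check of the table footnotes, the sketch below recomputes the expected count and fold enrichment for two rows. The function names are hypothetical. The tabulated fold values are reproduced when fold enrichment is taken as the observed count divided by the expected count, so that values above 1 indicate over-representation.

```python
def expected_count(n_in_category, n_background, n_candidates):
    """Expected hits if candidates were drawn at random from the background."""
    return n_in_category / n_background * n_candidates


def fold_enrichment(observed, expected):
    """Observed over expected; values above 1 indicate over-representation."""
    return observed / expected


# Ser-rich row of part A: 15 of 914 background genes, 8 of 119 candidates.
exp_a = expected_count(15, 914, 119)
fold_a = fold_enrichment(8, exp_a)

# Peptidase activity (GO:0008233) row of part B: 27 of 914, 5 of 37.
exp_b = expected_count(27, 914, 37)
fold_b = fold_enrichment(5, exp_b)
```

Rounded, these give an expected count of 2.0 with fold 4.10 for part A, and 1.1 with fold 4.57 for part B, as tabulated. The fold values are computed from the unrounded expected counts.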


Table Ch4_S1_4. Gene Ontology categories from DAVID analysis for genes detected to be under positive selection in 3 or more flight-loss lineages. P<0.05 results are also given in main manuscript. 58 candidate genes were mapped to 55 DAVID IDs; 1284 background genes were mapped to 1232 DAVID IDs.

1232 total genes; 55 candidate genes

Term | # in category | Expected # (a) | Observed # | Genes (Flybase Gene IDs) | P value | Fold enrichment (b)
GO:0003729~mRNA binding | 37 | 1.7 | 6 | FBGN0027617, FBGN0030631, FBGN0017453, FBGN0036018, FBGN0036317, FBGN0020660 | 0.014 | 3.63
compositionally biased region:Ser-rich | 21 | 0.9 | 4 | FBGN0005771, FBGN0023444, FBGN0037855, FBGN0002431 | 0.015 | 4.27
IPR012677:Nucleotide-binding, alpha-beta plait | 27 | 1.2 | 5 | FBGN0027617, FBGN0026372, FBGN0036018, FBGN0036317, FBGN0020660 | 0.026 | 4.15
zinc finger region:UBR-type | 2 | 0.1 | 2 | FBGN0030809, FBGN0002431 | 0.053 | 22.40
GO:0008270~zinc ion binding | 105 | 4.7 | 9 | FBGN0033038, FBGN0034412, FBGN0005771, FBGN0017453, FBGN0027086, FBGN0030809, FBGN0029941, FBGN0000139, FBGN0002431 | 0.053 | 1.92
GO:0006508~proteolysis | 55 | 2.5 | 6 | FBGN0033038, FBGN0261112, FBGN0030809, FBGN0023444, FBGN0028691, FBGN0002431 | 0.061 | 2.44
IPR013993:Zinc finger, N-recognin, metazoa | 2 | 0.1 | 2 | FBGN0030809, FBGN0002431 | 0.084 | 22.40
ubl conjugation pathway | 25 | 1.1 | 4 | FBGN0261112, FBGN0030809, FBGN0023444, FBGN0002431 | 0.091 | 3.58
IPR000504:RNA recognition motif, RNP-1 | 26 | 1.2 | 4 | FBGN0027617, FBGN0036018, FBGN0036317, FBGN0020660 | 0.097 | 3.45

(a) Expected number of genes in the GO term was calculated by (‘# in category’/1232)*55
(b) Fold enrichment was calculated by Observed #/Expected # of genes in that GO term


Table Ch4_S1_5. Gene Ontology categories from DAVID analysis for genes detected to be under positive selection in 3 or more fully flight-loss lineages (not including female-flightless lineages). 51 candidate genes were mapped to 48 DAVID IDs; 1246 background genes were mapped to 1196 DAVID IDs.

1196 total genes; 48 candidate genes

Term | # in category | Expected # (a) | Observed # | Genes | P value | Fold enrichment (b)
compositionally biased region:Ser-rich | 20 | 0.8 | 4 | FBGN0005771, FBGN0023444, FBGN0037855, FBGN0002431 | 0.014 | 4.98
GO:0003729~mRNA binding | 37 | 1.5 | 5 | FBGN0030631, FBGN0017453, FBGN0036018, FBGN0036317, FBGN0020660 | 0.034 | 3.37
GO:0008270~zinc ion binding | 102 | 4.1 | 8 | FBGN0034412, FBGN0005771, FBGN0017453, FBGN0027086, FBGN0030809, FBGN0029941, FBGN0000139, FBGN0002431 | 0.050 | 1.95
zinc finger region:UBR-type | 2 | 0.1 | 2 | FBGN0030809, FBGN0002431 | 0.054 | 24.92
ubl conjugation pathway | 23 | 0.9 | 4 | FBGN0261112, FBGN0030809, FBGN0023444, FBGN0002431 | 0.056 | 4.33
GO:0048749~compound eye development | 28 | 1.1 | 4 | FBGN0023444, FBGN0000097, FBGN0033179, FBGN0002431 | 0.071 | 3.56
IPR013993:Zinc finger, N-recognin, metazoa | 2 | 0.1 | 2 | FBGN0030809, FBGN0002431 | 0.075 | 24.92
IPR012677:Nucleotide-binding, alpha-beta plait | 27 | 1.1 | 4 | FBGN0026372, FBGN0036018, FBGN0036317, FBGN0020660 | 0.080 | 3.69
GO:0001654~eye development | 30 | 1.2 | 4 | FBGN0023444, FBGN0000097, FBGN0033179, FBGN0002431 | 0.084 | 3.32
SM00396:ZnF_UBR1 | 2 | 0.1 | 2 | FBGN0030809, FBGN0002431 | 0.097 | 24.92
GO:0019941~modification-dependent protein catabolic process | 32 | 1.3 | 4 | FBGN0261112, FBGN0030809, FBGN0023444, FBGN0002431 | 0.098 | 3.11

(a) Expected number of genes in the GO term was calculated by (‘# in category’/1196)*48
(b) Fold enrichment was calculated by Observed #/Expected # of genes in that GO term


Table Ch4_S1_6. PANTHER Biological Process categories for genes detected to be under positive selection in 3 or more fully flight-loss lineages (not including female-flightless). 51 candidate genes were mapped to 49 PANTHER IDs; 1246 background genes were mapped to 1217 PANTHER IDs. Child (sub-categorical) processes are indented below parent processes. Only categories with p<0.05 are shown.

| PANTHER GO-Slim Biological Process | # in category | Expected # | Observed # | Over or under represented | P value | Fold enrichment |
|---|---|---|---|---|---|---|
| primary metabolic process (GO:0044238) | 536 | 21.6 | 28 | + | 0.045 | 1.30 |
| nucleobase-containing compound metabolic process (GO:0006139) | 258 | 10.4 | 16 | + | 0.042 | 1.54 |
| DNA repair (GO:0006281) | 15 | 0.6 | 3 | + | 0.023 | 4.97 |
| rRNA metabolic process (GO:0016072) | 27 | 1.1 | 4 | + | 0.023 | 3.68 |
| cellular component organization or biogenesis (GO:0071840) | 119 | 4.8 | 9 | + | 0.046 | 1.88 |
| cellular component biogenesis (GO:0044085) | 48 | 1.9 | 6 | + | 0.012 | 3.10 |
| localization (GO:0051179) | 163 | 6.6 | 2 | - | 0.032 | 0.30 |
| transport (GO:0006810) | 151 | 6.1 | 2 | - | 0.048 | 0.33 |
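The over- and under-representation p-values in this table are approximately reproduced by a one-tailed binomial test, in which each candidate gene falls in a category with probability ‘# in category’/1217. The choice of test is an assumption made for illustration, not something stated in the table, and the function name below is hypothetical. A minimal sketch:

```python
from math import comb


def representation_test(n_observed, n_in_category, n_background, n_candidates):
    """One-tailed binomial test for over (+) or under (-) representation.

    Assumes each candidate gene independently falls in the category with
    probability n_in_category / n_background (an illustrative assumption).
    Returns the direction and the tail probability.
    """
    p = n_in_category / n_background
    expected = p * n_candidates

    def pmf(k):
        return comb(n_candidates, k) * p**k * (1 - p) ** (n_candidates - k)

    if n_observed >= expected:
        # Upper tail: probability of seeing at least n_observed genes.
        return "+", sum(pmf(k) for k in range(n_observed, n_candidates + 1))
    # Lower tail: probability of seeing at most n_observed genes.
    return "-", sum(pmf(k) for k in range(0, n_observed + 1))


# DNA repair: 15 of 1217 background genes, 3 observed among 49 candidates.
print(representation_test(3, 15, 1217, 49))
# localization: 163 of 1217 background genes, 2 observed among 49 candidates.
print(representation_test(2, 163, 1217, 49))
```

The two example rows give tail probabilities close to the tabulated 0.023 and 0.032. Small differences from the table are expected if the original analysis used different settings.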

Table Ch4_S1_7. DAVID Gene Ontology categories for genes detected to be under positive selection in 3 or more flight-loss lineages, with those genes removed that were majority detected in parasitic flight-loss lineages. 1284 background genes were mapped to 1232 DAVID IDs; 43 candidate genes were mapped to 40 DAVID IDs.

| Term | # in category (of 1232 total genes) | Expected # (a) | Observed # (of 40 candidate genes) | Genes | P value | Fold enrichment (b) |
|---|---|---|---|---|---|---|
| IPR012677:Nucleotide-binding, alpha-beta plait | 27 | 0.9 | 4 | FBGN0027617, FBGN0026372, FBGN0036018, FBGN0036317 | 0.048 | 4.56 |
| GO:0007059~chromosome segregation | 17 | 0.6 | 3 | FBGN0015391, FBGN0011573, FBGN0014010 | 0.086 | 5.44 |
| IPR016040:NAD(P)-binding domain | 17 | 0.6 | 3 | FBGN0030731, FBGN0028479, FBGN0261112 | 0.095 | 5.44 |
| GO:0007049~cell cycle | 86 | 2.8 | 6 | FBGN0015391, FBGN0037249, FBGN0023444, FBGN0011573, FBGN0086899, FBGN0014010 | 0.098 | 2.15 |

(a) Expected number of genes in the GO term was calculated by (‘# in category’/1232)*40.
(b) Fold enrichment was calculated by Observed#/Expected# of genes in that GO term.


Table Ch4_S1_8. PANTHER Biological Process categories for genes detected to be under positive selection in 3 or more flight-loss lineages, with those genes removed that were majority detected in parasitic flight-loss lineages. 1284 background genes were mapped to 1254 PANTHER IDs; 43 candidate genes were mapped to 42 PANTHER IDs. Child (sub-categorical) processes are indented below parent processes. Only categories with p<0.05 are shown.

| PANTHER GO-Slim Biological Process | # in category | Observed # | Expected # | Under or over represented | P value | Fold enrichment |
|---|---|---|---|---|---|---|
| metabolic process (GO:0008152) | 644 | 28 | 21.6 | + | 0.033 | 1.3 |
| primary metabolic process (GO:0044238) | 550 | 27 | 18.4 | + | 0.0061 | 1.47 |
| nucleobase-containing compound metabolic process (GO:0006139) | 262 | 17 | 8.8 | + | 0.0031 | 1.94 |
| DNA repair (GO:0006281) | 17 | 3 | 0.6 | + | 0.019 | 5.27 |
| RNA metabolic process (GO:0016070) | 170 | 12 | 5.7 | + | 0.0082 | 2.11 |
| rRNA metabolic process (GO:0016072) | 28 | 4 | 0.9 | + | 0.014 | 4.27 |
| cellular component organization or biogenesis (GO:0071840) | 122 | 8 | 4.1 | + | 0.047 | 1.96 |
| organelle organization (GO:0006996) | 56 | 5 | 1.9 | + | 0.038 | 2.67 |
| chromatin organization (GO:0006325) | 21 | 3 | 0.7 | + | 0.033 | 4.27 |
| cellular component biogenesis (GO:0044085) | 49 | 5 | 1.6 | + | 0.023 | 3.05 |
| biological regulation (GO:0065007) | 168 | 11 | 5.6 | + | 0.020 | 1.95 |
| regulation of biological process (GO:0050789) | 117 | 9 | 3.9 | + | 0.014 | 2.3 |
| regulation of transcription from RNA polymerase II promoter (GO:0006357) | 41 | 4 | 1.4 | + | 0.048 | 2.91 |
| nitrogen compound metabolic process (GO:0006807) | 106 | 9 | 3.6 | + | 0.0075 | 2.54 |
| localization (GO:0051179) | 165 | 1 | 5.5 | - | 0.020 | < 0.2 |
| transport (GO:0006810) | 153 | 1 | 5.1 | - | 0.029 | < 0.2 |


Figure Ch4_S1_1. Tree of 66 species used in the mitochondrial gene flight-loss relaxed-selection analysis. Red and purple coloured branches (flightless and female-flightless) were coded as rate category #1 and blue branches (related flying) were coded as #2; black lineages (background) were in a third rate category. In the 2-rate model used for significance tests, all coloured lineages were coded with the same rate, with the black lineages in a second rate category. Note that this same 66-species tree, with more than one species per hexapod order, was used for analysis Part 5 (HyPhy positive selection analysis) for all 13 mitochondrial protein-coding genes.


Table Ch4_S1_9. Results of relaxed selection analysis via dN/dS ratios of mitochondrial OXPHOS genes in flightless vs flying lineages of pterygotes. The 66-species tree analysed is shown above (Figure Ch4_S1_1). BH = Benjamini-Hochberg correction.

| Gene | Alignment length | dN/dS (3-rate model): background | dN/dS (3-rate model): Flightless (#1) | dN/dS (3-rate model): Flying (#2) | Flightless > Flying | Likelihood value: 3-rate model | Likelihood value: 2-rate model | P value: uncorrected | P value: BH | Significant? |
|---|---|---|---|---|---|---|---|---|---|---|
| COI | 1674 | 0.028 | 0.02609 | 0.01551 | TRUE | -55367.5 | -55376.2 | 0.000032 | 0.00042 | Yes |
| COII | 738 | 0.035 | 0.0377 | 0.01691 | TRUE | -29922.5 | -29927.1 | 0.0024 | 0.010 | Yes |
| COIII | 822 | 0.04 | 0.04087 | 0.02297 | TRUE | -33886.6 | -33891.8 | 0.0011 | 0.0074 | Yes |
| CYTB | 1176 | 0.038 | 0.04507 | 0.03098 | TRUE | -46153.2 | -46157.7 | 0.0028 | 0.0089 | Yes |
| ATP6 | 711 | 0.023 | 0.01482 | 0.01501 | FALSE | -31120.3 | -31120.3 | 0.97 | 1.1 | No |
| ATP8 | 234 | 0.029 | 0.01088 | 0.01002 | TRUE | -6432.67 | -6432.68 | 0.86 | 1.0 | No |
| ND1 | 1005 | 0.041 | 0.03827 | 0.028 | TRUE | -41806.4 | -41808.5 | 0.042 | 0.092 | No |
| ND2 | 1083 | 0.059 | 0.0416 | 0.0279 | TRUE | -49958.9 | -49960.7 | 0.057 | 0.11 | No |
| ND3 | 366 | 0.049 | 0.03771 | 0.03022 | TRUE | -16971.4 | -16971.7 | 0.40 | 0.58 | No |
| ND4 | 1467 | 0.05 | 0.02393 | 0.0297 | FALSE | -62879.4 | -62880.3 | 0.17 | 0.17 | No |
| ND4L | 324 | 0.031 | 0.02927 | 0.02236 | TRUE | -13478.1 | -13478.4 | 0.47 | 0.61 | No |
| ND5 | 1809 | 0.047 | 0.02928 | 0.02153 | TRUE | -80329.5 | -80331.2 | 0.061 | 0.10 | No |
| ND6 | 591 | 0.027 | 0.02497 | 0.00672 | TRUE | -19147 | -19150.7 | 0.0067 | 0.018 | Yes |
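The uncorrected p-values in this table are consistent with a likelihood ratio test comparing the 3-rate and 2-rate models with one degree of freedom. That is an assumption inferred from the tabulated likelihoods, for example COII, where 2 × (29927.1 − 29922.5) = 9.2 and p ≈ 0.0024. The sketch below, with hypothetical function names, shows that test together with the textbook Benjamini-Hochberg step-up adjustment. The textbook adjustment enforces monotonicity and caps values at 1, so it will not necessarily reproduce the BH column, which includes a value above 1.

```python
from math import erfc, sqrt


def lrt_p_value_df1(lnl_full, lnl_reduced):
    """Likelihood ratio test p-value for nested models differing by one
    parameter (chi-square with 1 degree of freedom, an assumption here)."""
    stat = max(2.0 * (lnl_full - lnl_reduced), 0.0)
    # Chi-square (df = 1) survival function expressed via erfc.
    return erfc(sqrt(stat / 2.0))


def benjamini_hochberg(pvals):
    """Textbook BH-adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# COII: log-likelihoods of the 3-rate and 2-rate models from the table.
p_coii = lrt_p_value_df1(-29922.5, -29927.1)
print(f"COII LRT p-value: {p_coii:.4f}")
```

Because the tabulated log-likelihoods are rounded, p-values recomputed this way can differ slightly from the table, most visibly for genes where the two models have nearly identical likelihoods.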
