Assessing the impacts of environmental change on British pollinators (Syrphidae) using next generation sequencing techniques
Hannah Norman
Submitted for the degree of
Doctorate of Philosophy
Department of Life Sciences, Imperial College London
Declaration of Originality
All the work in this PhD is mine, and any programs, data and work which is not mine or was not carried out by me is referenced within the text.
Copyright Declaration
The copyright of this thesis rests with the author. Unless otherwise indicated, its contents are licensed under a Creative Commons Attribution-Non-Commercial 4.0 International License (CC BY-NC). Under this licence, you may copy and redistribute the material in any medium or format. You may also create and distribute modified versions of the work. This is on the condition that: you credit the author and do not use it, or any derivative works, for a commercial purpose. When reusing or sharing this work, ensure you make the licence terms clear to others by naming the licence and linking to the licence text. Where a work has been adapted, you should indicate that the work has been changed and describe those changes. Please seek permission from the copyright holder for uses of this work that are not included in this licence or permitted under UK Copyright Law.
Abstract
British pollinating insects are vital for their contribution to crop yields as well as maintenance of semi-natural environments across the UK. The Syrphidae family are highly diverse and thought to be an important pollinating group. In order to understand the evolutionary relationships in this family, mitochondrial genomes were used to build the largest tree of the Syrphidae family to date. This tree was used to establish relative ages for divergences within the family, and to explore the evolution of diverse syrphid larval life histories. The recent introduction of a UK pollinator monitoring programme has resulted in a large amount of data with the potential to inform conservation and management, presenting an opportunity for DNA analysis. A CO1 reference database was curated and tested for UK syrphids, resulting in sequences for 70% of UK species and highlighting difficulties in DNA identification. Barcoding of syrphids was then expanded by also metabarcoding pollen from gut samples. This showed the diversity of plant – syrphid interactions at an individual level across three different land use types. It also highlighted the importance of including syrphid larval life histories in analyses of this family, as these appeared to have a larger effect on species composition than pollen composition did. Finally, the reach of this thesis was expanded to include non-bee and non-syrphid pan trap visitors from the pollinator monitoring programme. This highlighted the diversity of non-syrphid Dipterans and allowed species and phylogenetic diversity to be compared in these bulk insect samples. Using DNA methods to analyse monitoring data has the potential to increase our knowledge of cryptic species, phylogenetic diversity and pollinator associations. Alongside this, DNA can be used to analyse large datasets of diverse pollinating insects which otherwise would be overlooked.
Table of Contents
Declaration of Originality
Copyright Declaration
Abstract
List of tables and figures
  Chapter 1
  Chapter 2
  Chapter 3
  Chapter 4
Literature review
  Pollination
  Syrphid pollinators
  Monitoring
  DNA monitoring
  Plant – pollinator networks
  Phylogenetics
  Conclusion
Chapter 1: Syrphidae Mitochondrial Genomes and Phylogeny
  Introduction
  Methodology
  Results
  Discussion
  Conclusion
Chapter 2: Developing and testing a Syrphidae CO1 reference database
  Introduction
  Methodology
  Results
  Discussion
  Conclusion
Chapter 3: Hoverfly diversity and pollen associations in a semi-urban landscape
  Introduction
  Methods
  Results
  Discussion
  Conclusion
Chapter 4: Metabarcoding of pan trap bycatch to identify non-bee and non-syrphid flower visitors
  Introduction
  Methods
  Results
  Discussion
  Conclusion
Discussion
Acknowledgements
Bibliography
Supplementary Information
  Chapter 1: Syrphidae Mitochondrial Genomes and Phylogeny
  Chapter 2: Developing and testing a Syrphidae CO1 reference database
List of tables and figures
Chapter 1: Table 1, Table 2, Table 3, Figure 1, Figure 2, Figure 3, Figure 4, Figure 5
Chapter 2: Table 1, Figure 1, Figure 2, Table 2, Figure 3, Figure 4, Figure 5, Figure 6
Chapter 3: Figure 1, Table 1, Table 2, Figure 2, Figure 3, Table 3, Figure 4, Table 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10
Chapter 4: Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6
Literature review
Pollination
In the UK, crops and wildflowers are pollinated by a wide variety of insects, which provide an important ecosystem service to both semi-natural and agricultural land. It has been estimated that around 78% of temperate plants are pollinated by animals (Ollerton, Winfree and Tarrant, 2011), and without animal pollinators many crop yields would be greatly reduced or eliminated (Klein et al., 2007). There is a huge diversity of pollinating insects in the UK, with over 280 species of bees (Falk, 2018) and 280 species of syrphid (Ball and Morris, 2015), both of which are important pollinators (Garibaldi et al., 2013). There are also many non-bee flower-visiting insects which contribute an unknown amount to pollination (Orford, Vaughan and Memmott, 2015).
Our population is growing, increasing the demand on land for housing and agriculture. It is therefore important for farmers to maximise their crop yields. However, at the same time pollinators are facing a number of threats (Vanbergen and the Insect Pollinators Initiative, 2013) which work in concert to reduce overall pollinator health and productivity (Goulson et al., 2015). Some of these threats are headline-grabbing, such as neonicotinoid pesticides, which appear to affect bee behaviour and reduce pollination efficiency (Godfray et al., 2014; Rundlöf et al., 2015). Others include diseases and parasites, spread from commercial bee hives into the wild (Singh et al., 2010; Fürst et al., 2014), which have a larger impact on individuals already weakened by pesticides (Evans et al., 2018). For pollination, the threat is not just that pollinating species will become less efficient or diseased, but that they will become uncoupled from the plants that they pollinate. This could be caused by climate change, which may shift species' distributions by pushing their southern range limits north without expanding their northern limits (Kerr et al., 2015; Rafferty, 2017), and may also disrupt plant-pollinator interactions by uncoupling emergence times (Memmott et al., 2007). In the UK there is also an increased threat to pollinators from land use change (Vanbergen, 2014), as more and more of our environment is given over to housing our increasing population, or growing the food needed to sustain it (D. Senapathi et al., 2015). This can result in pollinator deserts, where large swathes of monoculture provide no food for the insects.
Of course, the way that different pollinators will react to these threats is not straightforward. New bee and syrphid species have been arriving in the UK in recent years (Cross and Notton, 2017; Notton and Norman, 2017), suggesting that for some pollinator species climate change is opening up niches, and this may in turn benefit pollination services in the UK. Alongside this, pollinators react in different ways to land use changes, with some species thriving in urban areas due to the floral resources available in gardens (Bates et al., 2011). Overall, ecosystems are currently experiencing extraordinary upheaval, and their ability to continue to function remains uncertain. Their resistance and resilience to these changes depend in large part on interactions among species, and so the UK government are recognising the importance of monitoring pollinating insects (Department for Environment, 2014).
It appears that crop pollination alone may not be a sufficient argument for conserving species diversity of pollinators, as a small proportion of total bee species provide the vast majority of flower visits to crops (Kleijn et al., 2015). This could be interpreted by those interested in pollination services, such as policy makers and farmers, as meaning that species diversity is unimportant for pollination, and even that the role could be filled by commercial pollinators. However, in reality the situation is more complex. Sapir et al. (2017) found that the pollination efficiency of honeybees on apples was increased by the presence of wild bee species, likely because they introduced competition for flowers, decreasing the time an individual spent on a single flower. Other studies have shown that increased diversity of pollinators increases fruit set and yield (Klein, Steffan-Dewenter and Tscharntke, 2003; Hoehn et al., 2008), and that wild bees are vital for this increase (Holzschuh, Dudenhöffer and Tscharntke, 2012; Garibaldi et al., 2013). It is also important to mention that pollinators are similarly vital in semi-natural ecosystems, where they increase floral diversity and are therefore important contributors to those ecosystems. In an environment such as the UK, where pollinators face rapid change and numerous threats, it is also important to think about the resilience of a community. Reducing conservation of pollinators to the small number of bee species which are currently the most important crop pollinators fails to provide resilience to change. Ensuring a diverse community of pollinators means that the community is more likely to be able to adapt to changes and threats in the environment (Deepa Senapathi et al., 2015), and continue to provide a pollination service.
Syrphid pollinators
In this thesis the focus is on the Syrphidae family (Diptera), a large group of flower-visiting flies that comprises some 5000 species and provides important pollination services (Jauker and Wolters, 2008). They are commonly known as hoverflies in the UK, where there are 280 species (Ball and Morris, 2015). Recent research has shown that there has been a long-term decline in syrphids across the UK, and that the rate of decline is markedly different from that of UK bees (Powney et al., 2019), making them an important group for monitoring and research in the UK.
One of the reasons for the difference in decline is that syrphids respond differently to land use change in comparison to most bee species (Jauker et al., 2009). Some research has shown that syrphid species diversity decreases with increasing urbanisation (Bates et al., 2011). This may be due to a reduction in potential larval habitat, or to the abundance of complex floral morphologies which prevent flies accessing pollen (Geslin et al., 2013). Unlike bees, whose larvae feed solely on provisions from flowers, syrphid larvae have a large range of larval habitats and diets (Ball and Morris, 2015). Larval life histories range from predatory to aquatic to phytophagous. Aphidophagous species in particular may be able to thrive in agricultural environments and field edges (Sutherland, Sullivan and Poppy, 2001; Haenke et al., 2009), and may provide important biocontrol (Pascual-Villalobos et al., 2006).
Adult syrphids also have markedly different ecology from bees, which affects their response to land use change. It has been suggested that syrphids can travel longer distances than bees, since they do not have to return to a nest site (Rader et al., 2011). This would make them more able to survive in substandard environments, and more adaptable to change. However, this distance is not known, nor are the impacts of travelling long distances in search of floral resources. There is evidence that syrphids are generalist pollinators, allowing them to exploit whichever floral resources are available (Branquart and Hemptinne, 2000). However, the morphology of syrphid mouthparts means that flower morphology is important, with fewer open flowers resulting in fewer flower visits (Geslin et al., 2013). These complex interactions and the large number of factors involved mean that it is hard to predict how species will respond to land use change without evidence (Bartomeus et al., 2018).
There has been little research into how other known threats to pollinators affect syrphids. It has been found that neonicotinoids have no effect on the potentially vulnerable aquatic larvae of some species (Basley et al., 2018); however, the impact of neonicotinoids in pollen on adult flies is unknown. In bees, disease is a stress factor, but the effect of disease on syrphids is less well known. It is unlikely that disease is a major issue in this family, as syrphids do not form social nests or aggregations, unlike many bee species, and so there is less chance of disease spreading through a population. However, it has been suggested that flowers are important points of disease and parasite exchange (Schwarz and Huck, 1997), since many different individuals of different species can visit a single flower. Alongside this, it was recently found that syrphids are capable of spreading bee diseases between flowers, potentially increasing bees' exposure to disease (Bailes et al., 2018). Surprisingly, there has been no research into whether these diseases were affecting the syrphid carriers, or how these diseases may be impacting syrphid populations.
Monitoring
The UK has monitoring programmes for many different groups, for one of two reasons. First, there are animals and plants which it has a legal obligation to monitor. For example, the great crested newt must be surveyed for in areas where there are planned developments (Rees et al., 2014). The second type of monitoring is for groups for which long-term data is available regarding trends and habitat, tracking populations and communities over time to detect any large changes. There are several examples of this, including the UK bat survey (Barlow et al., 2015) and the plant monitoring scheme of Britain and Ireland (Pescott et al., 2015). One example of how these long-term monitoring schemes can be vital for detecting change is the UK Breeding Bird Survey. This has been running since 1994 and has provided sufficient long-term data on breeding birds in the UK to detect changes in their migration phenology (Newson et al., 2016). These long-term monitoring programmes are vital for conservation, especially with the increasing threats of climate change and land use change. Understanding how populations and communities are changing is the first step towards ensuring they are conserved.
The majority of monitoring in the UK is carried out by volunteers who give their time and expertise for free to record wildlife. This enables large-scale long-term monitoring at much lower cost than would be possible using paid experts, and thus increases the feasibility and longevity of programmes. However, setting up a new monitoring programme is a large and expensive undertaking because the surveying needs to be planned in a way that will provide meaningful data, and long-term survey sites need to be identified and accessed. Monitoring programmes also require volunteer coordination and the ability to compile, store and analyse the data collected. In 2015, a pilot was launched with Defra and The Centre for Ecology and Hydrology (CEH) to monitor bee and syrphid pollinators across the UK (Carvell et al., 2016). This has resulted in a programme which is a combination of volunteer and expert identification. Volunteers set up pan traps and collect insects which are sent to paid taxonomists to be identified. They also conduct flower visitor observations (FIT counts), recording all of the insects that visit a flower over 5 minutes.
For many monitoring programmes, such as the breeding bird survey, volunteers are able to identify species during the survey. This is because there are a lot of amateur experts who take part in surveys and have a wealth of knowledge about the groups they are surveying (Newson et al., 2016), but also because it is fairly easy to identify some groups by sight. However, this is not the case for the pollinator survey, where many species of bee and syrphid are difficult to identify, and some require careful examination out of the field. Because of this, volunteer observations only distinguish between groups of pollinators such as bee, syrphid and other Diptera. This allows a large amount of data to be collected by volunteers; however, it does not give the species-level data required to understand detailed differences in communities over time. In order to obtain this data, the scheme also includes pan trapping in 12 locations five times over the course of the summer months. This provides a snapshot of the pollinator community, with the bees and syrphids identified by expert taxonomists to species level. Although pan trapping involves lethal sampling, where individuals are removed from the environment, a study by Gezon et al. (2015) found no detectable effect of repeated lethal sampling on wild bee abundance or diversity.
DNA monitoring
Paying taxonomists to identify a large sample of diverse insects is time consuming and expensive. For a group such as pollinators it most likely requires several taxonomists with different specialisms, due to the different groups present. Currently, this makes the monitoring programme expensive, and may result in long waiting periods while the specimens are identified. In the future this may become more of an issue, since the number of taxonomists is declining around the world, resulting in less expertise for identifying these groups. Recent studies have shown that DNA can provide accurate identification of specimens to species level (Creedy et al., 2019). Utilising DNA in the correct manner could be a beneficial way forward for taxonomists and molecular biologists alike, as it would allow taxonomists to transfer their skills from routine species identification to species discovery and identification of cryptic species. It is important to emphasise that a move to using DNA methods for species identification cannot be made without the vital input of taxonomists, who are indispensable for identifying reference specimens, validating DNA methods and establishing species identification for cryptic species that DNA cannot separate.
Recently, this move towards using DNA to monitor plants and animals has been gaining interest from policy makers, and for some groups DNA is already employed in monitoring. The best-established use of DNA monitoring in the UK is the use of environmental DNA (eDNA) to detect the presence of great crested newts (Rees et al., 2014; Biggs et al., 2015). This eDNA method does not require sequencing, making it faster and cheaper than other DNA methods. Instead, it utilises species-specific primers which exclusively amplify great crested newt DNA using qPCR. This method is now successfully used in monitoring of this species across the UK. In the case of great crested newts, a single species is being detected. This is different to monitoring of a mixed community, where a number of different species are present. If there were only a few species of interest, then the same methods could be applied but with several different primer sets for different species. However, for studies where there are many different species of interest, or where the focus is not on particular species but on the community as a whole, a broader approach must be taken, using methods such as DNA barcoding, metabarcoding or genome skimming.
There are several DNA techniques that are not yet used in UK monitoring, which would enable large scale analysis of diverse samples. For moderately diverse samples, Illumina barcoding can be used. In this method, individuals from a sample are separated, and DNA is then extracted separately from each. When the barcoding region is amplified using PCR, a DNA tag is attached to the primers, allowing each sequence to be traced back to its specimen. Using different tags means multiple samples can be pooled together. Post sequencing, the tags are used to separate out each sample (Shokralla et al., 2015; Creedy et al., 2019). This has the benefit over other bulk sequencing methods of retaining the link between specimen and sequence, resulting in more powerful data which can be verified easily, and giving accurate abundance data. This is important in monitoring, where the abundance of species present in populations is vital information.
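The tag-based separation of pooled reads described above can be illustrated with a minimal sketch. It assumes every read starts with a fixed-length tag that maps to exactly one specimen; the tag sequences, specimen labels and function name are hypothetical, and a real pipeline would also need to handle sequencing errors, primer trimming and paired reads.

```python
from collections import defaultdict

# Hypothetical tag-to-specimen map; real tags are defined by the lab protocol.
TAG_TO_SPECIMEN = {
    "ACGTAC": "specimen_01",
    "TGCATG": "specimen_02",
}
TAG_LENGTH = 6  # assumed fixed tag length


def demultiplex(reads, tag_to_specimen=TAG_TO_SPECIMEN, tag_length=TAG_LENGTH):
    """Group pooled reads by their leading tag and strip the tag.

    Reads whose tag is not in the map are kept whole under 'unassigned'.
    """
    by_specimen = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        specimen = tag_to_specimen.get(tag, "unassigned")
        # Keep the full read when the tag is unknown, so nothing is discarded.
        by_specimen[specimen].append(insert if specimen != "unassigned" else read)
    return dict(by_specimen)


pooled = ["ACGTACGGATTA", "TGCATGCCTTAA", "ACGTACGGATTT", "NNNNNNGGGGGG"]
result = demultiplex(pooled)
```

Because each read stays linked to one specimen, the number of specimens assigned to a species after identification gives the abundance data referred to above.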
For larger and more diverse samples, metabarcoding can be used. This technique uses PCR of a chosen barcode on a large diverse sample (Yu et al., 2012). A mixed sample is used, with multiple unknown species present, such as from a highly diverse soil community (Arribas et al., 2016). Identifying these species morphologically requires a lot of taxonomic expertise and could take years, and so molecular identification allows monitoring of biodiversity in these species-rich ecosystems. This methodology can be used for mixed samples of invertebrates, and is particularly useful for highly diverse communities such as rainforest canopy insects (Creedy, Ng and Vogler, 2019), and for challenging groups such as those found in soil communities (Andújar et al., 2015). Pan trap samples such as those obtained by the pollinator monitoring scheme (Carvell et al., 2016) would be ideal for metabarcoding, as they contain a diverse community of insects. Currently only the bees and syrphids from these samples are identified, so including molecular analysis of the bulk samples could increase the amount of data gained from this monitoring scheme.
In order to identify the different species correctly, a well-curated reference database is required, which provides a connection between the DNA and the traditional taxonomy of species. Monitoring programmes mostly require species-level identifications, making DNA monitoring without linking to taxonomic species less useful. These reference databases are already being developed, particularly at country level. For example, there is now a curated reference database for Canadian bee species (Sheffield et al., 2017). For reference databases to be robust, they should ideally be linked to a database of specimens, so that the identifications can be verified. We already have long-term storage of specimens in museums, and these institutions are preserving specimens for DNA analysis. An example is the Natural History Museum in London, which contains a molecular collection facility where specimens and DNA can be stored for future use.
There has been increasing focus on developing robust reference databases for monitoring fauna, and reference databases for regional bee fauna have been developed in Canada (Sheffield et al., 2017), Chile (Packer and Ruz, 2017), Ireland (Magnacca and Brown, 2012) and the UK (Creedy et al., 2019). A reference database has also been developed for Afrotropical hoverflies (Jordaens et al., 2015). On a larger scale, the Barcode of Life Initiative have a long-running project to barcode all life on Earth (Savolainen et al., 2005), which has resulted in large initiatives such as the German Barcode of Life Initiative, to barcode all German biodiversity (Geiger et al., 2016).
Plant – pollinator networks
Increasingly, research into pollinators is encompassing a network approach, as interactions between pollinators and plants are vital to understanding the ecology, threats and changes (Vanbergen, 2014) in these communities. Studies have shown that the structure of a network influences the response to habitat loss (Fortuna and Bascompte, 2006), and that species loss does not have the same effect on networks as interaction loss (Santamaría et al., 2016), suggesting that researching and quantifying pollinator communities cannot be done using species diversity and abundance alone (Forup et al., 2007). Networks also potentially give a greater understanding of why changes are occurring. For example, they help to explain why syrphid flower visitation is lower in urban areas, which is likely due to the abundance of complex floral morphologies (Geslin et al., 2013), and why specialised pollinators are more vulnerable to change (Weiner et al., 2014).
Pollination networks are often constructed using data from flower visits, collected by observing flowers and recording insect visitors for a set length of time (Garbuzov, Samuelson and Ratnieks, 2015). Just recording visits assumes that every visit results in pollination, and so several studies have translated visitation into pollination using exclusion experiments (Ballantyne, Baldock and Willmer, 2015). These have shown that visitation is a useful proxy for pollination. However, these methods are limited in the area over which networks can be surveyed. Recording flower visits is labour intensive and requires participants with good taxonomic expertise. These considerations mean that implementing a large-scale monitoring programme for pollination networks is not a feasible management option. There is therefore a demand for a new methodology for establishing pollination networks, which addresses these concerns and presents a viable management option for monitoring this ecosystem service.
Pollination networks involve an exchange of plant material from the plant to the insect in the form of pollen, which provides a record of which plant species an individual has visited (Tur et al., 2014). By using pollen on an insect’s body to establish network links, the need for lengthy field observations is removed. Identifying pollen to species is already used in forensic biology (Karen L Bell et al., 2016), as well as for identifying honey sources (Hawkins et al., 2015), and has been shown to identify network links which were missed in field observations (Bosch et al., 2009). However, morphologically identifying pollen is time consuming, and still requires taxonomic expertise, which is often a limited resource. Identifying pollen under the microscope can be especially difficult in some families such as Rosaceae, where pollen cannot be separated to species level (Kendall and Solomon, 1973). Rather than identifying pollen from morphology, molecular methods can be used to establish the plant species present. These methods are well suited to processing large amounts of data, thus enabling large scale monitoring of the pollination system (Baird and Hajibabaei, 2012), as well as being well suited to identifying hidden interactions such as pollen on or inside an insect (Evans et al., 2016).
DNA has been used to identify pollen collected from hives (Keller et al., 2015), and from honey (Bruni et al., 2015; Hawkins et al., 2015; de Vere et al., 2017), where there is a large mixed sample of pollen. Alongside this, small mixed samples of pollen from individual insects have also been used to establish individual visitation networks, with metabarcoding of pollen at the specimen level (Lucas et al., 2018). Here, each individual is treated like a community, with the diverse pollen in or on the insect treated as a soil arthropod community would be (Arribas et al., 2016), or a pollen sample taken from a hive.
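Treating each insect as its own pollen community can be sketched as follows. This is an illustrative example only: the specimen records, read counts and read-count threshold are hypothetical, and it assumes that pollen reads have already been assigned to plant taxa.

```python
from collections import defaultdict

# Hypothetical per-specimen results: (specimen id, insect species, {plant: read count}).
specimens = [
    ("s1", "Episyrphus balteatus", {"Rubus fruticosus": 120, "Taraxacum officinale": 3}),
    ("s2", "Episyrphus balteatus", {"Rubus fruticosus": 45}),
    ("s3", "Eristalis tenax", {"Taraxacum officinale": 80, "Hedera helix": 60}),
]

MIN_READS = 10  # assumed cut-off for discarding low-abundance, possibly spurious detections


def build_network(records, min_reads=MIN_READS):
    """Count, for each insect-plant pair, how many individuals carried that plant's pollen."""
    links = defaultdict(int)
    for _specimen_id, insect, pollen_counts in records:
        for plant, reads in pollen_counts.items():
            if reads >= min_reads:
                links[(insect, plant)] += 1
    return dict(links)


network = build_network(specimens)
```

Link weights here are numbers of individuals rather than read counts, a design choice made for this sketch because read counts from PCR-based methods do not necessarily reflect pollen quantity.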
There are some complications with DNA barcoding of plants. Unlike with insects, where the CO1 barcoding region is accepted as a barcode with appropriate inter-specific variation, there is not a single barcode which can be used to identify all plant species (CBOL Plant Working Group et al., 2009). This is due to differences in variability among plant groups, meaning that more than one barcode should be used to identify pollen. Studies have used rbcL (de Vere et al., 2012), trnL (Taberlet et al., 2007), ITS2 (Yao et al., 2010) and matK, although a major constraint is the reference database. If pollen from a plant species that is absent from the reference database is present in the unknown sample, that species will remain unidentified. This limitation is true both for barcoding of the insects and of the plants, and therefore an important step in developing a framework for monitoring pollination networks is to build a well-curated reference dataset of plant and syrphid barcodes (de Vere et al., 2012).
Phylogenetics
Currently, reasons for using DNA in monitoring and conservation focus on fast and accurate identification and high-throughput data analysis. However, DNA also provides the opportunity to gain more information from the data than traditional methods, in the form of potential associations such as gut contents, parasites and disease. DNA also allows the analysis of phylogenetic diversity, which adds an extra measure of the health and resilience of a community alongside species diversity and abundance. Measures such as phylogenetic diversity and species diversity may react differently to changes in the environment (De Palma et al., 2017), and so not including these measures may result in important information for conservation being lost. This can be important for pollination, as a study by Grab et al. (2019) found that increasing agricultural land led to a decrease in phylogenetic diversity, alongside a decrease in pollination services.
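To illustrate how phylogenetic diversity can differ between communities with the same number of species, the sketch below computes one common measure, Faith's PD, as the summed length of all branches linking a set of taxa to the root. The text above does not state which measure is used, so the choice of Faith's PD is an assumption, and the toy tree, branch lengths and function name are hypothetical.

```python
# Toy rooted tree: each node maps to (parent, branch length to parent).
# The root has no entry. All values are made up for illustration.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
}


def faith_pd(taxa, tree=TREE):
    """Sum the lengths of all branches on the paths from the given taxa to the root.

    Each branch is counted once, however many taxa share it.
    """
    visited = set()
    total = 0.0
    for taxon in taxa:
        node = taxon
        # Walk towards the root, stopping at the root or at a branch already counted.
        while node in tree and node not in visited:
            parent, length = tree[node]
            visited.add(node)
            total += length
            node = parent
    return total


pd_close = faith_pd(["A", "B"])  # two closely related taxa
pd_far = faith_pd(["A", "C"])    # two distantly related taxa
```

Both communities contain two species, but the pair of distant relatives has the higher value, which is the kind of information that species richness alone would miss.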
Sequencing the large amount of data needed for phylogenetics can be an expensive process; however, sequencing costs are falling and new techniques allow more data to be sequenced at once, thus further reducing costs (Shapland et al., 2015). Mitochondrial metagenomics is a technique by which multiple species can be sequenced in a single pooled library, enabling full mitochondrial genomes to be obtained for many specimens at once (Andújar et al., 2015). The DNA is sequenced using shotgun sequencing, without PCR amplification, thus reducing cost and potential bias towards certain species. This technique has been used to produce large amounts of genetic data to create well-resolved phylogenetic trees in diverse groups such as the beetles (Crampton-Platt et al., 2015), and has been used to monitor wild bee populations (Tang et al., 2015).
There have been several published phylogenies of the Syrphidae family, although they generally have low numbers of taxa (Skevington and Yeates, 2000; Pauli et al., 2018) and low amounts of genetic data (Ståhls et al., 2003). There is often a trade-off between the amount of genetic data included in a phylogeny and the number of taxa included, because generating large amounts of data for large numbers of taxa is time consuming and expensive. However, using mitochondrial metagenomics helps to reduce the cost of generating whole mitochondrial genomes for large numbers of specimens (Andújar et al., 2015). A study using mitochondrial genomes for deep relationships in Diptera found that they were an informative source of phylogenetic data, resulting in topologies which agreed with consensus (Cameron et al., 2007). There are some mitochondrial genomes available for Syrphidae: two from Li (2019), and five Eristalinus genomes from Sonet et al. (2019); however, for a comprehensive phylogeny, more mitochondrial genomes are needed. Due to the limited number of taxa in current phylogenies, the focus of the topologies has been the relationship between the three subfamilies, Syrphinae, Eristalinae and Microdontinae. Ståhls et al. (2003) found the three subfamilies to be monophyletic, but other studies have found Eristalinae to be paraphyletic (Skevington and Yeates, 2000; Mengual, Ståhls and Rojo, 2015; Pauli et al., 2018).
A well-supported phylogeny with well-represented taxa would allow phylogenetic diversity to be investigated along with other measures of diversity. It also allows investigation of evolution of traits. In the case of syrphids, the diverse range of larval diets may be important when looking at species composition and habitat type, and give a greater understanding of the evolution of this large family of pollinators.
Conclusion
UK pollinators are an important community which contribute to agricultural yields as well as the maintenance of many habitat types across the country. They are often in the news; however, most attention and research focus is on bees, which although important pollinators form only a fraction of the taxonomic diversity of flower visitors. The Syrphidae family are an important pollinator group in the UK and appear to be responding differently to bees to threats such as land use change and climate change. DNA offers an opportunity to identify and survey this family in an efficient and accurate manner, whilst also increasing the data to include plant visitation and phylogenetic diversity. In this thesis the evolutionary history of the Syrphidae family will be explored with the largest Syrphidae phylogeny to date, resulting in further insights into the larval life histories of this family. The feasibility of identifying and monitoring syrphids using DNA will be investigated, before integrating pollen network data alongside this to investigate how syrphids are using a diverse mosaic landscape. Finally, the scope of this thesis will be expanded to investigate the non-bee and non-syrphid flower visitors across the UK, to give a fuller picture of UK pollinator diversity using DNA techniques.
Chapter 1: Syrphidae Mitochondrial Genomes and Phylogeny
Introduction
Pollinators are the subject of conservation, monitoring and research efforts around the world, many of which are investigating the species diversity of communities. However, functional and phylogenetic diversity are also important for understanding their resilience. Communities with low phylogenetic diversity are more likely to be impacted by a sudden change to the environment, and there is evidence that some insect communities are becoming less evolutionarily diverse in agricultural land (Grab et al., 2019). This is important for pollination of crops, for which it has been shown that an increase in species diversity leads to an increase in fruit set (Holzschuh, Dudenhöffer and Tscharntke, 2012; Garibaldi et al., 2013). The lack of a robust and well sampled phylogeny for the Syrphidae family means that there is a knowledge gap around how evolutionary diversity is impacting this important pollinator group.
The most recent tree of the Syrphidae is from 2016 (Young et al., 2016), using hybrid enrichment to obtain 559 loci for 30 species of Syrphidae, which established the relationships between the three subfamilies. Earlier trees used mitochondrial and nuclear genes, as well as morphological characters to look at the relationships (Skevington and Yeates, 2000; Ståhls et al., 2003; Mengual, Ståhls and Rojo, 2015). These studies have found the subfamily Microdontinae to be sister to the rest of Syrphidae, and Syrphinae to be monophyletic within a paraphyletic Eristalinae. These studies contain varying amounts of molecular data, but all contain low numbers of species. There is therefore a need for a phylogeny with a large number of taxa and a large amount of molecular data.
There is very little molecular data publicly available for the Syrphidae family beyond CO1 barcodes, which limits the confidence in current phylogenetic analysis. To date, only three mitochondrial genomes are available on GenBank, in addition to five mitochondrial genomes for the genus Eristalinus (Sonet et al., 2019), which together represent only six genera. Recent advances in sequencing and bioinformatics have allowed an increasingly large amount of data to be obtained, although large scale sequencing of genomes is still expensive. Mitochondrial genomes are much easier to obtain than nuclear genomes, due to their smaller size (around 15,000 bp in insects), and the fact that they are present in higher copy number in cells. This makes them amenable to genome skimming, i.e. genome assembly from low-coverage shotgun sequencing of total DNA (Straub et al., 2012). In a technique known as mitochondrial metagenomics (MMG), several specimens can be sequenced together in a single pooled library (Tang et al., 2014; Crampton-Platt et al., 2015). The mitochondrial genomes can then be assigned to the original specimens in the pool using bait sequences, usually of the CO1 barcode. This hugely reduces the cost of sequencing each individual, making mitochondrial genomes fairly easy to obtain for larger numbers of individuals. However, working with mixtures can lead to the formation of chimeras if sequences are assembled incorrectly post sequencing. To help avoid this, the specimen pooling is done to maximize the genetic distance between individuals in the pools. The approach has been widely used and applied successfully in phylogenetic studies of several groups, including Coleoptera (Crampton-Platt et al., 2015), high level Hymenoptera (Mao, Gibson and Dowton, 2015) and other groups within Diptera (Zhang et al., 2019).
Mitochondrial genomes can go a long way towards generating the desired greater number of nucleotides for each species, while also greatly expanding the number of specimens that can be sampled. MMG thus overcomes an issue in phylogenetics, where increasing the amount of data and number of species results in a more robust tree, and thus resolves the problem of prioritising more data or more species when building trees (Rokas and Carroll, 2002). However, as with any type of single locus marker, mitochondrial genomes suffer from non-uniform character variation that creates biases in the resulting tree searches and potentially leads to incorrect topologies, potentially with high support. Therefore, appropriate model choice and the use of efficient tree searches are critical to obtain the most realistic topology possible. Beyond parsimony, the most widely used model-based tree building method is Maximum Likelihood. This has efficient implementations in programs such as RAxML (Stamatakis, 2014), which is suitable for large datasets of the kind produced in MMG. Maximum likelihood methods can also deal with differences in the type of nucleotide variation within mitochondrial genomes. Genes are transcribed from different strands, leading to GC skew and great differences in codon usage, while rates vary among genes and between codon positions. Models of evolution can accommodate the heterogeneity in rates and nucleotide composition to various degrees and provide an accurate estimation of how the sequence is evolving and thus establish the best tree topology. The same is true with partitioning the data, which improves the models by estimating parameters separately for different portions of the data, for example different genes within an analysis or different codon positions.
Bayesian analysis can potentially implement more complex models, due to a more efficient search strategy using MCMC. The basic implementations use the same GTR (general time reversible) models as the popular likelihood approaches, but they provide an alternative implementation of the tree searches which potentially reveals nodes of low confidence or other weaknesses in the inferences. In addition, Bayesian approaches can implement more complex models, which allow the application of multiple independent evolutionary models with their own rate parameters and substitution matrices. In programs such as BEAST (Drummond et al., 2012), molecular clock models are also implemented, which can be allowed to vary across the tree. This allows the dating of trees and calculation of the evolutionary rate, which provides greater information about the evolution of the family. However, for large trees implementing these methods can be time consuming, and so methods such as least squares allow dating using a Gaussian model to estimate substitution rates and ages of ancestral nodes across the tree (To et al., 2016) in a fast and accurate manner.
In this study the aim is to increase the number of mitochondrial genomes available for the Syrphidae family and use these to produce a phylogenetic tree with a large number of taxa to investigate subfamily and tribe level relationships. This tree will then be further used to look at the evolutionary rates and relative ages of important nodes within this family, and to map the evolution of the diverse larval life histories across the phylogeny.
Methodology
Taxa Choice and DNA preparation
DNA for the mitochondrial genome sequencing was obtained through the Canadian National Collection (CDC), and the specimens were selected from the collections at the CDC. The species were chosen to represent the spread of genera and subfamilies, according to the topology of Young et al. (2016). Several species from closely related families were also chosen to form an outgroup. DNA was extracted from the samples by the collaborators at the CDC and DNA concentrations were measured using the Qubit high sensitivity kit. The specimens were identified by taxonomists, and alongside the DNA for shotgun sequencing the CO1 barcoding region was amplified and Sanger sequenced to obtain a reference barcode for each specimen, which could be used later in the analysis as baits to identify contigs.
The DNA pooling was based firstly on the taxonomy of the specimens, and a CO1 tree was generated in RAxML using the CO1 barcodes to check the phylogenetic relationships. The libraries were designed to maximize phylogenetic diversity and to have equimolar concentrations as in previous metagenomic studies (Gillett et al., 2014). Specimens from the same genera were not pooled in the same library, as these are highly likely to form chimeras during the assembly, and it can be difficult to distinguish closely related species using CO1 barcodes. The specimens were pooled so that each library had equimolar concentrations of each specimen, giving a total of 200ng of DNA per library. Where possible with taxonomy constraints, samples with similar concentrations were pooled together. This meant that high concentration samples were not diluted to compensate for low concentrations, increasing the potential to recover these specimens. It also meant that low concentration samples were more likely to be recovered, as even after equimolar pooling high concentration samples likely have higher quality DNA.
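The pooling arithmetic described above can be illustrated with a short sketch. It assumes that equal DNA mass per specimen is used as the proxy for equimolar pooling, which is reasonable only if genome sizes are broadly similar; the function name, specimen labels and Qubit readings are hypothetical and are not taken from the thesis data.

```python
def pooling_volumes(concentrations_ng_per_ul, total_ng=200.0):
    """Return the volume (in microlitres) of each extract for an equal-mass pool.

    concentrations_ng_per_ul: dict mapping specimen label -> Qubit reading (ng/uL).
    total_ng: total DNA mass wanted in the pooled library (200 ng in the text).
    """
    if not concentrations_ng_per_ul:
        raise ValueError("at least one specimen is required")
    # Every specimen contributes the same share of the total mass.
    per_specimen_ng = total_ng / len(concentrations_ng_per_ul)
    volumes = {}
    for specimen, conc in concentrations_ng_per_ul.items():
        if conc <= 0:
            raise ValueError(f"non-positive concentration for {specimen}")
        volumes[specimen] = per_specimen_ng / conc
    return volumes


# Hypothetical Qubit readings (ng/uL) for a four-specimen library.
example = {"spec_A": 10.0, "spec_B": 20.0, "spec_C": 5.0, "spec_D": 40.0}
volumes = pooling_volumes(example)
for specimen, vol in volumes.items():
    print(f"{specimen}: {vol:.2f} uL")
```

Grouping specimens of similar concentration, as described in the text, keeps these volumes within a practical pipetting range for every member of a pool.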
After the samples were pooled, they were dehydrated and shipped to the UK where they were re-hydrated and sent for sequencing on an Illumina HiSeq 2x250. The pooled libraries were re-run on the same Illumina HiSeq after the results from the first sequencing run were obtained, due to the low number of contigs obtained from the first run, and the lower than expected number of overall reads that the first run achieved.
Assemblies
Post sequencing quality analysis for each library was carried out using FastQC, and remaining Illumina adapters were removed using Trimmomatic (Bolger, Lohse and Usadel, 2014). Prior to assembly the dataset was filtered for potential mitochondrial reads against a database of Dipteran mitochondrial genomes, using dc-megablast under low stringency conditions, minimizing loss of target reads. Putative mitochondrial reads from this step were extracted using FastqExtract3 and subjected to genome assembly using three different methods: Ray (Boisvert, Laviolette and Corbeil, 2010), SPAdes (Bankevich et al., 2012) and IDBA. Assemblies from each procedure were imported into Geneious (Kearse et al., 2012) and de novo assembled to produce super-contigs from primary assemblies, which generally produced more and longer contigs than any one assembler on its own.
Gene predictions were obtained using the MITOS server (Bernt et al., 2013), based on existing annotations for a range of invertebrate mitochondrial genomes. The annotations were manually edited to obtain the correct start and (full or partial) stop codons, selecting among possible alternative start and stop codons by minimizing the intergenic spaces and overlap of genes. For simplicity, once several full length contigs had been annotated in this way, these were used as a reference genome for the rest of the unannotated genomes. Homology of the start and stop codons was tested by alignment for each gene using Muscle (Edgar, 2004). Full mitochondrial genomes were circularised.
A database of CO1 barcodes was generated for the same specimens included in the MMG libraries, to be used as ‘baits’ for identification of contigs. All contigs over 2kb were identified by blasting against the baits, with a positive identification if >98% match was found. Some contigs were unable to be identified in this way because they lacked CO1, but additional identifications were obtained by placement on a phylogenetic tree of identified and unidentified contigs. Each gene was extracted and aligned separately using the Muscle aligner, then concatenated with the SeqCat.pl script, before a tree was run in RAxML (Stamatakis, 2014) on the Cipres Science Gateway (Miller, Pfeiffer and Schwartz, 2010).
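The bait-matching rule described above (contigs over 2 kb, identified when a CO1 bait matches at more than 98%) can be sketched as a filter over BLAST tabular output. This is a minimal illustration rather than the script used in the thesis: it assumes the standard 12-column tabular format with the contig as query and the bait as subject, and the contig names, bait names and hit values are hypothetical.

```python
def identify_contigs(blast_lines, contig_lengths, min_length=2000, min_identity=98.0):
    """Assign each long contig to its best-scoring CO1 bait above the identity cut-off.

    blast_lines: iterable of tab-separated hits (qseqid, sseqid, pident, ..., bitscore).
    contig_lengths: dict mapping contig name -> length in bp.
    """
    best = {}  # contig -> (bitscore, bait)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        contig, bait = fields[0], fields[1]
        identity, bitscore = float(fields[2]), float(fields[11])
        if contig_lengths.get(contig, 0) <= min_length:
            continue  # only contigs over 2 kb are considered
        if identity <= min_identity:
            continue  # require a match above 98%
        if contig not in best or bitscore > best[contig][0]:
            best[contig] = (bitscore, bait)
    return {contig: bait for contig, (_, bait) in best.items()}


# Hypothetical hits: contig_2 falls below the identity cut-off, contig_3 is too short.
hits = [
    "contig_1\tbait_Xylota\t99.5\t650\t3\t0\t1\t650\t1\t650\t0.0\t1180",
    "contig_1\tbait_Blera\t91.0\t650\t58\t0\t1\t650\t1\t650\t1e-150\t800",
    "contig_2\tbait_Blera\t97.2\t650\t18\t0\t1\t650\t1\t650\t0.0\t1050",
    "contig_3\tbait_Psilota\t100.0\t650\t0\t0\t1\t650\t1\t650\t0.0\t1200",
]
lengths = {"contig_1": 15200, "contig_2": 9800, "contig_3": 1500}
assignments = identify_contigs(hits, lengths)
print(assignments)
```

Contigs left unassigned by this filter correspond to those that were subsequently identified by tree placement.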
Taxa with apparently misplaced positions based on the current taxonomy (Young et al., 2016) were investigated for potential chimera formation during assembly from mixed samples. All mitochondrial CO1 sequences from the potentially chimeric genera available from GenBank (Clark et al., 2015) and BOLD (Ratnasingham et al., 2007) were obtained, and aligned with the mitogenome CO1. Alongside this, all other genes were aligned to produce separate gene trees. Major discrepancies in the phylogenetic position of the focal sequences in the different gene trees were taken as evidence for chimera formation.
Phylogenetic analysis
Phylogenetic trees were generated from the 13 protein coding and two rRNA genes from all identified contigs. Maximum likelihood analysis was conducted using RAxML version 8.2.12 (Stamatakis, 2014) on the Cipres Science Gateway (Miller, Pfeiffer and Schwartz, 2010) and visualised in FigTree. A total of five different partitioning schemes were applied, performing the tree searches with each of the 15 genes partitioned separately, or partitioned into genes on the forward and reverse strand. Both of these schemes were also run with additional partitioning for each of the three codon positions. A fifth analysis was run only with three codon partitioning. The GTR+I+G model was used for all of the analyses, estimating model parameters for each partition separately.
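As an illustration of how the per-gene and per-gene-plus-codon schemes could be written down, the sketch below builds partition definitions in the style of a RAxML partition file (one line per partition, with a `\3` stride for codon positions). It is an assumption-laden example: the gene lengths are toy values, the function name is hypothetical, and the forward/reverse strand schemes are not covered.

```python
def raxml_partitions(genes, by_codon=False):
    """Build partition lines for a concatenated alignment.

    genes: list of (name, alignment_length, is_protein_coding) in concatenation order.
    by_codon: if True, protein coding genes are split into three codon positions;
              rRNA genes are always kept as a single partition.
    """
    lines = []
    start = 1
    for name, length, is_coding in genes:
        end = start + length - 1
        if by_codon and is_coding:
            # Codon positions 1, 2 and 3 take every third site from offsets 0, 1, 2.
            for pos in range(3):
                lines.append(f"DNA, {name}_pos{pos + 1} = {start + pos}-{end}\\3")
        else:
            lines.append(f"DNA, {name} = {start}-{end}")
        start = end + 1
    return lines


# Toy lengths for illustration only (real genes are hundreds of bp long).
toy_genes = [("nad2", 9, True), ("rrnL", 5, False)]
gene_scheme = raxml_partitions(toy_genes)
gene_codon_scheme = raxml_partitions(toy_genes, by_codon=True)
print("\n".join(gene_codon_scheme))
```

With the 15 genes of this study, the gene-only scheme would give 15 partitions, and the gene-and-codon scheme would give three partitions for each of the 13 protein coding genes plus one for each rRNA gene.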
The RAxML tree was used to estimate the rate of evolution and the relative divergence dates of the nodes on the tree. This was done using least squares analysis (To et al., 2016), with the most recent common ancestor (the root) given a divergence time of 0 and the tips a time of 1. This allowed the calculation of relative divergence dates of nodes. The tree was rooted using Epalpus signifer, and the other outgroup sequences were removed as only the divergence times of the Syrphidae family were of interest. Confidence intervals were produced for the relative divergence times by running a simulation of 1000 trees.
Finally, a Bayesian tree was created using BEAST v1.8.4 (Drummond et al., 2012) under a molecular clock model. An xml file was created in BEAUti, with 15 partitions, one for each of the 13 protein coding genes and two rRNA genes. The best evolutionary model was selected for each gene alignment in jModelTest (Darriba and Posada, 2014) based on the AIC value. As a result, for all but one of the partitions the GTR+I+G evolutionary model was used. For nad4l HKY+I+G was used. A lognormal relaxed clock was applied, with a birth and death model, which allows each lineage to speciate or go extinct at a fixed rate. Unlike the evolutionary models, the tree prior and clock model were allowed to vary between the partitions. Finally, the syrphid ingroup was constrained as monophyletic, ensuring that the tree was rooted by the outgroup node. The BEAST analysis was run on the Cipres Science Gateway for 50M generations, for 150 hours. The log file was visualised in Tracer to determine the burn-in. TreeAnnotator was used to summarise the sampled trees and determine posterior probability values, to generate a maximum clade credibility tree, keeping target node heights. The tree was visualised in FigTree and the relative clock scale bar shown on the tree, along with the relative divergence dates of the nodes.
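Model choice by AIC, as used above for each gene alignment, amounts to comparing AIC = 2k − 2 ln L across candidate models and taking the smallest value. The sketch below shows the calculation only; the log-likelihoods and parameter counts are hypothetical and are not output from jModelTest.

```python
def aic(log_likelihood, n_parameters):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_parameters - 2 * log_likelihood


def best_model_by_aic(candidates):
    """Return (best model name, dict of AIC scores).

    candidates: dict mapping model name -> (log-likelihood, number of free parameters).
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
    return min(scores, key=scores.get), scores


# Hypothetical values for a short gene: the extra GTR rate parameters
# do not improve the likelihood enough to offset their penalty.
candidates = {
    "HKY+I+G": (-1520.0, 6),
    "GTR+I+G": (-1518.5, 10),
}
best, scores = best_model_by_aic(candidates)
print(best, scores)
```

A simpler model can be preferred in this way for short alignments, which contain too few sites to support the additional rate parameters.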
Larval life history evolution
Life history information for larval stages of Syrphidae species was available for the UK (Ball and Morris, 2015), and so CO1 barcodes from UK syrphids were added to the phylogeny. This was done using sequences from 59 specimens collected over the summer of 2016 in East Anglia, which were sequenced using Illumina barcoding to give the 418 bp barcoding region of CO1 (Chapter 3). These sequences were aligned with the CO1 gene from the 93 mitochondrial genomes using Muscle (Edgar, 2004), and this alignment was concatenated together with the other 12 protein coding gene alignments and the two rRNA alignments. This concatenated alignment was used to run a RAxML tree on the Cipres Science Gateway. The mitochondrial genome tree was used as a backbone to constrain the topology, so that the barcodes were added into the existing topology. This meant that the large amount of missing data would not affect the tree topology.
The tree was visualised in FigTree and then used to reconstruct the larval life history evolution. The larval life histories of the mitochondrial genome specimens were generalised to genus level life histories. The evolution of these traits was mapped on to the tree in R using the package phytools (Revell, 2012), in which simmap was used to simulate how the character traits mapped onto the tree. The simulation was run 1000 times, and the output trees were used to map the posterior probability of a larval life history occurring at each node.
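The final summary step, turning repeated stochastic maps into a posterior probability for each life history at each node, is a simple frequency count. The sketch below shows that step in Python rather than in R with phytools, and is not the thesis code; the node labels, state names and the four toy simulations are hypothetical.

```python
from collections import Counter


def node_state_posteriors(simulations):
    """Estimate the posterior probability of each state at each node.

    simulations: list of dicts, one per stochastic map, mapping node label -> state.
    Returns a dict mapping node label -> {state: fraction of simulations}.
    """
    if not simulations:
        raise ValueError("at least one simulation is required")
    counts = {}
    for sim in simulations:
        for node, state in sim.items():
            counts.setdefault(node, Counter())[state] += 1
    n = len(simulations)
    return {
        node: {state: c / n for state, c in counter.items()}
        for node, counter in counts.items()
    }


# Four toy simulations over two internal nodes (the thesis used 1000).
sims = [
    {"node_1": "predatory", "node_2": "saprophagous"},
    {"node_1": "predatory", "node_2": "saprophagous"},
    {"node_1": "predatory", "node_2": "phytophagous"},
    {"node_1": "saprophagous", "node_2": "saprophagous"},
]
posteriors = node_state_posteriors(sims)
print(posteriors["node_1"])
```

With 1000 simulations, each probability is resolved to the nearest 0.001, which is adequate for displaying the dominant state at each node.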
Results
Assemblies
By combining the results of the two sequencing runs, a total of 94 contigs were obtained. The breakdown of this across the two runs can be seen in table 1, with the second run adding a total of 30 mitochondrial genomes to the dataset. It also increased the length of 28 contigs obtained in the first run, adding more genes to the dataset. The contigs ranged in length from 2,655bp to 17,574bp, with an average length of 12,493bp. 60 of the contigs had a length of over 10,000bp, and 58 of the contigs contained all 13 protein coding genes.
Library | First run contigs | First & second run | Tree identified contigs
1 | 9 | 10 | 0
2 | 7 | 15 | 2
3 | 9 | 11 | 0
4 | 8 | 12 | 2
5 | 7 | 12 | 1
6 | 10 | 16 | 0
7 | 9 | 13 | 0
Total | 59 | 89 | 5
Table 1. The number of identified contigs recovered from each of the seven libraries in the first sequencing run and once the repeat sequencing run data had been added. The third column shows the extra contigs which were identified based on tree placement in each library.
Overall, 94 identified contigs were obtained. Of these, 89 contained CO1 and thus could be identified with the CO1 reference database, while others were identified by their placement on the phylogenetic tree. A total of 81 of the contigs belonged to species of syrphid, and 13 were outgroup species. Three contigs with improbable phylogenetic placements were investigated for chimeric structure based on incongruence among gene trees. Two of them, Nausigaster and Lejota, were consequently removed from the analysis, whereas no clear evidence for a chimeric sequence was obtained for the third, Milesia, which was retained (Supplementary Material).
Gene | Number of sequences
nad2 | 81
cox1 | 90
cox2 | 90
ATP8 | 84
ATP6 | 82
cox3 | 80
nad3 | 79
nad5 | 75
nad4 | 73
nad4l | 69
nad6 | 67
cytb | 67
nad1 | 67
rrnL | 65
rrnS | 51
Table 2. The number of sequences present in each of the 13 protein coding and two rRNA gene datasets. This varies between genes due to the incompleteness of many of the contigs.
Phylogenetic analysis
The five RAxML trees obtained under different partitioning schemes were compared for topology and support of clades expected based on the current higher-level taxonomy (table 3). All of the trees found the Syrphidae family to be monophyletic. At the subfamily level, all found Microdontinae to be monophyletic, and sister to the rest of the Syrphidae family. In all trees Eristalinae was found to be paraphyletic with respect to a monophyletic Syrphinae that was embedded in Eristalinae near Rhinginii. Out of 18 tribes initially included in the MMG, sequencing for four of them (Cheilosiini, Paragini, Pipizini and Spheginobaccha) was not successful. A further four tribes (Callicerini, Merodontini, Sericomyiini and Toxomerini) were only represented by a single mitochondrial genome. Of the 10 tribes remaining, all RAxML trees found two Eristalinae tribes, Brachyopini and Milesiini, to be non-monophyletic. Bachini was found to be paraphyletic, as has been found in previous studies (Mengual, Ståhls and Rojo, 2015; Young et al., 2016), and Toxomerini was embedded within Syrphini (table 3). The only difference between the trees at tribe level was that the monophyletic Rhinginii clade differed in the levels of support, with partitioning by gene and codon giving the highest bootstrap value of 81.
Clade | RAxML fwd+reverse | RAxML genes | RAxML codon only | RAxML genes and codons | RAxML fwd+reverse and codon
Syrphidae monophyletic | YES (100) | YES (100) | YES (100) | YES (100) | YES (100)
Syrphinae monophyletic within paraphyletic Eristalinae (Mengual, Ståhls and Rojo, 2015; Young et al., 2016) | YES (93) | YES (95) | YES (85) | YES (97) | YES (85)
Microdontinae sister to the rest of Syrphidae (Young et al., 2016) | YES (100) | YES (100) | YES (100) | YES (100) | YES (100)
Bachini polyphyletic (Mengual, Ståhls and Rojo, 2015; Young et al., 2016) | YES | YES | YES | YES | YES
Brachyopini monophyletic | NO | NO | NO | NO | NO
Ceriodini monophyletic | YES (100) | YES (100) | YES (100) | YES (100) | YES (100)
Eristalini monophyletic | YES (100) | YES (100) | YES (100) | YES (100) | YES (100)
Eumerini monophyletic (Mengual, Ståhls and Rojo, 2008) | YES (100) | YES (100) | YES (100) | YES (100) | YES (100)
Milesiini monophyletic | NO | NO | NO | NO | NO
Rhinginii monophyletic (Mengual, Ståhls and Rojo, 2015) | YES (69) | YES (74) | YES (43) | YES (81) | YES (48)
Toxomerini embedded within Syrphini as in (Mengual, Ståhls and Rojo, 2015) | YES | YES | YES | YES | YES
Volucellini monophyletic (Mengual, Ståhls and Rojo, 2015) | YES (100) | YES (100) | YES (100) | YES (100) | YES (100)
Parhelophilus monophyletic | NO | NO | NO | NO | NO
Allograpta monophyletic | YES (92) | YES (95) | YES (89) | YES (97) | YES (87)
Criorhina monophyletic | NO | YES (79) | YES (83) | YES (64) | NO
Ocyptamus monophyletic | NO | NO | NO | NO | NO
Table 3. The five different partitioning methods used in the RAxML analysis, showing the monophyly of different clades on the trees, in relation to those found in two recent phylogenetic studies of the Syrphidae family. If a clade is monophyletic then the bootstrap support value is shown in brackets.
At the genus level, there were seven genera with more than one mitochondrial genome present on the tree, and all but three of them were monophyletic with consistent support on all trees. Allograpta was monophyletic but with differing clade support, with the genes and codon partitioned tree having the highest support (table 3). Parhelophilus and Ocyptamus were paraphyletic on all trees. The two trees partitioned by forward and reverse strands also found Criorhina to be paraphyletic. There were differing levels of support for the Criorhina clade, with the codon partitioned tree having the highest support (bootstrap = 83). Overall the two forward and reverse strand partitioned trees had lower support than the gene and codon only partitioned trees.
Selecting the 15 gene partitioned tree as a better partitioning scheme was supported by comparing the bootstrap values across the five trees. The two forward and reverse partitioned trees had 46.67% of branches with bootstrap > 80, compared to 51.69% in the gene partitioned tree and 51.11% for the gene and codon partitioned tree. Overall the gene partitioned tree had the highest bootstrap values, with an average of 70, and 30% with support values of 100. This was compared to 26.67% of bootstrap values = 100 for the gene and codon partitioned tree and for both the forward and reverse partitioned trees. The codon only tree had an even lower percentage of bootstrap values = 100 (23.33%). The overall higher support on the 15 gene partitioned tree, along with the high support for specific clades, meant that it was selected as the final tree, which is shown in figure 1.
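The support statistics compared above (mean bootstrap, share of branches with bootstrap > 80, and share with bootstrap = 100) can be computed directly from a Newick tree with support values as internal node labels. The following is a minimal sketch under that assumption; the regular expression approach and the toy tree are illustrative and are not the procedure used in the thesis.

```python
import re


def support_summary(newick, threshold=80):
    """Summarise bootstrap support values stored as internal node labels.

    Support values are taken to be the numbers that directly follow a closing
    bracket, e.g. the 95 in '(A:0.1,B:0.2)95:0.05'.
    """
    values = [float(v) for v in re.findall(r"\)(\d+(?:\.\d+)?)", newick)]
    if not values:
        raise ValueError("no support values found in tree string")
    n = len(values)
    return {
        "n_branches": n,
        "mean": sum(values) / n,
        "pct_above_threshold": 100 * sum(v > threshold for v in values) / n,
        "pct_equal_100": 100 * sum(v == 100 for v in values) / n,
    }


# Toy tree with three supported branches (100, 60 and 90).
toy_tree = "((A:0.1,B:0.1)100:0.05,((C:0.1,D:0.1)60:0.02,E:0.1)90:0.03);"
summary = support_summary(toy_tree)
print(summary)
```

Running the same function on each of the five RAxML trees would give directly comparable figures for the partitioning schemes.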
The final RAxML tree in figure 1 shows the Syrphidae family as monophyletic, with the subfamily Microdontinae sister to the other two subfamilies. The subfamily Syrphinae is monophyletic, within a paraphyletic Eristalinae. The tribe level relationships on the tree are indicated in figure 1. Two tribes in Eristalinae, Milesiini and Brachyopini, are paraphyletic and found across the subfamily. Merodontini is found to be sister to the rest of the Eristalinae, with Ceriodini and then Volucellini branching off afterwards. Eumerini and Callicerini form a clade, as do the Eristalini and a large portion of the Milesiini sequences. Rhinginii is found to be sister to the Syrphini, along with one of the Milesiini sequences. Within the Syrphinae, the Bachini tribe is sister to the rest of the subfamily, although one Bachini sequence is placed within the Syrphini tribe. The Toxomerini tribe is found to be within Syrphini, making that tribe paraphyletic.
There are not many genera that contain more than one sequence on the tree, but of the seven that do, five are monophyletic on the tree in figure 1. The genus Ocyptamus has been recently reorganized and the sequences from the genera Hybobathus, Nuntianus and Victoriana are thought to belong within Ocyptamus. In figure 1 these sequences do form a clade with one of the Ocyptamus sequences; however, the other Ocyptamus sequence forms a clade with Orphnabaccha, which is sister to the Ocyptamus clade and Toxomerus. The other genus which is found to be paraphyletic is Parhelophilus, where the two sequences form a clade with Lejops.
Figure 1. A rooted phylogeny of the Syrphidae family made using RAxML, with 15 gene partitions. The bootstrap values are shown on the branches. The three subfamilies are coloured, with Syrphinae in green, Eristalinae in pink and Microdontinae in orange. The outgroup is made up of closely related Diptera and is shown in black. The tribes are labelled next to the tree, with each tribe coloured individually. Non-monophyletic tribes appear more than once across the tree. Scale bar: 0.08.
The least squares analysis gave an overall evolutionary rate of 0.3 with confidence intervals [0.29, 0.31]. The relative divergence times for each node are shown in figure 2, and the confidence intervals indicate high confidence in these times. The Microdontinae diverge from the rest of the Syrphidae family early on (0.0616), and the early divergences within this group are also before any of the divergence times in the rest of the Syrphidae. The Syrphinae subfamily diverged from Eristalinae at a relative time of 0.35. There is a great deal of variation in genus level nodes, with the two Psilota species having a most recent common ancestor at 0.9, and so only diverging very recently. This is unlike other genera such as Ceriana, for which four species are present and the two clades diverge at 0.4694, a much earlier split.
The Bayesian tree produced in BEAST (figure 3) found the same subfamily level relationships as the RAxML tree in figure 1. The relative divergence dates are shown on the tree (figure 3) and can be compared to the least squares analysis. The BEAST analysis gives a much faster rate of evolution (1.93), given as the mean birth rate, than the least squares analysis. The scale of evolution on the tree is reversed in the BEAST analysis, with the root at 1 rather than 0. The divergence dates of the nodes put the divergence of Eristalinae and Syrphinae at 0.51, which is later than in the least squares analysis. Once again Microdontinae diverged from the rest of the Syrphidae family early on, at a relative time of 0.96. The genus level divergence times show the same pattern as the least squares analysis, with the two Psilota species diverging at 0.04, very close to the tips of the tree.
Figure 2. Tree with results of least squares dating performed on the RAxML tree rooted with Epalpus signifer. The relative node ages are displayed on the tree, with the most recent common ancestor at time point 0 and the tips at time point 1. The confidence intervals are shown as bars and were calculated from 1000 tree simulations.
Figure 3. Bayesian tree created using BEAST v1.8.4 using the 13 protein coding genes and two rRNA genes, partitioned by gene. The scale bar shows the relative evolutionary time scale of the tree, calculated using a relaxed clock model, and the relative divergence dates are shown on the tree, with the most recent common ancestor at time point 1 and the tips at time point 0. The three subfamilies are shown in different colours: orange = Microdontinae, green = Syrphinae and pink = Eristalinae.
Larval life history evolution
The RAxML tree with added CO1 barcodes from UK species is shown in figure 4. There were no species from the subfamily Microdontinae in the UK sample, but there were species from both Eristalinae and Syrphinae, all of which were placed in the correct subfamily on the tree. Including the barcodes added some genera which were not present in the mitogenome-only tree, including Volucella and Eupeodes. It also increased the number of species in some genera, such as Cheilosia and Eristalis. All genera with multiple species were found to be monophyletic, except for Parhelophilus and Syrphus. Parhelophilus was paraphyletic, as in the mitogenome-only tree, and another species, Anasimyia lineata, was also found in the same clade. All three of the Syrphus barcodes and the one Syrphus mitogenome came out in a single clade, but the barcode for Xanthandrus comtus was also found in that clade.
Figure 4. Maximum likelihood tree created in RAxML with 15 gene partitions. A backbone tree of mitochondrial genomes was used to constrain the topology, with CO1 barcodes from UK Syrphidae species added on to the tree. Barcodes are labelled as ‘barcode’ on the tree. The three subfamilies are coloured on the tree, orange = Microdontinae, green = Syrphinae and purple = Eristalinae. Bootstrap values are shown on the branches.
Figure 5. Tree showing the evolution of larval life histories using stochastic character mapping. Five life histories are shown on the tree, represented by different colours. Turquoise = myrmecophiles, purple = saprophagous, green = insectivorous, blue = phytophagous and yellow = fungus feeding. The posterior probability for each trait over 1000 simulations is shown for each node in the form of a pie chart.
The trait mapping analysis is shown in figure 5, with the five broad categories of larval life history mapped onto the phylogeny. In almost all of the simulations saprophagy is recovered as the ancestral state, with Microdontinae breaking away early from the rest of the Syrphidae family and evolving into myrmecophiles. However, this was not found in all simulations, with a few finding myrmecophily as the ancestral state. There appears to be one large radiation of predatory life histories, occurring in Syrphinae, and then two smaller isolated radiations into predation in Volucella and Heringia. In a few simulations the ancestral nodes between Volucella and Heringia were also predatory, suggesting a single origin of this life history for these two genera. The fourth larval life history, phytophagy, appears twice on the tree, evolving once in the genus Cheilosia and once in the clade of the bulb flies, Merodon equestris and Eumerus funeralis. A single member of Cheilosia present in this phylogeny, Cheilosia soror, has switched from phytophagy to a fungal diet, but too few species are sampled here to determine whether this is an isolated event or a small radiation into a new niche within this genus.
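The pie charts in figure 5 summarise, for each node, how often each life history was reconstructed across the simulations. The Python sketch below shows that summary step in isolation; it is a simplified illustration rather than the software used for the analysis. The function name is an assumption, and the 950/50 split in the example is invented purely to show the calculation and is not a result from this study.

```python
from collections import Counter


def node_state_probabilities(sampled_states):
    """Return the fraction of simulations in which each state was
    reconstructed at one node (the proportions a pie chart would show)."""
    counts = Counter(sampled_states)
    total = len(sampled_states)
    return {state: n / total for state, n in counts.items()}


# Hypothetical states sampled at a single node over 1000 simulations.
# These counts are invented for illustration only.
node_samples = ["saprophagous"] * 950 + ["myrmecophile"] * 50

probs = node_state_probabilities(node_samples)
print(probs)
```

A node where one state accounts for nearly all simulations corresponds to a pie chart dominated by a single colour, whereas nodes with mixed proportions indicate uncertainty in the reconstructed state.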
Discussion
Taxa choice and Assemblies
Specimens were selected from a broad geographic range, to produce a phylogeny which reflects the relationships of the whole family and not just a specific geographic region. It was also important to select specimens for the phylogeny based on representation of the subfamilies and tribes within the Syrphidae family. Both of these aims were achieved by collaborating with the Skevington lab in Canada in the CDC, which has expert taxonomic knowledge of this family and the means to obtain specimens. Out of the 207 specimens sampled, contigs were recovered for 94. This is a lower number than expected, mostly due to low DNA concentrations. Each library was designed to contain over 200 ng of DNA, so that it was suitable for TruSeq Nano library preparation (Illumina, 2015). However, the DNA amounts of the libraries were close to or below the minimum amounts required, especially after quality assessment. This may have been exacerbated by DNA transport, with any DNA degradation reducing the long DNA fragments needed for MMG. Sequencing the libraries again increased the number of mitochondrial genomes obtained by 30, suggesting there was a stochastic loss of data. It is important to note that a large number of mitochondrial genomes were obtained for a cost which would not have been feasible without MMG. Although this study did not obtain all of the mitochondrial genomes it set out to, it increases the number of syrphid genomes available from eight (Sonet et al., 2019) to 92 and thus makes an important contribution to the public database of syrphid DNA. It also provides data for the most complete and well supported Syrphidae phylogeny to date (Skevington and Yeates, 2000; Ståhls et al., 2003; Mengual, Ståhls and Rojo, 2015; Young et al., 2016).
Not all of the contigs could be identified using CO1 baits. To try to increase the number of identified contigs, a tree was built containing all the contigs over 2 kb; however, this was only suitable for contigs which belonged to a genus already containing an identified specimen. In the future, sequencing more mitochondrial genes to use as baits may enable more contigs to be identified. For other groups, it may be possible to use genes available on GenBank as baits, but these were not available for the syrphid species sequenced here.
The preliminary tree showed three specimens in unlikely positions based on prior knowledge of the Syrphidae family (Young et al., 2016), with two placed in different subfamilies from those to which they are assigned. It was important to investigate these sequences empirically, as the placement could have been due to the evolution of the mitochondrial data. The individual gene trees showed that for Nausigaster and Lejota there was a clear split along the genome as to where they were placed on the tree. This showed that in both cases the contig was a chimera, which was causing the unexpected placement. Chimeras can be an issue in MMG, as many specimens from different species are included in a single pool of DNA. The likelihood of them occurring is minimised by including only one representative of each genus in each pool, so that closely related specimens are not in the same pool.
Phylogenetic analysis
The gene partitioned trees both performed better in terms of clade support and overall bootstrap values than the other strategies. However, the differences between all the trees were quite small, suggesting that the high-quality alignments and large amount of data were robust enough to be largely unaffected by different partitioning schemes. The gene and codon partitioned tree had lower support than the gene partitioned tree, suggesting that the gain in information did not outweigh the cost of increasing the number of partitions, which tripled from 15 to 45. This is a large number of partitions and it is not unexpected that this tree therefore had lower support. There is a balance between finding an evolutionarily sound partitioning scheme and not over-partitioning the data, which can result in a higher rate of error.
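To make the step from 15 to 45 partitions concrete, the Python sketch below writes partition definitions in the plain-text style read by RAxML (`DNA, name = start-end`, with `\3` selecting every third site for a codon position). It is an illustration under stated assumptions, not the partition file used in this study: the gene names and the uniform length of 300 bases are placeholders, and the two helper functions are introduced here only for the example.

```python
def gene_partitions(gene_lengths):
    """One partition per gene: 'DNA, name = start-end'."""
    lines, start = [], 1
    for name, length in gene_lengths:
        end = start + length - 1
        lines.append(f"DNA, {name} = {start}-{end}")
        start = end + 1
    return lines


def gene_codon_partitions(gene_lengths):
    """Three partitions per gene, one per codon position,
    written with the 'start-end\\3' stride notation."""
    lines, start = [], 1
    for name, length in gene_lengths:
        end = start + length - 1
        for pos in range(3):
            lines.append(f"DNA, {name}_pos{pos + 1} = {start + pos}-{end}\\3")
        start = end + 1
    return lines


# Placeholder alignment layout: 15 genes of 300 bases each (hypothetical).
genes = [(f"gene{i}", 300) for i in range(1, 16)]

by_gene = gene_partitions(genes)
by_gene_and_codon = gene_codon_partitions(genes)

print(len(by_gene), len(by_gene_and_codon))
print(by_gene[0])
print(by_gene_and_codon[1])
```

Each added partition carries its own set of model parameters to estimate from a smaller slice of the alignment, which is one way to see why tripling the number of partitions can reduce rather than improve support.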
The placement of Microdontinae as sister to the rest of Syrphidae was established by Skevington and Yeates (2000), using 12S and 16S sequences to look at broad relationships between the subfamilies in Syrphidae. The topology of that tree is congruent with the topology found in this study, with Microdontinae sister to the rest of the Syrphidae, and Syrphinae monophyletic within a paraphyletic Eristalinae. Other studies have also looked at these relationships (Ståhls et al., 2003); however, the small amount of data and the small number of specimens meant that relationships between Syrphinae and Eristalinae could not be resolved. This study supports the subfamily level topology found by Skevington and Yeates (2000), and firmly places Syrphinae within Eristalinae. This is the most complete phylogeny of the Syrphidae family to date, with 83 species and 15 genes, based on 14,333 bases of DNA.
In this study three out of the ten tribes represented by more than one species were found to be polyphyletic, and one was found to be paraphyletic. This was the case for all of the RAxML trees. The tribe Syrphini is paraphyletic, with the one specimen from Toxomerini nested within it. Toxomerini was also found to be embedded in Syrphini in the study by Mengual, Ståhls and Rojo (2015), which looked at relationships within Syrphinae. The third Syrphinae tribe in this study, Bacchini, was found to be paraphyletic, in agreement with Mengual, Ståhls and Rojo (2015) and Young et al. (2016). The Milesiini tribe was heterogeneous and its members were found across Eristalinae, as in Mengual, Ståhls and Rojo (2015), where Milesiini were found to be polyphyletic and distributed among the rest of Eristalinae. The Brachyopini tribe also formed multiple clades within Eristalinae, both in this study and in Mengual, Ståhls and Rojo (2015). The currently accepted tribal level relationships in Syrphidae are based on morphological characters, and it is clear that further research with more species is required to review these relationships.
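Whether a tribe is monophyletic on a given tree can be checked mechanically: a group is monophyletic if some clade contains exactly its members and nothing else. The Python sketch below illustrates that test on a toy rooted tree written as nested tuples. The tree, the tip labels and the function names are all hypothetical and do not represent the topology recovered in this study.

```python
def tips(tree):
    """All tip labels under a node; a tip is a string, a clade is a tuple."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(tips(child) for child in tree))


def clades(tree):
    """Yield the tip set of every internal node in the tree."""
    if isinstance(tree, str):
        return
    yield tips(tree)
    for child in tree:
        yield from clades(child)


def is_monophyletic(tree, group):
    """True if some clade contains exactly the members of `group`."""
    group = set(group)
    if len(group) == 1:
        return group <= tips(tree)
    return any(clade == group for clade in clades(tree))


# Toy rooted tree with hypothetical tips from three tribes (a, b, c).
toy_tree = ((("a1", "a2"), "b1"), ("b2", "c1"))

print(is_monophyletic(toy_tree, {"a1", "a2"}))  # tribe a forms its own clade
print(is_monophyletic(toy_tree, {"b1", "b2"}))  # tribe b is split across clades
```

Distinguishing paraphyly from polyphyly needs an extra step beyond this test, for example checking whether the non-monophyletic group becomes a clade once the lineages nested within it are added.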
In this study Rhingiini were found to be sister to the Syrphinae subfamily. In Mengual, Ståhls and Rojo (2015) and Young et al. (2016) Pipizini were found to be sister to Syrphinae; this is a tribe with aphidophagous larvae, like many Syrphinae species. However, this tribe was not recovered from the libraries in this study, and so was not represented in this dataset. This study is therefore limited in its ability to resolve the relationships between the Eristalinae and Syrphinae subfamilies, and to determine the sister group to Syrphinae. Further research into these tribal level relationships should focus on obtaining mitochondrial genomes for the tribes not recovered here, so that the gaps in the tribal level relationships can be resolved. A focus on tribes with divergent larval life histories, such as the Pipizini, would be especially informative.
The least squares dating and BEAST analyses show different rates of evolution across the tree. This shows the importance of using a fossil calibrated phylogeny, so that the rates can be calculated with greater accuracy. In this study only relative rates were used, because data on the Syrphidae fossil record are not readily available, and many recorded fossils are difficult to verify (Popov, 2015). Further research including fossil calibration would allow these relative dates to be converted into time calibrated dates, as has been done for other groups (Espeland et al., 2018), and would give more information on the evolution of this family. Despite this, the two analyses show the same patterns for the divergence of different clades. In both analyses the subfamily Microdontinae splits early on from the rest of the Syrphidae, and as a result the evolutionary distances between groups in this subfamily are greater than in the rest of the family. This supports previous calls to elevate Microdontinae to family level (Thompson, 1972). This has not been disputed by previous phylogenies (Young et al., 2016), but the addition of relative divergence times adds support to the theory that the divergence from the rest of Syrphidae is deep enough to warrant family status. Both trees also show that the divergence of Syrphinae from Eristalinae is later than many of the divergences within Eristalinae, providing further evidence that the relationships within these subfamilies are complex and may not be best explained by the current taxonomy. These relative rates may also go some way to explaining the inability of CO1 to separate some species, as was found to be the case in Chapter 2. Some of the genus level splits in the tree occur close to the tips, suggesting that species within these genera have only recently diverged in their mitochondrial DNA.
Larval life history evolution
Adding in CO1 barcodes from UK species allowed the larval life histories to be examined in more detail, as these are generally species for which more is known about their life histories (Ball and Morris, 2015). Alongside this, it added in genera that were not included in the mitogenome tree, including Volucella, where a small radiation of predatory lifestyles has occurred. The tree showed that the predatory lifestyle has radiated within Syrphinae, but that there was also a smaller, separate evolution of a predatory lifestyle in Volucella. This is not the only larval life history to have evolved more than once, with phytophagous larvae evolving in two separate places within Eristalinae. The stochastic mapping analysis also placed the saprophagous life history as the ancestral state on the tree, with Microdontinae splitting off from the rest of the family and radiating into a new life history niche. This analysis gives an idea of the ways in which a robust phylogeny such as this one can be used to look in more detail at the evolution of this family of pollinators in the future.
Conclusion
This study resulted in 81 new syrphid mitochondrial genomes, massively increasing the amount of genetic data available for this family, which will continue to benefit syrphid research beyond this study. This is despite sequencing issues, which mean there is still a need to expand the list of mitochondrial genomes for tribes that were not captured here. The phylogenetic tree produced in this study is the most complete tree of this family to date, with more genetic data and more species than have been included previously. The topology supports the findings of previous studies and finds Microdontinae sister to the rest of the Syrphidae, and Syrphinae monophyletic within a paraphyletic Eristalinae. The tribal level relationships found here show several tribes are polyphyletic, and suggest the need for a revision of the tribes in this family. Finally, the relative ages of divergence found here support an elevation of Microdontinae to family level, and highlight the need for a fossil calibrated phylogeny. This topology has implications for the evolution of larval life histories, supporting multiple points where predatory larvae have evolved, and supporting an ancestral saprophagous lifestyle. Having this robust phylogeny also enables further research to incorporate phylogenetic diversity into studies of community diversity, adding another layer to the data available for monitoring and conservation of these important pollinators.
Chapter 2: Developing and testing a Syrphidae CO1 reference database
Introduction
In the UK a recent focus on pollinators has resulted in a national monitoring programme, with the aim of monitoring pollinator populations long-term. This will enable the detection of changes which may indicate conservation concerns (Carvell et al., 2016). Monitoring in the UK is carried out mainly by volunteers (Barlow et al., 2015; Pescott et al., 2015; Newson et al., 2016). However, reliance on volunteers can be challenging when trying to monitor a diverse or cryptic group. Both of these issues apply to monitoring pollinators, which are made up of many different insect groups, and which contain many species which are difficult to identify to species level. The current situation relies on paid expert taxonomists identifying bees and syrphid flies from pan trap samples, which is expensive and time intensive. The Syrphidae family are less well studied than bee pollinators, but Diptera are thought to be the second most important group of pollinators in the UK, and so monitoring this family provides an indicator for a super-diverse flower visiting group.
There is increasing interest from stakeholders and policy makers in new technology and research, and how it can enable us to address environmental issues in new and innovative ways. Several studies have used DNA to monitor animals and plants, showing the potential of DNA to make monitoring more time- and cost-efficient. The current applications of DNA in monitoring have mainly been of environmental DNA (eDNA), which is already used to detect great crested newts in ponds across the UK (Rees et al., 2014; Biggs et al., 2015). However, there are also applications for using DNA to identify individuals and bulk samples, particularly of invertebrates (Arribas et al., 2016). These methods are best applied to diverse groups which are challenging to identify using morphology. There are around 280 syrphid species in the UK (Ball and Morris, 2015), as well as many other insect flower visitors, including other Diptera, Coleoptera and Lepidoptera, making flower visitors a very diverse group to survey.
DNA barcoding is a widely used method which aims to distinguish species quickly and in a cost-effective manner. DNA barcoding was first used as a tool for identifying microbial communities, and Hebert et al. (2003) proposed adapting it for identifying animals, using a region of CO1 as an appropriate barcode. This barcode is still the most widely used for animals, including insects, today. Since then DNA barcoding has been adapted for animals, plants and bacteria, although different barcoding regions are used for different groups.
A DNA barcode is a section of DNA which has more between-species variation than within-species variation, resulting in a barcode gap between species. For insects a region of CO1 is widely accepted as the barcode of choice, and there is a wealth of data publicly available on online databases such as GenBank at NCBI. This can be very useful for identifying unknown specimens; however, there are also issues with online databases, as they are not curated and mistakes can be difficult to identify. Whilst the CO1 barcode is thought to be able to distinguish species, in reality there is a large range of variability between groups, with some fast-evolving or newly separated groups unable to be split into species.
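The barcode gap described above can be expressed as a simple comparison: the largest distance between sequences of the same species should be smaller than the smallest distance between sequences of different species. The Python sketch below illustrates this with uncorrected p-distances on three short made-up sequences. The sequences, species labels and function names are hypothetical, and real analyses would use full-length aligned barcodes and often a corrected distance model.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    skipping positions where either sequence has a gap or an N."""
    pairs = [(x, y) for x, y in zip(a, b) if x not in "-N" and y not in "-N"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def barcode_gap(records):
    """records: list of (species, aligned_sequence) tuples.
    Returns (max within-species distance, min between-species distance)."""
    within, between = [], []
    for (sp1, s1), (sp2, s2) in combinations(records, 2):
        (within if sp1 == sp2 else between).append(p_distance(s1, s2))
    return max(within, default=0.0), min(between, default=float("nan"))


# Made-up 20-base sequences for two hypothetical species.
records = [
    ("species_A", "ACGTACGTACGTACGTACGT"),
    ("species_A", "ACGTACGTACGTACGTACGA"),
    ("species_B", "ACTTACGAACGTTCGTACGG"),
]

max_within, min_between = barcode_gap(records)
print(max_within, min_between, min_between > max_within)
```

When the minimum between-species distance is not larger than the maximum within-species distance, the gap closes and the barcode alone cannot separate the species, which is the situation described for fast-evolving or recently separated groups.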
DNA barcoding of individual specimens can now be done in large volumes, because next generation Illumina sequencing provides enough reads for many samples. The method introduced by Shokralla et al. (2015) allows dual tagging, firstly during the primary PCR using a 6 base pair tag, and secondly during the library preparation. Large numbers of samples can be pooled in the same library, and separated post-sequencing using bioinformatics.
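The post-sequencing separation step can be pictured as sorting reads by the tag at their start. The Python sketch below is a minimal illustration, assuming a single 6 base pair tag at the 5' end of each read and exact matching only. The tag sequences, sample names and function name are hypothetical; this is not the pipeline of Shokralla et al. (2015), and production tools also handle tags on both ends, mismatches and primer trimming.

```python
def demultiplex(reads, tag_to_sample, tag_length=6):
    """Assign each read to a sample by its leading tag and trim the tag.
    Reads whose tag is not recognised are kept whole under 'unassigned'."""
    bins = {sample: [] for sample in tag_to_sample.values()}
    bins["unassigned"] = []
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(insert)
    return bins


# Hypothetical tags and reads.
tag_to_sample = {"AACCGG": "specimen_01", "TTGGCC": "specimen_02"}
reads = [
    "AACCGGACGTACGTAC",
    "TTGGCCTTTTACGTAC",
    "AACCGGACGTACGTAA",
    "GGGGGGACGTACGTAC",  # tag not in the table
]

bins = demultiplex(reads, tag_to_sample)
for sample, seqs in bins.items():
    print(sample, len(seqs))
```

Keeping unrecognised reads in a separate bin, rather than discarding them silently, makes it easier to spot tag errors or contamination in a run.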
One of the biggest challenges of using DNA for monitoring is a lack of reference databases. Without a comprehensive reference database it is still possible to get valuable information which can be used for wide scale biodiversity studies (Creedy, Ng and Vogler, 2019), but identification to species level is used for all major monitoring schemes in the UK, and thus it is important that robust databases are created and curated. There are several public databases such as GenBank (Clark et al., 2016) and BOLD (Ratnasingham and Hebert, 2007) where sequences are stored, and these provide a valuable resource for identification. However, the nature of public databases is that they contain information which cannot always be verified, and so it is possible that errors in identification occur. They are also incomplete, as not all species, or even all known species, have been barcoded. When setting up a methodology for monitoring a group of species, it is important that the reference databases are curated, to reduce any online database errors, to fill gaps caused by missing sequences and to ensure that the CO1 barcode is suitable for species delimitation. The creation of a reference database requires specimens which have been collected and identified by expert taxonomists, and which ideally are held in public collections so that they can be validated in the future. This highlights the ongoing importance of taxonomists, and of the institutions which house natural history collections. These provide vital sources of DNA which is robustly identified. They also contain storage facilities for DNA collections, which are important for retaining the physical DNA of reference specimens.
Metabarcoding studies often use operational taxonomic units (OTUs) as a proxy for species. These are clusters of closely related sequences which are closer in distance to each other than to other clusters. In general, for the CO1 barcode a 3% barcode gap is used, with sequences that are ≥97% similar clustered together. However, the barcode distance between species is not static. For cryptic species, which may be the result of recent rapid evolution, the CO1 barcode may not have diverged enough to distinguish these species based on 97% clustering (Čandek and Kuntner, 2015). Alongside this, clustering results in a loss of data within communities, as population level differences within species are masked by the clustering process. There have been moves towards using individual sequences, or haplotypes, with metabarcoding data (Elbrecht et al., 2018; Turon et al., 2019) as this generally removes confusion between species. Haplotypes are unique sequences, and if they are mapped on to a reference sequence with a very high threshold for sequence matches, different haplotypes within a species can also be monitored. This could be important for conservation, as it would show how connected populations are, and whether there are unique differences in a species between sites. Using haplotypes rather than OTUs gives a much more detailed picture of populations and results in more data for conservation decisions.
The aim of this study is to develop a reference dataset for UK hoverfly species, which will be a vital resource for future research, as well as for future monitoring programmes. Alongside this the study aims to investigate the success of DNA for species delimitation in a family containing many cryptic species, by using a monitoring sample as a test of identification. The use of OTUs for DNA identification will be compared to unique haplotypes, which are hypothesised to be more accurate and informative. This study complements that by Creedy et al. (2019), where bee specimens from the national pollinator monitoring programme were used to create a reference database for UK bees.
Methodology
Samples for the reference database
The sequences used to form the reference database for UK Syrphidae species were compiled using specimens from three sources: the reference samples from the National Pollinator and Pollination Monitoring Framework (NPPMF) pilot, fresh specimens from the Natural History Museum collections and online sequence records from the BOLD database.
Samples were caught during the 2015 pilot of the NPPMF, using line transects with hand netting, and pan traps at 12 sites across the UK throughout the summer (Carvell et al., 2016). The pan trap samples were transferred to ethanol, and the bees and syrphids were then sorted out from the rest of the samples. The syrphids were sent to professional taxonomists for identification, and then sent for DNA analysis. A reference set of specimens was selected to represent the species present in the dataset, and these were kept separate from the other syrphids. These reference specimens were used to create the DNA reference dataset.
DNA was extracted from a single leg from each specimen, which was crushed and incubated overnight in lysis buffer made up of 180µl ATL and 20µl proteinase K, before following the Qiagen blood and tissue extraction protocol. A 600bp barcoding fragment of CO1 was amplified using the primers BEEf and BEEr, following the same protocol as in Creedy et al. (2019). PCR reactions consisted of between 1 and 2.5 μl of DNA, 0.4 μM of each primer (at 10 mM), 2.5μl of the NH4 reaction buffer (Bioline, London, UK), 1.5 mM of MgCl2 solution, 200nM of each dNTP, 1 unit of BioTaq™ (Bioline), and ddH2O up to the final volume of 25μl. Standard PCR cycle conditions for CO1 developed by the Canadian Centre for DNA Barcoding (CCDB) were used: initial denaturation at 94°C for 2 minutes, followed by 5 cycles of 94°C for 30 seconds, annealing at 45°C for 40 seconds, and extension at 72°C for 1 minute, then 35 cycles of 94°C for 30 seconds, annealing at 51°C for 40 seconds, and extension at 72°C for 1 minute, and a final extension at 72°C for 10 minutes. The success of the PCR was checked by gel electrophoresis, and PCR products were purified using a QIAquick PCR Purification Kit (Qiagen) and sequenced in both directions using ABI dye terminator sequencing (Thermo Fisher Scientific, Waltham, MA, USA). Sequence chromatograms were assembled into contigs and manually edited using Geneious v5.3.6 (Kearse et al., 2012). The sequences were uploaded to BOLD in the BEEEE database along with bee reference sequences from the NPPMF.
The second source of data was from specimens caught over the summers of 2015 and 2016 by a curator of Hymenoptera at the Natural History Museum London (NHM), D. Notton. Samples were pinned and identified by D. Notton and by the Diptera curator and taxonomist N. Wyatt, before being accessioned into the collections at the NHM. DNA was extracted from a single leg which was crushed and incubated in lysis buffer made up of 180µl ATL and 20µl proteinase K, overnight on a shaking incubator at 56°C, and then DNA was extracted using the Qiagen blood and tissue DNA extraction kit. The 418bp barcoding region of CO1 was amplified for each specimen using primers BF and foldR. PCR reactions consisted of 2μl of DNA, 0.4 μM of each primer (at 10 mM), 2.5μl of the NH4 reaction buffer (Bioline, London, UK), 1.5 mM of MgCl2 solution, 0.25µl of each dNTP, 1 unit of BioTaq™ (Bioline), and ddH2O up to the final volume of 25μl. The PCR conditions were as follows: an initial denaturation at 94°C for 4 minutes, followed by 40 cycles of 94°C for 30 seconds, annealing at 48°C for 30 seconds, and extension at 72°C for 45 seconds, and a final extension at 72°C for 10 minutes. The success of the PCR was checked by gel electrophoresis, and the PCR product was purified using the AMPure XP bead purification kit to remove sequences below 200bp. The samples were sent for library preparation and sequencing on an Illumina HiSeq lane.
Post sequencing, forward and reverse sequences were merged using Pear, with a quality score of 26, and primers were removed. The paired end reads were put through NAPselect (Creedy, 2019), which selected the most abundant unique sequence for each sample, bootstrapping to check that it was significantly more abundant than the other reads. The top sequence was also blasted against all of the Diptera on GenBank (Clark et al., 2016), to reject any sequences that were contamination from other organisms such as fungi. To check for contamination between samples, a Muscle (Edgar, 2004) alignment of these sequences and the NPPMF barcodes was made in Geneious (Kearse et al., 2012), and a maximum likelihood phylogenetic tree was run in RAxML (Stamatakis, 2014) to ensure that genera were monophyletic, and that no sequences were misplaced. Sequences were then uploaded to BOLD and can be found in the SyrUK database.
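The selection step can be illustrated with a short sketch. This is not the NAPselect implementation, whose internals are not described here; it is a minimal Python example that assumes the reads for one sample are available as a list of strings, and the function name, the number of bootstrap replicates and the 0.95 support threshold are all illustrative assumptions.

```python
import random
from collections import Counter


def select_barcode(reads, n_boot=1000, min_support=0.95, seed=0):
    """Return (sequence, support) for the most abundant unique read.

    support is the fraction of bootstrap replicates in which that sequence
    is strictly the most abundant. The sequence is returned as None when
    support falls below min_support (an assumed, illustrative threshold).
    """
    if not reads:
        return None, 0.0
    rng = random.Random(seed)
    top_seq, _ = Counter(reads).most_common(1)[0]
    wins = 0
    for _ in range(n_boot):
        # Resample the reads with replacement and recount.
        counts = Counter(rng.choices(reads, k=len(reads)))
        top_count = counts.get(top_seq, 0)
        # The candidate wins only if no other sequence ties or beats it.
        if all(c < top_count for s, c in counts.items() if s != top_seq):
            wins += 1
    support = wins / n_boot
    return (top_seq if support >= min_support else None), support
```

A sample dominated by one sequence returns that sequence with high support, whereas a sample in which two sequences are equally common returns no barcode, flagging it for manual inspection.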
Finally, the data was supplemented with sequences from the online database BOLD. A list of all UK hoverfly species was compiled using Britain’s Hoverflies (Ball and Morris, 2015). These species were searched for on BOLD and any public sequences were downloaded in FASTA format.
Database cleaning and species delimitation
Before a well curated reference dataset could be collated, the sequences were quality filtered and the identifications tested. Sequences with ambiguities or missing data were removed, as these are indicative of lower quality data.
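As an illustration of this filtering step, the following minimal Python sketch keeps only sequences made up entirely of unambiguous bases. It assumes the records are held as (name, sequence) pairs and that ambiguity codes, N and gap characters all count as ambiguities or missing data; the function names are hypothetical and do not correspond to any script used in this study.

```python
def is_clean(sequence, allowed="ACGT"):
    """True if the sequence is non-empty and contains only unambiguous bases."""
    seq = sequence.upper()
    return len(seq) > 0 and all(base in allowed for base in seq)


def filter_records(records):
    """Keep (name, sequence) records with no ambiguity codes or gap characters."""
    return [(name, seq) for name, seq in records if is_clean(seq)]
```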
To determine whether the CO1 barcodes could separate species, three species delimitation analyses were carried out on genera with more than one sequence. Species were defined by the name of each sequence as identified by a taxonomist. Species could be lumped or split, or a combination of both. In lumped species the barcode gap was not large enough to separate them from another species of the same genus, and so the sequences clustered together. For split species the sequence variation within species meant that the barcode gap was just as wide within as between species, and resulted in the species being split into more than one group of sequences. For some species both of these issues were found, so that some of the sequences were split between two barcode groups whilst others were found with sequences from another species. This method of defining the barcode clustering was used to align with that used in Creedy et al. (2019).
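The lumped and split categories defined above can be expressed as a simple rule on the groups returned by any delimitation method. The Python sketch below is illustrative only: it assumes each sequence has been reduced to a (species, group) pair, and the label "match" for a species that occupies exactly one group on its own is a name introduced here.

```python
from collections import defaultdict


def classify_species(assignments):
    """assignments: iterable of (species, group_id) pairs, one per sequence.

    Returns {species: "match" | "split" | "lumped" | "both"}.
    """
    groups_of = defaultdict(set)   # species -> groups it occupies
    species_in = defaultdict(set)  # group -> species it contains
    for species, group in assignments:
        groups_of[species].add(group)
        species_in[group].add(species)

    result = {}
    for species, groups in groups_of.items():
        split = len(groups) > 1
        lumped = any(len(species_in[g]) > 1 for g in groups)
        if split and lumped:
            result[species] = "both"
        elif split:
            result[species] = "split"
        elif lumped:
            result[species] = "lumped"
        else:
            result[species] = "match"
    return result
```

Because the rule depends only on group membership, the same function could be applied to the output of each delimitation method, which keeps the categories comparable between them.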
Firstly, for each genus the barcodes were aligned in Geneious (Kearse et al., 2012) using Muscle (Edgar, 2004), including an outgroup from a closely related genus. An xml file was created in BEAUti with a strict clock, Yule speciation and the evolutionary model GTR+I+G, and a tree was produced using BEAST v.8 (Drummond et al., 2012) on the CIPRES Science Gateway (Miller, Pfeiffer and Schwartz, 2010). This tree was checked in FigTree and then used to run a GMYC analysis in R using the package splits. The number of species per genus was recorded, as was the monophyly of each species and whether they were split or lumped with another species. Secondly, ABGD analysis was carried out by uploading the same Muscle alignments to the ABGD web server. The ABGD analysis was run using a Kimura distance of 2.0 (Puillandre et al., 2012), and the output was once again checked for whether species were split, lumped or both. The final species delimitation method used was BOLD BINs, which are used to define species on the BOLD database. The number of BINs a species occupied, and any other species in that BIN, was recorded from the BOLD website.
For each of the analyses the results were investigated for obvious misidentifications, for example single sequences lumped with another species. Secondly, sequences were removed where there was a geographic split. Since this is a UK reference database, sequences that split separately from UK barcodes were removed. After running the species delimitation methods, a final filtered dataset was produced containing sequences which had been established to be correct and geographically relevant. This filtered reference database was used to re-run the species delimitation analyses to get a final picture of the species delimitation capacity of the database.
After the barcodes were filtered, all of the remaining barcodes were clustered at 97% similarity using Usearch10. Clustering is the method most commonly used in metabarcoding to obtain OTUs, and so this tested whether this method can be used to find OTUs in syrphid datasets. Firstly, the barcodes were de-replicated so that only unique sequences were included in the clustering. Usually at this stage singleton sequences are discarded, but some species were only represented by one sequence, and these were retained by specifying -sizeout 1. Then the sequences were clustered at 97% similarity. The parameter -uparseout was used to produce a table showing each barcode and whether it formed a new OTU or was a >97% match to an existing OTU. Each species was checked to see whether it was split between OTUs, lumped in with another species, or split between OTUs which also contained other species.
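To make the logic of de-replication followed by 97% clustering concrete, a simplified sketch is given below. It is not the UPARSE algorithm implemented in usearch, which also handles alignment and chimera detection. It is a basic greedy centroid clustering written in Python, which assumes that all sequences have already been aligned and trimmed to the same length, so that similarity is simply the proportion of matching positions.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(sequences, threshold=0.97):
    """De-replicate, then assign each unique sequence to the first centroid
    it matches at >= threshold identity, otherwise start a new OTU.

    Returns a list of (centroid, members) tuples.
    """
    counts = Counter(sequences)
    # Most abundant unique sequences are considered first; singletons are
    # kept. Ties are broken alphabetically so that the result is repeatable.
    uniques = sorted(counts, key=lambda s: (-counts[s], s))
    otus = []
    for seq in uniques:
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            otus.append((seq, [seq]))
    return otus
```

The outcome of greedy clustering depends on the order in which sequences are considered, which is one reason why a fixed 97% threshold can lump or split closely related species.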
The sequences were also split into individual haplotypes. This was done by blasting the reference sequences against themselves and then grouping them based on a 100% match along the whole of the CO1 barcoding region. The haplotypes were analysed to see how many were present for each species, and to check whether any haplotypes were shared between species. The ability of haplotypes to delimit species was then compared to the clustering analysis.
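The haplotype grouping can be sketched in a few lines of Python. In this study the grouping was done with a BLAST search of the reference sequences against themselves; the example below instead assumes that every barcode has already been trimmed to exactly the same CO1 region, so that a 100% match is equivalent to the two sequence strings being identical. The function name and the species labels used to call it are hypothetical.

```python
from collections import defaultdict


def haplotype_summary(records):
    """records: iterable of (species, sequence) pairs trimmed to one region.

    Returns (per_species, shared): the number of distinct haplotypes per
    species, and a dict of haplotypes found in more than one species.
    """
    species_by_hap = defaultdict(set)
    haps_by_species = defaultdict(set)
    for species, seq in records:
        hap = seq.upper()
        species_by_hap[hap].add(species)
        haps_by_species[species].add(hap)
    per_species = {sp: len(h) for sp, h in haps_by_species.items()}
    shared = {h: sorted(sp) for h, sp in species_by_hap.items() if len(sp) > 1}
    return per_species, shared
```

Any haplotype reported as shared cannot be assigned to a single species on sequence alone.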
Testing the reference database
To test the ability of the reference database to accurately identify a large number of syrphid specimens, it was used to identify specimens caught as part of the UK wide pilot pollinator monitoring scheme, NPPMF (Carvell et al., 2016). As mentioned previously, some of these specimens were removed by taxonomists and were used in developing the reference database. However, the rest of the specimens were identified by the taxonomists and returned to the pan trap samples, and thus provided a large sample of syrphid specimens which could be used to test the reference database.
DNA was extracted from each syrphid specimen by piercing the abdomen and incubating the whole specimen in lysis buffer overnight on a shaking incubator at 56°C. DNA was extracted using the Biosprint, according to the Qiagen Biosprint protocol. The 418bp barcoding region of CO1 was amplified for each specimen using the primers BF and FoldR. The primers contained Illumina tails for library preparation, as well as a six base pair tag, so that samples could be pooled together (Shokralla et al., 2015). PCR was carried out using 2.5µl of Bioline buffer, 1µl of MgCl2, 0.25µl of dNTP mix, 0.75µl each of forward and reverse primer, 0.1µl of BioTaq™ (Bioline), 2µl of DNA, and 15.4µl of ddH2O to give a total volume of 25µl. The PCR was run with an initial denaturation of 94°C for 4 minutes, and then 45 cycles of 94°C for 30 seconds, 45°C for 30 seconds and 72°C for 45 seconds. The PCR products were visualised on an agarose gel and then purified using AMPure XP beads to retain DNA above 200bp. 96 libraries were made by amplifying using i5 and i7 Nextera XT indices with 96 unique tags, and then quantified using the TapeStation. The libraries were pooled in equimolar concentrations and sequenced on an Illumina MiSeq.
The libraries were separated by the sequencing facility back into 96 libraries based on the NextEra indices. NAPdemux (Creedy, 2019) was used to separate out the samples based on the 6 base pair tags using cutadapt, while also pairing forward and reverse reads based on sequence names. The sequences were then combined, and the primers removed using Pear, with a quality score of 26.
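The principle of separating pooled samples by their tags is shown in the sketch below. NAPdemux carries this out with cutadapt; the Python example here is only a simplified illustration which assumes that both the forward and the reverse read begin with a 6 base pair tag, that tags must match exactly, and that the combination of the two tags identifies the sample. The tag sequences and sample names used to call it are hypothetical.

```python
TAG_LENGTH = 6


def demultiplex(read_pairs, tag_to_sample, tag_length=TAG_LENGTH):
    """Assign (forward, reverse) read pairs to samples by their leading tags.

    tag_to_sample maps (forward_tag, reverse_tag) to a sample name.
    Returns (assigned, unassigned): assigned maps each sample to a list of
    tag-trimmed read pairs; unassigned counts pairs with no matching tags.
    """
    assigned = {}
    unassigned = 0
    for fwd, rev in read_pairs:
        key = (fwd[:tag_length], rev[:tag_length])
        sample = tag_to_sample.get(key)
        if sample is None:
            unassigned += 1
            continue
        # Remove the tags so that only the amplicon sequence is kept.
        trimmed = (fwd[tag_length:], rev[tag_length:])
        assigned.setdefault(sample, []).append(trimmed)
    return assigned, unassigned
```

Requiring an exact match on both tags is conservative: reads with a sequencing error in either tag are counted as unassigned rather than risk being placed in the wrong sample.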
Two methods were used to obtain the barcode sequence for each specimen. The first was using NAPselect, as for the reference barcodes from the NHM. The second method used clustering to obtain OTUs, using usearch90. The OTUs were blasted against the reference database and the top OTU blasting to Syrphidae was selected. This differed from the NAPselect method because a 97% similarity was used, allowing some sequence variation. The NAPselect barcodes were identified in two ways, firstly using 97% similarity against the reference dataset, and secondly using 100% similarity to the reference dataset. The 100% similarity method matched the haplotype method used for separating the reference barcodes. The species determinations for each method were compared against each other and against the morphological identifications using ggplot2 (Wickham, 2016) in R.
Secondary OTUs
Although the NPPMF specimens were sequenced individually, they came from mixed samples and were sequenced at a greater depth than was required for obtaining a single barcode. The samples were therefore also investigated for secondary OTUs sequenced alongside the target specimen. This was done using the NAPcluster script which utilises usearch90. A strict target length of 418bp was used, as any length variation is likely to be caused by pseudogenes and PCR and sequencing errors. The minimum group size was increased from the default of 2 sequences to 5, as very rare OTUs are likely to be sequencing and PCR errors. Finally, denoising was carried out on the samples to detect any remaining errors.
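The two filters described above, a strict target length and a minimum group size, can be sketched as follows. This is not the NAPcluster script itself but a minimal Python illustration, which assumes the merged and primer-trimmed reads for one sample are available as a list of strings; the defaults mirror the values given in the text.

```python
from collections import Counter


def filter_candidates(reads, target_length=418, min_size=5):
    """Return {sequence: count} for unique reads that have exactly
    target_length bases and occur at least min_size times."""
    counts = Counter(reads)
    return {
        seq: n
        for seq, n in counts.items()
        if len(seq) == target_length and n >= min_size
    }
```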
Results
Samples for the reference database
The number of species and sequences from each dataset is shown in Table 1. These numbers were reduced by filtering, and the final database covered 73% (209/284) of syrphid species present in the UK. There was a high level of species overlap between the datasets, with the two specimen datasets used in this study each having fewer than ten unique species, as can be seen in Table 1. The split of missing species across genera can be seen in Figure 1a, with seven genera not represented in the database, although these are all genera with under three UK species. The majority of genera are well represented in the reference dataset. The 20% of species that were not included in the reference set included species which are critically endangered or nationally scarce, as is shown in Figure 1b. This figure also shows that 16% of species with no barcode do not fall into these official categories but have fewer than ten records on GBIF, and so have rarely been found in studies of UK Syrphidae.
            NHM samples    NPPMF reference    BOLD sequences    Total
Species     43 (4)         45 (3)             211 (123)         228
Barcodes    129            75                 3222              3682
Table 1. Number of barcodes and species found in each of the three datasets used in this study before filtering. The number of species unique to that dataset is shown in brackets.
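[Figure 1, panels A and B. Legend categories recovered from the figure: critically endangered; nationally scarce; near threatened; priority species; species with under 10 records; none. Percentage labels recovered from the figure: 2%, 16%, 15%, 1%, 2%, 20%.]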