
GENOMIC ANALYSES OF THE ANOPHELES PUNCTULATUS GROUP: INSIGHTS INTO

MOSQUITO BIOLOGY AND IMPLICATIONS FOR VECTOR CONTROL AND DISEASE

TRANSMISSION

by

KYLE JOSEPH LOGUE

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisors: Dr. Peter A. Zimmerman, Ph.D. and Dr. David Serre, Ph.D.

Department of Biology

CASE WESTERN RESERVE UNIVERSITY

May, 2016

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

Kyle Joseph Logue

Candidate for the degree of Doctor of Philosophy*.

Committee Chair

Ryan Martin, PhD

Committee Member

Christopher Cullis, Ph.D

Committee Member

Michael Benard, Ph.D

Committee Member

Peter A. Zimmerman, Ph.D

Committee Member

David Serre, Ph.D

Date of Defense

January 25, 2016

*We also certify that written approval has been obtained

for any proprietary material contained therein

Table of Contents

List of Tables
List of Figures
Acknowledgements
List of Abbreviations
Glossary
Abstract

I. Chapter 1. The Biology of Anopheles punctulatus Species in Papua New Guinea
   A. Introduction: Anopheles mosquitoes and disease
   B. Systematics and distribution of Anopheles species in Papua New Guinea
      1. Discovery of the Anopheles punctulatus group
      2. Species determination
         a. Morphology
         b. Cross-mating studies
         c. Molecular techniques
      3. Geographic distribution
      4. Species distribution and larval habitats
         a. An. punctulatus s.s.
         b. An. koliensis
         c. An. farauti s.s.
         d. An. hinesorum
         e. An. farauti 4
         f. Minor An. punctulatus species
      5. Species relationships
   C. Vector ecology of An. punctulatus species
      1. Host feeding patterns
         a. Methods used to identify the meals of hematophagous
         b. Host feeding patterns of Anopheles in Papua New Guinea
      2. Vectorial status
         a. Malaria
         b. Lymphatic filariasis
   D. Vector control
      1. History of vector control
      2. Vector control in Papua New Guinea
   E. Genomic data available for Anopheles
   F. Dissertation aims

II. Chapter 2. Genome Sequencing of the Anopheles punctulatus Sibling Species and Examination of Divergence Among Species
   A. Introduction
   B. Methods
      1. Mosquito collections and species identification
      2. Whole genome sequencing and de novo assembly of mitochondrial genomes
         a. Whole genome sequencing of five An. punctulatus mosquitoes
         b. De novo assembly of mitochondrial genome
      3. Targeted sequencing of mitochondrial genomes of An. punctulatus mosquitoes
         a. Long range primer design for targeted sequencing of mitochondrial genomes
         b. Long range amplification and multiplex sequencing
         c. De novo assembly of mitochondrial genomes
      4. De novo assembly of the nuclear genome of four An. punctulatus species
      5. Genome assembly comparisons between Anopheles mosquitoes
         a. An. punctulatus species
         b. An. punctulatus species and An. gambiae complex
         c. An. farauti s.s. genome assemblies
      6. Phylogenetic analysis and molecular dating of Anopheles mosquitoes using protein-coding genes of the mitochondrial genomes
         a. Reconstruction of evolutionary history
         b. Molecular dating
      7. Phylogenetic analysis of four AP species using orthologous regions of the nuclear genome
   C. Results
      1. Sequencing and assembly of mitochondrial and nuclear genomes
         a. Mosquitoes sequenced
         b. Assembly of mitochondrial genomes
         c. Sequencing and de novo assembly of the nuclear genome of four wild-caught mosquitoes of the An. punctulatus group
      2. Genome comparison of Anopheles mosquitoes
         a. Comparison of the genome assemblies of An. punctulatus species
         b. Comparison of An. farauti s.s. genome assembly to An. farauti s.s. genome assembled by the Broad
      3. Phylogenetic analysis
         a. Phylogeny of Anopheles mosquitoes using the mitochondrial genome
         b. Molecular dating of Anopheles using mitochondrial genome
         c. Phylogeny of four AP species using multiple nuclear loci
   D. Discussion
      1. Genome assembly and comparative analyses
      2. Molecular dating of Anopheles species using mitochondrial genome
      3. Evolutionary relationships of An. punctulatus species
      4. Implications for vector control initiatives

III. Chapter 3. Using Genomic Data to Investigate if Introgression is Occurring Between An. punctulatus Sibling Species
   A. Introduction
   B. Methods
      1. Identification of putative introgression candidates using sequence divergence between orthologous regions of the nuclear genome
      2. Identification of Single Nucleotide Polymorphisms (SNPs) in orthologous regions of the An. punctulatus s.s. and An. farauti 4 genomes
      3. Number of shared polymorphisms expected under different population histories
      4. Characterization of the historical demography of An. farauti 4 and An. punctulatus s.s.
   C. Results
      1. Phylogenetic analysis does not provide any evidence of introgression
      2. Analyses of shared polymorphisms between An. farauti 4 and An. punctulatus s.s. does not support recent gene flow between these species
      3. Characterizing the demographic history of An. punctulatus s.s. and An. farauti 4
   D. Discussion
      1. No evidence of contemporary gene flow among An. punctulatus sibling species
      2. Discordant demographic history of An. farauti 4 and An. punctulatus s.s. species
      3. Implications for disease transmission and implementation of vector control strategies in Papua New Guinea

IV. Chapter 4. Unbiased characterization of host feeding patterns of Anopheles punctulatus species by targeted high-throughput sequencing of the mammalian mitochondrial 16S rRNA
   A. Introduction: Current and historical techniques used to evaluate host blood meals
   B. Methods
      1. Ethics
      2. Sample collections
      3. DNA isolation and molecular species identification
      4. In silico assessment of mammalian mt 16S rRNA
         a. Amplification range of mt 16S rRNA primers
         b. Number of mammalian species amplified
         c. Ability of primers to differentiate among mammalian species
      5. Targeted high-throughput sequencing of mosquito host blood meals
         a. Primer design of mammalian mt 16S ribosomal RNA genes and human mt genome hypervariable
         b. Primer design for genotyping polymorphisms within the nuclear genome of An. punctulatus s.s.
         c. Targeted high-throughput sequencing
      6. Bioinformatic assessment of blood meal composition and population structure from individual mosquitoes
         a. Filtering sequencing data for analysis
         b. Identification of host blood meals from Anopheles mosquitoes
         c. Identification of human individuals fed on by mosquitoes
         d. Examination of population structure within An. punctulatus s.s. mosquitoes
   C. Results
      1. In silico assessment of mammalian mt 16S rRNA
         a. Amplification range of universal mammalian 16S rRNA primers
         b. Specificity of the universal mammalian 16S rRNA gene primer pairs
      2. Application of assay to field-caught female Anopheles mosquitoes
         a. Mosquitoes collected and sequencing reads
         b. Mammalian blood hosts identified from blood fed Anopheles mosquitoes
         c. Blood meal composition of individual An. punctulatus species
      3. Examination of population structure within An. punctulatus s.s.
      4. Evidence of mosquito blood meals from multiple human hosts
   D. Discussion
      1. Strength and utility of targeted sequencing approach to identify mammalian hosts blood meals from Anopheles mosquitoes
      2. DNA profiling of human maternal lineages from field collected mosquitoes
      3. Host feeding patterns of An. punctulatus mosquitoes in Papua New Guinea

V. Chapter 5. Project Summary
   A. Summary
      1. Generation of genomic data for four wild-caught An. punctulatus species
      2. An. punctulatus species are deeply diverged and reproductively isolated
      3. Development of an unbiased method to identify mammalian hosts fed on by An. punctulatus
   B. Contribution of results to vector control initiatives and disease transmission
      1. Vector control
      2. Disease transmission and vector competence
   C. Utility of high-throughput sequencing for understudied species
   D. Future directions

VI. Appendix A. Accession numbers and location of deposited sequencing data

VII. Appendix B. Distribution of the sequence coverage per contig for the initial assemblies

VIII. Appendix C. Origin and dispersal of Anopheles mosquitoes based on analysis of mitochondrial genome sequences

IX. Appendix D. Putative genes that are introgressing in An. punctulatus mosquitoes

X. Appendix E. Primers used for SNP amplification and barcoding amplicons for high-throughput sequencing

XI. Appendix F. Composition of AP sibling species blood meals by village and AP species

XII. References

List of Tables

Chapter 1
Table 1 Number of An. punctulatus sibling species screened by ELISA for the presence of P. falciparum, P. vivax and P. malariae
Table 2 Anopheles genomes assembled

Chapter 2
Table 3 Sample sequencing information
Table 4 Primers used to amplify mitochondrial genomes by long range PCR
Table 5 List of the samples used with their collection site or colony ID and corresponding NCBI accession numbers
Table 6 Summary of initial de novo genome assemblies before filtering out redundant contigs
Table 7 Summary of the Anopheles punctulatus species genome assemblies
Table 8 Summary of the An. farauti 4 assembly after sub-sampling sequencing reads to the coverage of An. punctulatus s.s.
Table 9 Percentage of DNA sequencing reads that align among species in the An. punctulatus group and An. gambiae complex
Table 10 Mean divergence times and 95% credibility intervals for selected nodes
Table 11 Divergence times using the mitochondrial DNA mutation rate
Table 12 Number of pairwise nucleotide differences (lower diagonal) and percent divergence (upper diagonal)

Chapter 3
Table 13 Summary of the genetic diversity in Anopheles farauti 4 and Anopheles punctulatus s.s.
Table 14 Summary of the genetic diversity in An. farauti 4 estimated after sub-sampling An. farauti 4 reads

Chapter 4
Table 15 Primers used to amplify mammalian host blood meals and the human mitochondrial hypervariable region I
Table 16 Summary of the amplification range and discriminatory power predicted for the mammalian 16S rRNA primers
Table 17 Proportion of nucleotide differences, including deletions, between sequences of species
Table 18 Number of mosquitoes that fed, at least partially, on each identified mammalian host according to their species
Table 19 Summary of blast results
Table 20 Summary of the hosts identified in the mosquito blood meals
Table 21 Mixed human blood meals

Appendix D
Appendix D Putative genes that are introgressing in An. punctulatus mosquitoes

Appendix E
Appendix E.1 Primers used to amplify polymorphisms in An. punctulatus s.s.
Appendix E.2 Primers used to add barcode and Illumina adaptor sequences for high-throughput sequencing

List of Figures

Chapter 1
Figure 1 Global distribution of the dominant malaria vectors
Figure 2 Distribution of the five main disease vectors of the An. punctulatus group
Figure 3 Representation of published phylogenies of members of the An. punctulatus group

Chapter 2
Figure 4 Coverage of whole genome sequencing reads on mitochondrial and nuclear genes
Figure 5 Multiplex sequencing method. Diagram of the steps used to amplify and sequence multiple mitochondrial genomes
Figure 6 Support of the Anopheles phylogeny using the concatenated DNA sequences of all mitochondrial protein coding genes
Figure 7 Phylogenetic tree of Anopheles using the concatenated DNA sequences of all mitochondrial protein coding genes
Figure 8 Multiple-alignment statistics for An. punctulatus group
Figure 9 Consensus tree topology of An. punctulatus sibling species based on analysis of 30,907 nuclear loci
Figure 10 Distribution of the internal branch length

Chapter 3
Figure 11 Identification of putative introgression candidates based on sequence divergence
Figure 12 Distribution of the sequence coverage per locus
Figure 13 Distribution of reference allele frequencies
Figure 14 Representation of branching patterns observed for 30,907 phylogenies reconstructed using PhyML
Figure 15 Distribution of the external branch lengths
Figure 16 Distribution of nucleotide differences between species
Figure 17 Amount and age of gene flow that can be excluded given the number of shared polymorphisms observed
Figure 18 Demographic history of An. farauti 4 and An. punctulatus s.s.
Figure 19 Influence of sequence coverage on the estimates of the demographic history of An. farauti 4

Chapter 4
Figure 20 Overview of the sequencing assay used to characterize the blood meal composition of individual mosquitoes
Figure 21 Summary of the sequencing depth for each mosquito sample
Figure 22 Number of heterozygous sites called at various levels of coverage
Figure 23 Neighbor-joining tree reconstructed using the DNA sequences predicted to be amplified
Figure 24 Box plot showing the number of sequencing reads generated per mosquito according to the visual blood meal status
Figure 25 Neighbor-joining phylogenetic tree showing the species relationships among species and
Figure 26 Composition of the blood meals for mosquitoes collected
Figure 27 Composition of the blood meals for each AP sibling species collected
Figure 28 Inference of population structure among An. punctulatus s.s. mosquitoes collected
Figure 29 Neighbor-joining trees showing the relationships among the human mtDNA haplotypes

Appendix B
Appendix B Distribution of the sequence coverage per contig for the initial genome assemblies

Appendix F
Appendix F.1 Composition of blood meal for mosquitoes collected in the village of Dimer
Appendix F.2 Composition of blood meal for mosquitoes collected in the village of Wasab
Appendix F.3 Composition of blood meal for mosquitoes collected in the village of Kokofine
Appendix F.4 Composition of blood meal for mosquitoes collected in the village of Matukar
Appendix F.5 Composition of blood meal for mosquitoes collected in the village of Mirap

Acknowledgements

First, I would like to express my sincere appreciation and gratitude to both of my graduate advisors, Peter Zimmerman, PhD and David Serre, PhD, who have invested countless hours mentoring me over the past five and a half years. I would like to thank both of you for your encouragement and for pushing me to be a better scientist. Pete has been instrumental in talking with collaborators and acquiring samples for my projects and, as he puts it, being a traffic controller. He has also pushed me to dig deeper in analyzing my data and has provided valuable insights into what it takes to successfully coordinate and execute scientific projects. David has invested many hours patiently training me to program in Perl and properly analyze genomic data, and has taught me that if the data look too good to be true, it is either an artifact or I messed up my code. He has also been able to keep his sense of humor while lamenting having to read and edit the first drafts of my papers.

I would also like to thank my committee members Mike Benard, PhD and Christopher Cullis, PhD, who have provided helpful criticisms and stimulating conversations that have been invaluable during my graduate school years. I am fortunate to have been surrounded by a group of scientists that are so passionate about my research.

Many members of both the Zimmerman and Serre Labs have provided training, moral support and comic relief over the past five years. I must thank Ricky Chan, PhD and Jim Hester for fielding my numerous questions about coding, Unix, and bioinformatics programs; they practically trained me in bioinformatics. Ricky has also provided sound advice and has always been there when I had a question. I also want to thank Matt Cannon, PhD for his assistance over the years and for being a knowledge bank when it comes to R commands. In addition, I want to express my appreciation for Scott Small, PhD. Scott has allowed me to bombard him with population genetics questions over the past couple of years and has contributed both his time and abilities to several projects that we have worked closely on. This research would not have been possible without the efforts of Lisa Reimer, PhD, John Keven, Ned Walker, PhD and Cara Henry-Halldin, PhD, who have provided field collections of mosquitoes.

A special thanks to my family who have shown unwavering support and been available when I needed them most. My mom, dad and brother have prayed for me continuously and encouraged me to achieve my goals. I especially want to thank my wife who has been there to encourage me day and night, prayed for me, and sacrificed immensely so that I could finish my dissertation. She has spent numerous days and nights taking care of our precious daughter while I was writing my dissertation and conducting last minute analyses. She is one of the strongest women I know and I consider myself fortunate to be her husband. I would also like to thank my daughter for the immense joy, fun, and motivation she provided while I was writing this document. Lastly, and most importantly, I would like to thank God for guiding my steps and enabling me to complete this enormous task. He has provided hope when I was hopeless, rest when I was weary and strength when I was weak. None of this would have been possible without Him.

List of Abbreviations

AF s.s. – Anopheles farauti sensu stricto (formerly An. farauti 1)

AH – Anopheles hinesorum (formerly An. farauti 2)

AF4 – Anopheles farauti 4

AK – Anopheles koliensis

AP – An. punctulatus group

AP s.s.– Anopheles punctulatus s.s.

COI – Cytochrome Oxidase I gene

COX2 – Cytochrome Oxidase II gene

ELISA – Enzyme-linked immunosorbent assay

gDNA – Genomic DNA

ITS2 – Internal transcribed spacer unit 2

LLIN – Long lasting insecticidal nets

MRCA – Most Recent Common Ancestor

Mt – Mitochondrial genome

MYA – millions of years ago

PCR-RFLP – Polymerase chain reaction-restriction fragment length polymorphism

PNG – Papua New Guinea

PNGIMR – Papua New Guinea Institute of Medical Research

16S rRNA – 16S ribosomal RNA

SEA – Southeast Asia

SW – Southwest

VGSC – Voltage Gated Sodium Channel gene

WB – Wuchereria bancrofti

Glossary

ABySS – De novo short read assembly algorithm that assembles short DNA sequencing reads into larger contiguous sequences (contigs).

aLRT – The approximate Likelihood Ratio Test is a method that evaluates the branch support of maximum likelihood trees by comparing the likelihood of the most likely tree to the likelihood of the second most likely tree inferred. The branch support values are scaled from 0 to 1, where 0 means there is no likelihood of the branch existing while 1 indicates that there is a 100% likelihood of the branch existing.

Arlequin – A population genetics data analysis program that enables intra-population and inter-population analyses.

BEAST – (Bayesian evolutionary analysis by sampling trees) Phylogenetics program for Bayesian analysis of molecular sequences using MCMC. This program is used to reconstruct phylogenetic trees and to test molecular clock models and evolutionary hypotheses based on phylogenetic trees.

BLAT – (BLAST-like alignment tool) A sequence alignment algorithm that conducts local searches against a known reference sequence and only reports sequence alignments that are 95% or greater in similarity.

Bowtie – Ultrafast short read alignment algorithm that enables the alignment of short DNA sequence reads to genome assemblies. The algorithm is optimized to quickly align large DNA sequencing data sets (hundreds of millions of reads). A limitation of this program is its inability to identify insertions and deletions.

Bowtie2 – Updated version of Bowtie that is better at aligning short DNA sequencing reads to larger reference genome assemblies (e.g. mammalian genomes). Unlike its predecessor, this algorithm enables the identification of insertions and deletions.

BWA – (Burrows-Wheeler Aligner) Program for aligning short DNA sequences against large reference genome assemblies. Both short DNA sequencing reads and longer DNA sequences from 70 bp to 1 Mb in size can be aligned, allowing reads from various sequencing platforms, including Illumina, 454, Ion Torrent and Sanger, to be aligned.

Covaris – Instrument for mechanically shearing DNA sequences into shorter fragments for preparation of sequencing libraries for high-throughput sequencing.

FigTree – Graphical viewer for displaying phylogenetic trees, generally used for showing trees reconstructed using the program BEAST.

jModelTest – Statistical program that identifies the best nucleotide substitution model for reconstructing phylogenetic trees.

MUMmer – Algorithm for aligning whole genome assemblies. The program identifies exact DNA sequence matches of 20bp or longer.

mlRho – Program that estimates the population mutation rate and recombination rate of an organism using short DNA sequencing reads from the genome of a single diploid individual.

ms – Population genetic simulation program that generates sample data for a variety of neutral population genetic models for use in studies of genetic diversity. The program can be used to test a variety of assumptions that influence the patterns of genetic diversity including migration, recombination rate, and population size.

Pandaseq – Program to collapse overlapping paired-end Illumina sequencing reads that also corrects sequencing errors in the overlapping regions based on the quality values of each overlapping nucleotide.

Phred quality score (Q) – A quality score that is assigned to each nucleotide base sequenced by an automated DNA sequencer, including Sanger and Illumina sequencing platforms. The Phred scores are logarithmic and correspond to the probability of an incorrect base call. For example, a Phred score (Q) of 10 indicates a 1 in 10 chance of the base call being incorrect, a Q of 20 is 1 in 100, while a Q of 30 is 1 in 1,000.
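The logarithmic relationship in the Phred entry can be written as Q = -10 log10(P), where P is the probability that a base call is incorrect. The short Python sketch below is illustrative only and is not part of the analyses in this dissertation; the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Print the error probability for the three scores used as examples above.
for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```

Because the scale is logarithmic, every increase of 10 in Q corresponds to a tenfold drop in the chance of an incorrect base call.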

Primer3 – Tool to design and optimize PCR primers.

PSMC – The Pairwise Sequentially Markovian Coalescent model infers the population history of a diploid organism by examining changes in the distribution of heterozygous sites across the genome of a single individual. The program uses this distribution to identify loci that have the same Time to the Most Recent Common Ancestor (TMRCA) (i.e. both alleles coalesce at the same time in history indicating that these polymorphisms were inherited together). Since hundreds of thousands of loci, each with a TMRCA, are present across the genome of a diploid organism, a distribution of TMRCAs can be generated that encompasses the history of the organism. This distribution is then used to infer the effective population size across the history of an organism’s population.

QUAKE – Program that identifies and corrects errors in DNA sequencing reads using quality values and known sequencer error rates and biases for Illumina sequencing instruments. The program is able to detect sequencing errors by generating all possible substrings of the sequencing reads (i.e. k-mers) and then evaluating the distribution and frequency of all unique k-mers. Since sequencing errors are rare, any k-mers that are rarely observed are assumed to contain sequencing errors. The program compares these k-mers to well supported k-mers that are very similar and, when possible, corrects sequencing errors.
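The k-mer counting idea behind this entry can be sketched in a few lines of Python. This is a simplified illustration, not QUAKE's implementation: it ignores base quality values, and the fixed `min_count` cutoff, the toy reads and the function names are assumptions made for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rare_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: three identical reads plus one read with a single base change.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspect = rare_kmers(counts, min_count=2)
print(sorted(suspect))
```

In this toy data set, the k-mers that overlap the changed base are each observed only once, so they are the ones flagged as likely errors, while k-mers supported by several reads are kept.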

QUAST – A tool for comparing genome assemblies.

Structure – Population genetic software package that uses multi-locus genotype data to determine if there is any population structure. The program infers distinct populations by assigning individuals to populations based on allele frequencies.

Threaded Blockset Aligner (TBA) – Software package for conducting local multiple DNA sequence alignments that is able to align a large number of genome sequences that are assembled into chromosomes or contigs.

Tracer – Graphical tool for assessing the mixing of Markov chains to determine if tree space has been adequately sampled in a BEAST analysis (BEAST is a phylogenetic tree reconstruction program).

TranslatorX – Alignment program that translates DNA sequences to protein, aligns the protein sequences and then translates the alignment back into DNA.
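The final step in this entry, mapping a protein alignment back onto the original DNA, can be illustrated with a minimal Python sketch. It assumes the DNA sequence is in frame, that its length is a multiple of three, and that gaps in the protein alignment are written as "-"; the function name is hypothetical and the code is not taken from the program itself.

```python
def back_translate(aligned_protein, dna):
    """Thread the codons of dna onto a gapped protein alignment."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(dna) % 3 != 0 or n_residues != len(codons):
        raise ValueError("DNA length does not match the protein sequence")
    codon_iter = iter(codons)
    aligned_dna = []
    for aa in aligned_protein:
        # A gap in the protein alignment becomes a three-base gap in the DNA.
        aligned_dna.append("---" if aa == "-" else next(codon_iter))
    return "".join(aligned_dna)


aligned = back_translate("M-K", "ATGAAA")
print(aligned)
```

Aligning at the protein level and then restoring the codons keeps gaps in multiples of three bases, so the reading frame of protein-coding genes is preserved in the DNA alignment.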

Genomic Analyses of the Anopheles punctulatus Group: Insights into Mosquito Biology and Implications for Vector Control and Disease Transmission

Abstract

by

KYLE JOSEPH LOGUE

Members of the Anopheles punctulatus (AP) group are the primary vectors of malaria and filariasis in Papua New Guinea. The AP group includes 13 morphologically similar sibling species (the An. farauti complex includes 8 morphologically identical species) that have overlapping geographic distributions. Currently, little is known about the biology of the AP group as few genetic markers are available and laboratory studies are difficult. Here, I used high-throughput sequencing methods to generate genetic data for five members of the AP group. I used these data to understand the divergence, extent of introgression and blood feeding patterns of An. punctulatus s.s., An. koliensis, An. farauti s.s., and An. farauti 4. I interpret these results in the context of vector control and disease transmission in Papua New Guinea.

I sequenced the mitochondrial and nuclear genomes of these mosquitoes and used phylogenetic analysis to estimate species divergence. Phylogenetic analysis indicates that AP species rapidly diverged in the past and current species are 6-9% divergent. I confirmed that species have evolved independently by searching the nuclear genome for signs of introgression. Analyses of more than 50 Mb of orthologous nuclear DNA sequences revealed no evidence of contemporary introgression among the AP species studied.

Currently, the host feeding patterns of AP species are not well defined. I developed a novel targeted high-throughput sequencing technique that provides an unbiased and comprehensive perspective on the composition of each mosquito blood meal. I tested this method on 442 female AP mosquitoes from

Papua New Guinea. My analyses revealed that 16.3% of the mosquitoes fed on more than one host and predominately fed on humans, and , but also fed on a marsupial species and .

Overall, my findings reveal the utility of high-throughput sequencing for exploring the biology of understudied species. The data and techniques generated in my dissertation will aid future studies of understudied mosquitoes and enable us to better understand vector-related traits. My method to query blood meals can be used alongside genomic data to understand mosquito behavior, and could also be used to study the feeding patterns of other insect vectors of disease.

I. CHAPTER 1

The Biology of Anopheles punctulatus Species in Papua New Guinea

A. Introduction: Anopheles mosquitoes and disease

Anopheles mosquitoes are the vectors of several important human infectious diseases including malaria, lymphatic filariasis and arboviruses (Krzywinski & Besansky 2003). Malaria is one of the most common vector-borne diseases and is widely distributed throughout the subtropical and tropical regions of the world including Africa, Southeast Asia and the Southwest Pacific (WHO 2015). In 2015, there were 214 million malaria cases and half a million deaths, predominantly among African children, recorded worldwide (WHO 2015). Additionally, it is estimated that over half of the world’s population lives at risk of malaria infection (Gething et al. 2012; Gething et al. 2011; WHO 2015).

Five Plasmodium species are the causative agents of human malaria: P. falciparum, P. vivax, P. malariae, P. ovale, and, more recently, P. knowlesi. All are transmitted by Anopheles mosquitoes. Plasmodium species undergo sexual reproduction in the gut of female Anopheles mosquitoes, their definitive host, and then develop into an infective sporozoite stage and travel to the mosquito salivary glands. The parasites are transmitted to humans by exploiting the female Anopheles mosquito’s requirement of a blood meal to obtain nutrients necessary for egg production. During host feeding, the infective sporozoites are injected into the human host through the mosquito’s proboscis.

Anopheles species that transmit malaria are typically found near human habitations and have a preference for feeding on human hosts (anthropophilic). However, feeding preferences are influenced by the abundance and availability of hosts, and, in the absence of a preferred host, anthropophilic species are known to feed on non-human hosts. Feeding preferences vary between Anopheles species and, within closely related species complexes, it is common to find some species that prefer human hosts while others prefer non-human hosts (zoophilic).

Additionally, the time and location of blood-feedings vary between Anopheles species. Female anophelines feed between dusk and dawn hours, but the peak feeding time varies between species, and has been shown to change after the implementation of vector control measures. Also, some species prefer to feed indoors (endophagic) while others prefer to feed outdoors (exophagic). It is important that the feeding patterns and behaviors of Anopheles mosquitoes are comprehensively examined and understood as this enables us to understand the disease transmission potential of malaria and other mosquito-borne pathogens.

Anopheles mosquitoes are distributed worldwide, with the exception of Antarctica (Sinka et al. 2012), and are the only mosquitoes known to transmit human malaria. Of the approximately 500 species within the Anopheles genus, at least 70 are able to transmit malaria (Hay et al. 2010). These include the well-known species An. gambiae, An. arabiensis and An. funestus that are the main vectors of malaria in Africa, An. darlingi and An. albitarsis in South America, An. dirus and An. minimus in Southeast Asia and An. punctulatus sibling species in the Southwest Pacific (Figure 1). Many Anopheles species are closely related sibling species that can be organized into species complexes or groups. Members of the African An. gambiae complex have been the most extensively studied of all Anopheles mosquitoes since several members, particularly An. gambiae, are highly competent vectors of malaria and are responsible for the majority of malaria transmission globally.

Unfortunately, most non-African Anopheles species have been less well studied and their biology remains poorly understood. As genome data is readily generated today, whole genome sequencing can be used to increase our understanding of the biology of non-African Anopheles species. This genomic data enables us to conduct comparative genomic analyses to, for example, examine species definitions and evaluate if introgression is occurring between Anopheles sibling species.

Figure 1. Global distribution of the dominant malaria vectors, adapted from Figure 1 of Sinka et al. (2012). The red box indicates the location of Papua New Guinea where members of the An. punctulatus group are the main malaria vectors.

B. Systematics and distribution of Anopheles species in Papua New Guinea

1. Discovery of the Anopheles punctulatus group

Members of the Anopheles punctulatus (AP) group were initially characterized in the early 1900s as two species: An. punctulatus (Donitz 1901), collected from the Madang region of Papua New Guinea (PNG), and An. farauti (Laveran 1902), collected from Efate, Vanuatu (previously New Hebrides). Later on, two additional species were described: An. koliensis (Owen 1945; Rozeboom & Knight 1946) and An. clowi (Rozeboom & Knight 1946), a rarely observed species collected from Humboldt Bay, Hollandia (present-day Jayapura), West New Guinea (present-day Indonesia). After extensive taxonomic studies of the adult, larval and pupal morphology of these species during World War II, these four closely related species were grouped together into the An. punctulatus complex (Rozeboom & Knight 1946). However, there was some doubt as to the species classifications, as some workers believed that An. koliensis was an intermediate form of An. punctulatus and An. farauti (Woodhill 1946) while others maintained that An. koliensis was a separate species (Belkin 1962; Rozeboom & Knight 1946).

The three main species, An. punctulatus, An. farauti and An. koliensis, are widely distributed throughout the Southwest Pacific, from the Moluccan Islands (formerly the Spice Islands) in the west through New Guinea to the Solomon Islands and Vanuatu (formerly New Hebrides). An. farauti is also widely distributed throughout the northern portion of Australia (Rozeboom & Knight 1946).

2. Species determination

a. Morphology

Rozeboom and Knight described the morphological characteristics of the original four members of the An. punctulatus complex (Rozeboom & Knight 1946). They differentiated adult female AP species based on the coloration of the proboscis scales and, to a lesser extent, a sector spot on the wings. The morphology of the proboscis separated AP species into An. punctulatus, An. farauti and An. koliensis. Female anophelines that had the distal half of the proboscis covered with black scales, with the remaining half covered in white scales, were characterized as An. punctulatus. In An. farauti the proboscis was covered entirely with black scales, with the exception of the terminal end, which was white. The morphology of the proboscis of An. koliensis was less distinct and varied considerably in the proportions of the proboscis that were white and black.

Generally, a patch of white scales on the apical half of the proboscis was characteristic of An. koliensis, but the size of the patch varied considerably. In fact, the variability in proboscis morphology led some to believe that An. koliensis was an intermediate, or hybrid, form of An. punctulatus and An. farauti, as the proboscis at one extreme looked like that of An. punctulatus, while at the other extreme it looked like that of An. farauti (Woodhill 1946). In addition to proboscis morphology, a dark sector spot on the costal wing vein was also used to differentiate An. koliensis from the other AP species, as this spot was absent from An. koliensis but present in the other species (Belkin 1962). Since An. clowi has been rarely observed (Cooper et al. 2000; Rozeboom & Knight 1946), I will not discuss the morphology of its proboscis here.

b. Cross-mating studies

Due to the inability to establish colonies using the laboratory methods available in the 1940s and 1950s, the species definitions of the An. punctulatus complex remained unchanged and the hypothesis that An. koliensis was a hybrid form of An. punctulatus and An. farauti could not be tested. However, in the early 1960s an induced mating technique was developed that enabled laboratory colonies of mosquitoes to be established (Baker et al. 1962). Using this technique in the 1970s, Joan Bryan established laboratory colonies of An. punctulatus, An. koliensis and An. farauti, identified by proboscis morphology, by collecting eggs laid by female mosquitoes from 12 locations in New Guinea and Northern Australia.

Bryan conducted cross-matings within and between the established colonies and found that the colonies could be classified into four distinct groups. When mosquitoes in the same colony were mated the F1 progeny were fertile; however, cross-matings between colonies resulted in F1 generations that were either sterile or non-viable. Using these results and morphological characters, the colonies were determined to be An. punctulatus, An. koliensis and two morphologically identical, yet reproductively isolated, An. farauti colonies: when these were cross-mated, all F1 adults were sterile. Bryan classified these colonies as two distinct species; one colony, established from Rabaul, New Guinea, was named An. farauti 1 (currently An. farauti s.s.), while the other colony, established from Queensland, Australia, was named An. farauti 2 (currently An. hinesorum). Cross-matings also revealed that An. koliensis was a separate species and not a hybrid of An. punctulatus and An. farauti (Bryan 1973c). In the 1980s, Mahon and Miethke identified another distinct species, An. farauti 3 (currently An. torresiensis), that was morphologically identical to, but reproductively isolated from, An. farauti s.s. and An. hinesorum. This species was found sympatrically with An. farauti s.s. and An. hinesorum, and all F1 hybrids were found to be sterile (Mahon & Miethke 1982).
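The logic of delimiting species from cross-mating outcomes can be illustrated with a minimal sketch: colonies whose crosses yield fertile F1 progeny are merged into one reproductive group, and every other colony stays apart. The colony labels and cross results below are hypothetical placeholders, not Bryan's data, and the function name is invented for illustration.

```python
def reproductive_groups(colonies, fertile_pairs):
    """Merge colonies linked by fertile F1 crosses into reproductive groups."""
    parent = {c: c for c in colonies}

    def find(c):
        # Follow parent links to the group representative (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in fertile_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for c in colonies:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical colonies; only the P1 x P2 cross is assumed to give fertile F1s.
colonies = ["P1", "P2", "K1", "F_Rabaul", "F_Queensland"]
fertile_pairs = [("P1", "P2")]
groups = reproductive_groups(colonies, fertile_pairs)
print(groups)
```

Under these assumed inputs the two morphologically identical An. farauti colonies fall into separate groups because no fertile cross links them, which mirrors the reasoning used to recognize cryptic species.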

In addition to cross-mating experiments, Bryan also compared the chromosome banding patterns of An. farauti s.s. and An. hinesorum colonies, along with those of the sterile F1 hybrids, and identified two paracentric inversions, one on chromosome 2R and the other on chromosome 2L (Bryan & Coluzzi 1971). These inversions were later confirmed by Mahon using colonies established from 11 different locations in Northern Australia and Vanuatu. Using the same techniques as Bryan and Coluzzi, Mahon identified two inversions on the X chromosome that were unique to An. torresiensis (Mahon 1983).

c. Molecular techniques

Molecular studies conducted from the 1990s through 2008 revealed that the AP species complex was much larger than the six species described above: An. punctulatus, An. farauti s.s., An. hinesorum, An. torresiensis, An. koliensis and An. clowi. Allozyme assays, conducted in the early 1990s, revealed the presence of four additional species of the An. farauti complex: An. farauti 4 (Madang, PNG), An. farauti 5 (near Goroka, PNG), An. farauti 6 (currently An. oreios) (near Tari, PNG) (Foley et al. 1993) and An. farauti 7 (currently An. irenicus) (Guadalcanal, Solomon Islands) (Foley et al. 1994). These morphologically indistinguishable species were considered distinct based on the electrophoretic banding patterns of 35 enzyme loci, which differed from those of the control An. farauti s.s., An. hinesorum and An. torresiensis colonies (Foley et al. 1994; Foley et al. 1993). A fifth species, named An. species near punctulatus and collected from the Western Province of PNG, was also identified by examining allozyme banding patterns (Foley et al. 1995). In 2008, an additional cryptic species within the An. farauti complex, An. farauti 8, was identified in the Central Province of PNG. This species was considered distinct based on the divergence of its internal transcribed spacer unit 1 (ITS1) DNA sequence from those of the other seven An. farauti species (Bower et al. 2008). However, further studies have not been conducted to validate this species. Currently, the 12 An. punctulatus species identified above are all considered members of the An. punctulatus group.

In addition to allozyme assays, other molecular methods were developed to distinguish among the morphologically identical species of the An. punctulatus group. A squash blot technique was developed that enabled the discrimination of five AP species (An. punctulatus, An. koliensis, An. farauti s.s., An. hinesorum and An. torresiensis) using radioactively labeled, species-specific genomic probes constructed for each species. The abdomen of an individual mosquito was ‘squashed’ onto a nylon sheet and subsequently probed to detect species-specific DNA (Beebe et al. 1994).

PCR-based technologies have also been developed to differentiate among cryptic anopheline species and have become the main methods of species diagnosis for AP species. A PCR-RFLP (restriction fragment length polymorphism) technique was developed that enabled the differentiation of 10 An. punctulatus sibling species using the internal transcribed spacer unit 2 (ITS2) gene. A 750 bp portion of the ITS2 gene was amplified and then digested using the restriction endonuclease Msp I. Species-specific banding patterns were identified from established colonies where the AP species identity was already known. The benefit of this approach was that it was relatively fast, more accurate than morphology and could identify AP species across a wide range of geographic regions (Beebe & Saul 1995). Today, two high-throughput multiplex assays are available that can rapidly and accurately identify the main disease vectors of the AP group (An. punctulatus s.s., An. koliensis, An. farauti s.s., An. hinesorum and An. farauti 4) by combining PCR with probes targeting species-specific polymorphisms in either the ITS2 (Henry-Halldin et al. 2011) or voltage-gated sodium channel (VGSC) (Henry-Halldin et al. 2012) gene. The benefit of these approaches is that 96 samples can be processed quickly and inexpensively, and the assays are easily adapted to include additional species-specific probes.
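The principle behind PCR-RFLP species typing can be sketched in a few lines: an amplicon is cut wherever the enzyme's recognition site occurs, and the resulting fragment sizes are matched against patterns from reference material. The sketch below uses the Msp I recognition site (CCGG, cut after the first C); the sequences, species labels and function names are hypothetical toy values and not real ITS2 data.

```python
MSPI_SITE = "CCGG"  # Msp I recognition sequence
CUT_OFFSET = 1      # Msp I cuts between the first and second base (C^CGG)


def digest(seq, site=MSPI_SITE, offset=CUT_OFFSET):
    """Return fragment lengths (5' to 3') after cutting seq at every site."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def banding_pattern(seq):
    """Fragment sizes sorted large to small, as bands would appear on a gel."""
    return tuple(sorted(digest(seq), reverse=True))


# Hypothetical reference amplicons from colonies of known identity.
reference = {
    "species_A": "A" * 10 + "CCGG" + "T" * 6,
    "species_B": "A" * 20,
}
library = {banding_pattern(s): name for name, s in reference.items()}

unknown = "A" * 10 + "CCGG" + "T" * 6  # hypothetical field sample
print(library.get(banding_pattern(unknown), "no match"))
```

A real assay compares band sizes estimated from a gel, so matching would need a size tolerance rather than the exact comparison used in this toy example.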

3. Geographic distribution

Members of the An. punctulatus group are distributed across the Southwest Pacific, including Papua New Guinea, the Solomon Islands, Vanuatu and Northern Australia. In Papua New Guinea, 11 of the 13 members of the An. punctulatus group have been found, with four species, An. punctulatus s.s., An. koliensis, An. farauti s.s. and An. hinesorum, being widespread throughout the country. An. farauti 4 has a more limited distribution but is widely distributed north of the central ranges in mainland PNG (Cooper et al. 2002) (Figure 2). Five of the species have either been rarely observed or have a limited distribution throughout PNG: An. torresiensis, An. farauti 5, An. oreios, An. spp. near punctulatus and An. clowi (Cooper et al. 2002; Foley et al. 1995). The most recently identified member of the AP group, An. farauti 8, was found in the central ranges (Bower et al. 2008), but its distribution has yet to be studied.

Figure 2. Distribution of the five main disease vectors of the An. punctulatus group adapted from Cara Henry-Halldin (originally adapted from Cooper 2002 Figures 3-7). The red box depicts the Madang Province of PNG.

In the Solomon Islands and Vanuatu, seven members of the An. punctulatus group have been found: An. farauti s.s., An. hinesorum, An. farauti 4, An. irenicus, An. koliensis, An. punctulatus s.s. and An. rennellensis. Of these species, An. punctulatus s.s. is not commonly found, having been observed only on the islands of Guadalcanal and Malaita (Beebe et al. 2000a; Samarawickrema et al. 1992), and An. koliensis was possibly eradicated by vector control measures, as it has not been observed since the 1990s (Beebe et al. 2000a; Foley et al. 1994; Samarawickrema et al. 1992). An. rennellensis has only been found on a remote island, and no molecular studies have been conducted to verify its species status and placement in the AP group (Maffi 1973). An. farauti s.s. is the most widely distributed species and is the only member of the AP group that has been found in Vanuatu, likely because it is one of the few members of the AP group that can breed in brackish water (Foley et al. 1993; Sweeney 1987). In Northern Australia, An. farauti s.s., An. hinesorum and An. torresiensis are the only AP species that have been found (Cooper et al. 1995; Sweeney et al. 1990).

4. Species distribution and larval habitats

The most comprehensive study of the distribution of AP species in mainland Papua New Guinea was conducted between 1992 and 1998 in collaboration with the Australian army. Through the use of army helicopters and four-wheel drive vehicles, a large portion of PNG was surveyed, enabling a total of 795 sites to be examined and a total of 22,970 anophelines to be collected. This extensive survey utilized a variety of techniques to collect anophelines from each life-cycle stage (larva, pupa and adult). These methods included larval collections from a variety of both permanent (e.g. ponds, rivers, swamps) and transient (e.g. wheel tracks, footprints, potholes) water bodies; adult female anophelines were collected using CO2 traps and human landing catches conducted during sundown and sunrise. By using a helicopter, this study had the unique ability to collect anophelines that were not in close proximity to human habitations and from areas where roads were not available (Cooper et al. 2002). Because of its unique and apparently comprehensive geographic coverage, I will focus on this study to summarize the distribution of the main malaria vectors of PNG: An. punctulatus s.s., An. koliensis, An. farauti s.s., An. hinesorum and An. farauti 4 (Beebe et al. 2015). All of these species, with the exception of An. punctulatus s.s., were found to be positively associated with human habitations (Cooper et al. 2002). I will also briefly discuss the distribution and larval habitats of the minor disease vectors where information is available.

a. An. punctulatus s.s.

An. punctulatus s.s. was collected throughout PNG, from 211 of the 795 study locations, except in areas with drier climates (for example, south of the Fly River), and was only sparsely observed in the Southern plains. This species was readily found in lowland river valleys, but was also common on the coast and in the highlands. Unlike other members of the AP group, An. punctulatus s.s. almost exclusively utilizes unestablished, transient breeding pools such as road ruts and wheel tracks created by humans. Artificial addition of other AP species into these sites was unsuccessful (Charlwood et al. 1986), but larvae of An. farauti s.s., An. hinesorum and An. koliensis were found together with An. punctulatus s.s. at several sites. Of the main vectors of the AP group, only An. punctulatus s.s. was not found to be associated with humans; however, Cooper and colleagues noted difficulties collecting adult An. punctulatus s.s. mosquitoes using CO2 traps in locations where An. punctulatus s.s. larvae were readily collected, which most likely confounded their analysis (Cooper et al. 2002).

b. An. koliensis

An. koliensis was found in 246 of 795 locations distributed across PNG and was commonly found on the north side of the central ranges in the Sepik, Ramu and Markham River valleys. On the south side of the central ranges this species had a more limited distribution and, like An. punctulatus s.s., was rarely collected south of the Fly River, but was commonly found east of the Papua Gulf. An. koliensis larvae were commonly found in natural breeding sites, including ground pools and the edges of creeks and rivers. In addition to these sites, this species was also found in wheel ruts and drains located near human dwellings. An. koliensis was commonly found in breeding pools with other members of the An. punctulatus group, with the exception of An. farauti 6 (Cooper et al. 2002).

c. An. farauti s.s.

An. farauti s.s. (formerly An. farauti 1) was collected from 239 of 795 sample locations and was typically found within 1 km of the coastline of PNG. However, An. farauti s.s. was also found at a few inland locations that were >10 km (42 locations) and >100 km (2 locations) from the coast. Larvae of this species were typically found in natural freshwater pools or swamps with varying salinity levels, but were also found in brackish pools (Cooper et al. 2002). This species is one of the few AP species that can breed in brackish water (Cooper et al. 1996; Sweeney 1987). An. farauti s.s. was also collected from man-made breeding pools (e.g. wells, wheel tracks) sympatrically with An. hinesorum, An. punctulatus s.s. and An. koliensis. From these collections, the salinity tolerance of An. farauti s.s. appears to have promoted its widespread distribution in coastal environments throughout PNG. Additionally, the presence of An. farauti s.s. was positively correlated with human habitations, but this species was also found in remote locations away from human habitations (Beebe & Cooper 2002).

d. An. hinesorum

An. hinesorum (formerly An. farauti 2) was the most common and widespread member of the AP group south of the Central ranges. This species was also widely distributed north of the Central ranges, where it was frequently found sympatrically with other members of the AP group, including An. punctulatus s.s., An. koliensis and An. farauti 4. In total, An. hinesorum was collected from 324 of 795 locations throughout PNG and was commonly found in lowland river valleys, 10 to 100 km from the coast, and along the coast. This species frequently oviposited in small ground pools or pools along riverbeds and creeks, but also utilized artificial, human-made habitats. This species, like the other AP species, had a positive association with human habitations (Cooper et al. 2002).

e. An. farauti 4

An. farauti 4 had a limited distribution compared to other members of the AP group and was only found in 43 of 795 locations, north of the Central Ranges in lowland river valleys (less than 300 m elevation). However, in the Sepik, Ramu and Markham river valleys it was the dominant AP species and was commonly found in man-made breeding pools such as drains, wallows and wheel tracks. Larval collections showed that An. farauti 4 was readily found in breeding sites with An. punctulatus s.s. and An. koliensis. This species was readily found near humans and, in the village of Lae, >90% of the biting anopheline mosquitoes collected by human landing catch were An. farauti 4. Interestingly, in Madang, similar collection strategies yielded different results: An. farauti 4 was readily found in light traps but was not captured by human landing catch (Cooper et al. 2002).

f. Minor AP species

An. torresiensis (formerly An. farauti 3) had a limited distribution and was only collected in the Southern plains of the Fly River, as it is adapted to drier climates like that of Northern Australia. The larvae of this species were found in natural, well-established breeding pools. An. farauti 5 has been only rarely observed in the highlands of PNG (Foley et al. 1998; Foley et al. 1993) and was not found during this collection. An. oreios (formerly An. farauti 6) was common in highland river valleys and intramontane plains where the elevation was >1,000 m above sea level. This species is the largest of the An. farauti species and is common in regions with high elevation (>1,500 m) (Cooper et al. 2002). An. species near punctulatus had a very limited distribution and was only sporadically found in lowland river valleys around the central ranges. This species was typically found in areas uninhabited by humans and used natural rock and ground pools for oviposition. An. spp. near punctulatus larvae were occasionally found in the same breeding sites as An. hinesorum and An. punctulatus s.s. Lastly, a rarely observed member of the AP group, An. clowi, was collected from one location approximately 600 km west of where it was originally discovered (Cooper et al. 2002).

5. Species relationships

The species relationships among members of the An. punctulatus group are not well defined. Initially, based on proboscis morphology and cross-mating studies, it was thought that An. farauti s.s. and An. hinesorum were most closely related, as they both had a black proboscis and a dark sector spot on the wing, and 45% of their F1 hybrids survived to adulthood (all F1 hybrids were sterile) (Bryan 1973b). Based on the percentage of viable, albeit sterile, F1 hybrids, An. koliensis was thought to be more closely related to An. farauti than An. punctulatus, which was the most diverged member of the four An. punctulatus colonies examined (Bryan 1973b). However, determining the relationships of members of the An. punctulatus group using morphological characters is difficult, as species in the An. farauti complex are morphologically indistinguishable. Another approach to understanding the relationships of AP sibling species is to evaluate DNA sequences that are phylogenetically informative, meaning they provide sufficient information (i.e. nucleotide differences) to define species relationships.

Phylogenetic studies of the AP group have been conducted using very few (n=6) DNA sequences: mitochondrial (mt) cytochrome oxidase I (COI) (Ambrose et al. 2012), mt cytochrome oxidase II (COX2) (Foley et al. 1998), cytochrome b (Cytb) (unpublished, Henry-Halldin), and the nuclear genes ITS2 (Beebe et al. 1999), voltage-gated sodium channel (VGSC) (Henry-Halldin et al. 2012), ribosomal protein S9 (rpS9) (Ambrose et al. 2012), and 18S ribosomal RNA (18S rRNA) (Beebe et al. 2000b). However, reconstructing species relationships using a single locus (or a few loci) will likely reveal not the actual species relationships but the history of that particular locus among species. Further studies that examine multiple nuclear loci (each possibly representing a different history) across the genome are needed to construct a robust phylogeny of the AP species.

Phylogenies independently reconstructed from three loci (COX2, 18S rRNA, ITS2) have consistently shown that An. farauti s.s. and An. irenicus (formerly An. farauti 7) group together and that An. farauti 4, An. punctulatus s.s. and An. species near punctulatus group together; however, the relationships among the latter species are incongruent among phylogenies (Beebe et al. 2000b; Beebe et al. 1999; Foley et al. 1998). The relationship of An. koliensis to the other AP species is also inconsistent: An. koliensis is basal in the AP group in the COX2 tree, but is more often placed between the Farauti and Punctulatus clades (see below) in the 18S rRNA (Beebe et al. 2000b), ITS2 (Beebe et al. 1999), VGSC (Henry-Halldin et al. 2012) and Cytb (unpublished, Henry-Halldin) phylogenies (Figure 3).

Beebe et al. grouped the members of the An. punctulatus group into the Farauti and Punctulatus clades based on the topology of the 18S rRNA tree and proboscis coloration (Beebe et al. 2000b). The Farauti clade consisted of all An. farauti species, excluding An. farauti 4, as they grouped together phylogenetically and had an all-black proboscis. The Punctulatus clade contained An. farauti 4, An. punctulatus and An. sp. near punctulatus, which grouped together in the 18S rRNA tree topology (Figure 3B) and had either a polymorphic (An. farauti 4) or half-black, half-white proboscis (An. punctulatus and An. sp. near punctulatus) (Beebe & Cooper 2002).

Figure 3. Representations of published phylogenies of members of the An. punctulatus group adapted from (A) Foley 1998 (COX2), (B) Beebe 2000 (18S), (C) Beebe 1999 (ITS2) and (D) Cara Henry-Halldin 2012 (VGSC). For clarity, only one individual per species was included in each phylogeny. The (D) VGSC phylogeny was constructed using DNA sequences from the five main disease vectors of the AP group. All available bootstrap values are displayed and show that none of the current topologies is well supported. Note that bootstrap values were not available for (A) Foley 1998.

It is thought, based on proboscis and wing morphology, that members of the Farauti clade are closely related. In fact, a recent study examined the species relationships of three An. farauti species (An. farauti s.s., An. hinesorum and An. irenicus) from locations across the Southwest Pacific and identified possible introgression between An. farauti s.s. and An. hinesorum in Southern Papua New Guinea. Putative introgression between these species was detected in the topology of the mt COI tree but not in either of the topologies reconstructed using two nuclear genes (rpS9 and ITS2) (Ambrose et al. 2012). Alternatively, sterile F1 hybrids of An. farauti s.s. and An. hinesorum may have been examined, or the ancestral mt COI lineage may still be present in both species from this area (i.e. incomplete lineage sorting). As a single locus does not provide the resolution necessary to identify gene flow between species, it is important that studies use multiple nuclear loci, preferably distributed throughout the genome, to evaluate whether gene flow is occurring among AP species in PNG.

C. Vector ecology of An. punctulatus group

1. Host feeding patterns

a. Methods used to identify the blood meals of hematophagous insects

A variety of methods have been developed to evaluate the blood hosts fed on by hematophagous (blood-sucking) insects. Traditionally, host blood meals have been identified using serological techniques such as precipitin tests and ELISAs. While these methods have provided a wealth of valuable information, issues such as the cross-reactivity of antibodies against blood hosts and the general inability to delineate blood hosts beyond the Order or Family taxonomic levels have limited their current use (Kent 2009). Additionally, the ability of these assays to detect blood hosts is limited by the availability of species-specific antibodies. With the advent of PCR-based techniques, the massive amount of publicly available DNA sequences for a wide range of species and the ability to generate new data through DNA sequencing, a number of molecular techniques have been developed that have greater specificity in identifying blood meal hosts. These techniques include direct DNA sequencing, group-specific multiplex PCR and PCR-RFLP.

The most direct way to identify the contents of the blood meal of a blood-sucking insect is to sequence a specific, homologous locus using conserved primers that amplify DNA from a wide range of species. The locus amplified needs to have a sufficient number of informative sites to differentiate between putative blood hosts. After the blood meal sequence is obtained, it is compared against all sequences in GenBank, or another DNA sequence repository, to identify the blood host or, if the host sequence is not in the database, its closest relative. This method is very effective for identifying blood meals in zoophilic insects that may feed on a wide range of unknown hosts. Despite its utility, this technique is not practical for quickly processing large numbers of samples. Additionally, in instances where mixed blood meals are present (i.e. where two or more hosts were fed on), additional expensive and labor-intensive procedures have to be conducted to identify the blood meals. This technique is widely used in the identification of mosquito blood meals but, instead of conserved primers, species-specific primers are designed that target known hosts. Typically, primers that target the Cytb gene (Cupp et al. 2004; Kent & Norris 2005; Molaei & Andreadis 2006; Molaei et al. 2006; Ngo & Kramer 2003) are used in the analysis of mosquito blood meals. However, the Cytb primers are designed to target either a subset of (Kent 2009; Molaei & Andreadis 2006) or specific (Kent & Norris 2005; Ngo & Kramer 2003) mammalian species, birds (Molaei & Andreadis 2006; Ngo & Kramer 2003) or reptile and amphibian species (Cupp et al. 2004), and are generally not able to amplify a wide range of species.
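The database comparison step of the direct sequencing approach can be illustrated with a toy sketch that scores a blood meal sequence against reference sequences by percent identity and reports the best match. In practice a search tool such as BLAST would be run against a repository; here the sequences are assumed to be pre-aligned and of equal length, and all names and sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


def closest_host(query, references):
    """Return (host name, percent identity) of the most similar reference."""
    scored = ((name, percent_identity(query, seq)) for name, seq in references.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical reference sequences and blood meal query.
references = {
    "host_A": "ACGTACGTAC",
    "host_B": "TCGTTCGTAA",
}
query = "ACGTACGTAA"
best = closest_host(query, references)
print(best)
```

A match below 100% identity, as in this example, corresponds to the case described above where the true host is absent from the database and only its closest relative can be reported.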

Another technique that is employed to identify the blood meals of mosquitoes is group-specific multiplex PCR. This assay uses conserved reverse primers and species-specific forward primers to amplify differentially sized DNA fragments of Cytb from pre-selected host species. These DNA fragments are analyzed by gel electrophoresis to identify the blood meal source (Kent & Norris 2005). This assay is adaptable, and specific primers have been developed for several avian orders (Ngo & Kramer 2003) and for reptiles and amphibians (Cupp et al. 2004). This approach is advantageous for quickly identifying expected hosts, but is not useful if a comprehensive examination of blood meals is desired, because unanticipated blood meal hosts may not be represented in the primer set used.
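Interpreting a multiplex PCR gel amounts to matching each observed band to the amplicon size expected for a pre-selected host. A minimal sketch of that lookup follows; the amplicon sizes, the size tolerance and the function name are hypothetical values chosen for illustration, not those of any published assay.

```python
# Hypothetical expected amplicon sizes (bp) for each host-specific primer.
EXPECTED_BP = {"human": 300, "pig": 450, "dog": 600}


def call_hosts(band_sizes_bp, expected=EXPECTED_BP, tolerance_bp=15):
    """Return the hosts whose expected amplicon matches an observed band."""
    hosts = set()
    for band in band_sizes_bp:
        for host, size in expected.items():
            if abs(band - size) <= tolerance_bp:
                hosts.add(host)
    return sorted(hosts)


print(call_hosts([305, 610]))  # two bands: a mixed blood meal
print(call_hosts([800]))       # band from a host outside the primer set
```

The second call returns an empty list, which illustrates the limitation noted above: a host that is not represented in the primer set cannot be identified.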

Similar to the method above, PCR-RFLP of Cytb has been used to identify the blood meals of ticks (Kirstein & Gray 1996), mosquitoes (Ngo & Kramer 2003; Oshaghi et al. 2006) and tsetse (Steuber et al. 2005). The difference between the techniques is that species-specific polymorphisms within the Cytb gene are targeted by restriction enzymes, and the digested PCR product is then differentially sized by gel electrophoresis. The drawback of this approach is that a library of restriction fragment patterns from a wide range of species is required to identify blood meals. This technique would therefore not be useful for identifying novel blood hosts and, like the other techniques, it cannot readily identify mixed blood meals.

Other methods utilizing species-specific DNA probes have also been developed, including TaqMan real-time PCR assays and reverse line-blot hybridization (van den Hurk et al. 2007). A TaqMan assay was developed that uses fluorescently labeled probes targeting Cytb to identify mosquito blood meals from native Australian mammals (van den Hurk et al. 2007). The advantages of this technique are increased sensitivity compared to traditional PCR, the ability to quantify the amount of DNA template and the rapid analysis of large batches of samples. Reverse line-blot hybridization also utilizes probes to identify host blood meal sources but, unlike the TaqMan assay, biotinylated blood meal PCR products are hybridized to oligonucleotide probes that are covalently bound to a nylon membrane. Host blood meal identification is determined by colorimetric visualization of hybridized PCR products with streptavidin-labeled peroxidase and a chemiluminescent substrate. Reverse line-blot assays have been developed using 18S rRNA and have been field tested to assess the blood meals of ticks (Pichon et al. 2003). The disadvantages of these assays are that they only target preselected species and that non-specific binding of probes could confound assay results.

b. Host feeding patterns of Anopheles in Papua New Guinea

Understanding host feeding patterns of Anopheles mosquitoes is important for assessing the vector competence (i.e. ability to both develop and transmit a pathogen) of a species. An important component of determining the vector competency of a mosquito species is evaluating its ability to transmit a pathogen to a human host. For example, Anopheles mosquitoes that feed more often on humans than other species (assuming all other parameters are the same) would be considered more competent human disease vectors.

Currently, very little is known about the feeding patterns of members of the An. punctulatus group, as only two studies have been published (Burkot et al. 1988; Charlwood et al. 1985). Both studies were conducted in the 1980s, when only four AP sibling species had been identified: An. punctulatus s.s., An. farauti s.s., An. hinesorum and An. koliensis. Since the 1980s, nine additional AP sibling species have been identified, six of them members of the An. farauti complex. Therefore, interpretation of the results of these studies is difficult, as there is uncertainty as to which species were analyzed. Additionally, these studies were conducted in one location, the Madang province of PNG, providing a limited understanding of the feeding behaviors of AP species across PNG.

The first study to assess the blood feeding patterns of AP mosquitoes was conducted in the Madang Province of PNG in 1985 (Charlwood et al. 1985). An. farauti s.s. (n=6,419), An. punctulatus s.s. (n=19) and An. koliensis (n=133) mosquitoes were collected from resting sites in eight villages. Precipitin tests were conducted to identify the hosts fed on, using anti-sera specific to a wide range of mammalian hosts, including humans, pigs, dogs, , opossum, and flying foxes, as well as non-specific anti-sera to mammals and birds. The results revealed that the AP species fed most often on humans, dogs and pigs; however, not all blood hosts could be identified using specific anti-sera. The human blood index (HBI), a metric of the anthropophily of a mosquito, was also calculated for the most abundant AP species collected, An. farauti s.s. The HBI for An. farauti s.s. varied widely between villages, ranging from 9% to 83% (mean 49.5%). That is, averaged across villages, 49.5% of the An. farauti s.s. mosquitoes collected had fed on a human host. These observations suggested that host availability was the reason the HBI varied so considerably between villages: in villages where more non-human hosts were present the HBI was lower, and in villages with fewer non-human hosts it was higher (Charlwood et al. 1985). This suggests that An. farauti s.s. is unspecialized with regard to its feeding preferences; however, it is unclear whether all An. farauti mosquitoes collected were An. farauti s.s., as only a subset of them were species typed.
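As a minimal sketch of how HBI figures such as those above are obtained, the following Python snippet computes the index as the fraction of identified blood meals that are of human origin. The village names and counts are hypothetical, chosen only so that the two villages reproduce the 9% and 83% extremes reported above; they are not data from Charlwood et al. (1985).

```python
def human_blood_index(human_fed, total_identified):
    """Fraction of identified blood meals that were taken on humans."""
    if total_identified <= 0:
        raise ValueError("total_identified must be positive")
    if not 0 <= human_fed <= total_identified:
        raise ValueError("human_fed must lie between 0 and total_identified")
    return human_fed / total_identified


# Hypothetical (human-fed, total identified) counts per village.
village_counts = {
    "village_A": (9, 100),    # many alternative hosts available
    "village_B": (415, 500),  # few alternative hosts available
}

hbi_by_village = {
    name: human_blood_index(human, total)
    for name, (human, total) in village_counts.items()
}

# Unweighted mean of the village-level indices.
mean_village_hbi = sum(hbi_by_village.values()) / len(hbi_by_village)

# Pooled index: all mosquitoes treated as a single sample.
pooled_hbi = human_blood_index(
    sum(human for human, _ in village_counts.values()),
    sum(total for _, total in village_counts.values()),
)

for name, hbi in hbi_by_village.items():
    print(f"{name}: HBI = {hbi:.1%}")
print(f"mean of village HBIs = {mean_village_hbi:.1%}")
print(f"pooled HBI = {pooled_hbi:.1%}")
```

The sketch also shows why a mean of village-level indices should be read with care: when sample sizes differ between villages, the unweighted mean and the pooled index can differ substantially.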

Burkot and colleagues also investigated the host blood meals of AP mosquitoes in the late 1980s in the Madang Province (Burkot et al. 1988). An. punctulatus s.s., An. farauti s.s. and An. koliensis mosquitoes were collected from outdoor and indoor resting sites in nine villages. The host blood meal of each mosquito was identified using enzyme-linked immunosorbent assays (ELISA) and antibodies specific to the following hosts: humans, pigs, dogs, cats, cows, horses, , rats, birds and opossums. The study found that AP species predominantly fed on humans, dogs and pigs. Burkot et al. also evaluated the feeding preferences of An. punctulatus, An. farauti and An. koliensis by calculating human blood indices (HBI) and feeding indices (FI; an index that examines the feeding preference of a species based on its blood meal sources and the abundance of the hosts available). They found that, based on HBI, An. punctulatus s.s. and An. koliensis fed more often on humans than An. farauti s.s. FI statistics showed that An. punctulatus s.s. and An. koliensis both preferred to feed on dogs compared to humans and on humans compared to pigs, whereas An. farauti s.s. preferred to feed on dogs compared to pigs and on pigs compared to humans. Overall, as in the Charlwood study described above, host feeding patterns varied widely among villages according to host abundance, providing further evidence that AP species are unspecialized, or generalist, with regard to their feeding preferences. Burkot’s study also found that 4.2% of the mosquitoes collected had fed on multiple blood hosts, with mixed blood meals containing any combination of human, and pig (Burkot et al. 1988).
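The feeding index described above can be illustrated with a short sketch. It assumes a commonly used formulation in which the observed ratio of blood meals on two hosts is divided by the ratio expected from the relative abundance of those hosts; whether Burkot et al. (1988) used exactly this form is an assumption, and all counts below are hypothetical.

```python
def feeding_index(meals_host1, meals_host2, abundance_host1, abundance_host2):
    """Observed ratio of meals (host 1 : host 2) divided by the ratio
    expected from host abundance.

    A value above 1 suggests host 1 is fed on more often than expected
    from its abundance relative to host 2; a value below 1 suggests the
    opposite.
    """
    if min(meals_host2, abundance_host1, abundance_host2) <= 0:
        raise ValueError("meals on host 2 and both abundances must be positive")
    observed = meals_host1 / meals_host2
    expected = abundance_host1 / abundance_host2
    return observed / expected


# Hypothetical village: humans are ten times as abundant as dogs,
# but receive only twice as many blood meals.
human_meals, dog_meals = 60, 30
humans_present, dogs_present = 200, 20

fi_human_vs_dog = feeding_index(
    human_meals, dog_meals, humans_present, dogs_present
)
print(f"FI (human vs dog) = {fi_human_vs_dog:.2f}")
```

In this hypothetical case the index is below 1, the kind of result that would be read as a preference for dogs over humans even though humans supply most of the blood meals in absolute terms.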

Members of the An. punctulatus group have been observed to have different biting times. In the Madang and East Sepik Provinces of PNG, An. punctulatus s.s. and An. koliensis seek blood meals after midnight, between the hours of 0200 and 0600, while An. farauti species tend to feed in the earlier hours of the evening, between 1800 and 2200 hours, depending on the species. For example, An. farauti tends to seek blood hosts from 1800 to 0600 hours, with a peak biting time of 2100 to 2200 hours, whereas biting occurs between 1800 and 2200 hours for An. hinesorum and An. farauti 4 (Benet et al. 2004). The biting times of An. farauti s.s. differ slightly in the Solomon Islands: in Temotu Province, the peak biting time of An. farauti s.s. was between 1800 and 2000 hours (Bugoro et al. 2011), whereas in northern Guadalcanal the peak biting time started an hour later, at 1900, and ended at 2000 hours (Bugoro et al. 2014).

2. Vectorial status

a. Malaria

The vectorial status of members of the An. punctulatus group has not been clearly established for either of the main diseases they transmit in PNG, malaria and lymphatic filariasis. The first evidence implicating AP species as vectors of malaria came in 1923, when it was observed that malaria parasites could develop in the gut and salivary glands of An. punctulatus after feeding on a child with malaria (Heydon 1923). Throughout the 20th century, An. farauti, An. punctulatus and An. koliensis were considered the main disease vectors based on their distribution, abundance and propensity for feeding on humans in regions where malaria and lymphatic filariasis were present (Black 1955; Charlwood et al. 1985; Charlwood et al. 1986; Heydon 1923; Spencer et al. 1974).

As molecular techniques became available and species definitions became clearer, a better understanding of the vectorial status of members of the An. punctulatus group was achieved. In the early 2000s, An. farauti s.s., An. farauti 4, An. punctulatus s.s. and An. koliensis, collected from either the Madang or Wosera region of PNG, were indirectly shown to be infected with Plasmodium falciparum and/or P. vivax by the presence of each Plasmodium species’ circumsporozoite antigen (Benet et al. 2004). In Madang, 13 of 583 AP mosquitoes analyzed were positive for both Plasmodium species, with the exception of An. punctulatus s.s. In Wosera, 1.5% of An. punctulatus s.s. (n=5,612), 0.3% of An. koliensis (n=6,437) and 0% of An. farauti s.s. (n=467) mosquitoes were infected with both P. falciparum and P. vivax.

The most recent and most comprehensive study to examine the presence of Plasmodium species in AP mosquitoes used samples collected between 1992 and 1998 (Cooper et al. 2009; Cooper et al. 2002). A total of 21,572 Anopheles punctulatus mosquitoes were screened by ELISA for P. falciparum, P. vivax and P. malariae circumsporozoite antigens. Six of the seven AP species examined were infected with Plasmodium: An. punctulatus s.s., An. farauti s.s., An. hinesorum, An. farauti 8, An. farauti 4 and An. koliensis (Table 1). The percentage of Plasmodium-infected mosquitoes varied between 0.4% and 1.22%, depending on the species examined, with An. punctulatus s.s. having the highest infectivity rate (1.22%) and An. farauti s.s. the lowest (0.4%). An. torresiensis was the only AP species not infected with Plasmodium (Table 1). Based on infectivity rates and the distribution of each species in PNG, An. punctulatus s.s., An. hinesorum, An. farauti 4 and An. koliensis were determined to be the major vectors of malaria. P. falciparum was present in 62% of the infected mosquitoes, followed by P. vivax (32%) and P. malariae (6%). All AP species were mainly infected with P. falciparum except for An. koliensis and An. farauti s.s., which were more often infected with P. vivax (Table 1) (Cooper et al. 2009).

Table 1. Number of An. punctulatus sibling species screened by ELISA for the presence of P. falciparum, P. vivax and P. malariae circumsporozoite antigens, adapted from Cooper et al. 2009.

Species                 # mosquitoes tested   P. falciparum (Pf)   P. vivax (Pv)   P. malariae   Pf and Pv
An. farauti s.s.        9692                  11                   23              7             0
An. hinesorum           1189                  6                    3               1             0
An. torresiensis        3                     0                    0               0             0
An. farauti 8           308                   2                    0               0             0
An. farauti 4           1535                  12                   3               0             0
An. punctulatus s.s.    245                   0                    2               0             1
An. koliensis           8600                  13                   23              2             3
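The per-species infection percentages quoted in the text can be recomputed from Table 1 with a few lines of Python. This is only a sketch: it assumes that the four result columns are mutually exclusive (a mosquito counted under "Pf and Pv" is not also counted under Pf or Pv), so that positives can simply be summed across columns. The variable and function names are illustrative.

```python
# Rows transcribed from Table 1:
# (species, number tested, Pf, Pv, P. malariae, Pf and Pv)
TABLE_1 = [
    ("An. farauti s.s.", 9692, 11, 23, 7, 0),
    ("An. hinesorum", 1189, 6, 3, 1, 0),
    ("An. torresiensis", 3, 0, 0, 0, 0),
    ("An. farauti 8", 308, 2, 0, 0, 0),
    ("An. farauti 4", 1535, 12, 3, 0, 0),
    ("An. punctulatus s.s.", 245, 0, 2, 0, 1),
    ("An. koliensis", 8600, 13, 23, 2, 3),
]


def infection_rate(tested, pf, pv, pm, mixed):
    """Proportion of tested mosquitoes positive for any Plasmodium antigen.

    Assumes the four result columns do not overlap.
    """
    return (pf + pv + pm + mixed) / tested


rates = {species: infection_rate(*counts) for species, *counts in TABLE_1}
total_tested = sum(row[1] for row in TABLE_1)

# Print species from highest to lowest infection rate.
for species, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{species:<22}{rate:7.2%}")
print(f"total mosquitoes tested: {total_tested}")
```

Under this assumption the highest and lowest non-zero rates belong to An. punctulatus s.s. and An. farauti s.s., respectively, which agrees with the ranking given in the text.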

b. Lymphatic filariasis

Members of the An. punctulatus group are also vectors of lymphatic filariasis in PNG. The development of Wuchereria bancrofti (WB), the causative agent of lymphatic filariasis in PNG, was first described in An. punctulatus s.s. in 1934. Captured An. punctulatus s.s. mosquitoes were fed on the arm of an infected individual and then dissected over a period of 14 days; the various lifecycle stages of the parasite were present in the mosquitoes (Backhouse 1934). Further work in the 1940s, using the same techniques, revealed that An. koliensis and An. farauti were also naturally infected with WB (Perry 1950; Toffaleti & King 1947).

Bryan, in 1986, further validated the vector status of An. punctulatus s.s. and An. koliensis by dissecting mosquitoes collected from two regions of the Sepik Province of PNG. Bryan found that, in Yankok, 4.4% of An. punctulatus s.s. (n=411) and 13.2% of An. koliensis (n=334) mosquitoes were infected with WB; of these, 0% and 0.67%, respectively, contained infective L3 larvae (the WB stage that is transmitted to humans). In contrast, in the village of Yauatong, where An. punctulatus s.s. (n=647) was the only species collected, 47.3% of the mosquitoes were infected with WB and 3.4% contained the infective L3 larval stage (Bryan 1986). Another study conducted in the 1990s in the East Sepik Province of PNG supported Bryan’s findings, but the percentages of An. punctulatus s.s. (n=9,551) and An. koliensis (n=585) mosquitoes infected with WB were 7.3% and 6.3%, respectively (Bockarie et al. 1996).

Recently, Erickson et al. assessed the vectorial status of AP sibling species from three districts in the Madang Province and one district in the East Sepik Province of PNG (Erickson et al. 2013). Mosquito larvae were collected, reared in the laboratory and fed on blood with varying levels of microfilaremia collected from WB-infected human volunteers. The prevalence of the infective L3 stage of WB varied substantially between An. punctulatus species, with An. hinesorum (61.9-100%) having the highest prevalence of infective L3s, followed by An. farauti s.s. (24.5-68.6%) and An. punctulatus s.s. (4.2-23.7%). Overall, based on the current evidence, the main vectors of lymphatic filariasis in PNG are An. punctulatus s.s., An. farauti s.s. and An. koliensis. However, An. hinesorum was recently shown to be a highly competent vector and may play an important role in disease transmission across its geographic range (Erickson et al. 2013).


D. Vector control

1. History of vector control

Since Anopheles mosquitoes were implicated as the vectors of malaria in 1897, vector control methods have been implemented to decrease the human disease burden (Ross 1910). Initially, vector control measures included the destruction of larval habitats by draining or filling swamps and other water bodies, or the addition of larvicides, for example oil and Paris green, to larval habitats (Rieckmann 2006). After the discovery of DDT as an effective insecticide in the early 1940s, extensive malaria control programs were initiated as part of the Global Malaria Eradication Program in 1955. The program resulted in a remarkable reduction in malaria cases globally and the elimination of malaria in several countries, including the United States, the former Soviet Union and European countries (Mabaso et al. 2004; Najera et al. 2011). However, one major challenge encountered by this program was the emergence of insecticide resistance in Anopheles mosquitoes (Corbel & N'Guessan 2013; Kelly-Hope et al. 2008; Najera et al. 2011), which, coupled with significant logistical and financial constraints, resulted in the eradication program being abandoned, and malaria control programs being adopted, in 1969 (Najera et al. 2011; WHO 1973).

Since 1969, malaria control programs have primarily used insecticides to control vector populations. In 2007, there was a new call for the elimination of malaria that prompted the widespread distribution of long-lasting insecticide-treated bednets (LLINs) and the implementation of indoor residual spraying (IRS) campaigns throughout malarious regions, particularly Africa (2007; Roberts & Enserink 2007). These programs have been effective in decreasing the malaria burden, but their sustainability is uncertain, as insecticide resistance and behavioral adaptations have evolved in anopheline populations, limiting the effectiveness of the main vector control methods in some regions.

Insecticide resistance has been detected in more than 50 Anopheles species that transmit malaria (Hemingway & Ranson 2000) and for all four classes of insecticides currently used: pyrethroids, organochlorines, carbamates and organophosphates. All 12 insecticides in these classes are used for IRS, but only pyrethroids are used in LLINs. Unfortunately, when resistance develops to one insecticide it confers resistance to all insecticides within that class (Hemingway & Ranson 2000). This is especially troubling for pyrethroid resistance, as LLINs remain the primary vector control measure implemented globally. Today, insecticide resistance has been documented in Anopheles populations from Africa (Namountougou et al. 2012; Ranson et al. 2011), Southeast Asia (Somboon et al. 2003; Van Bortel et al. 2008) and South America (Fonseca-Gonzalez et al. 2009; Zamora Perea et al. 2009). Insecticide resistance results either from point mutations in specific genes encoding proteins targeted by insecticides or from metabolic resistance. Point mutations have been found in the voltage-gated sodium channel, conferring resistance to both pyrethroids and DDT (Martinez-Torres et al. 1998; Ranson et al. 2000), and in acetylcholinesterase (ace-1), conferring resistance to carbamates and organophosphates (Alout & Weill 2008). Metabolic resistance has also been found and results from the overexpression of detoxifying enzymes, for example cytochrome P450s and glutathione S-transferases (Djouaka et al. 2008; Nikou et al. 2003). In addition to these physiological adaptations to insecticides, behavioral resistance has been reported in several Anopheles populations: in populations that initially fed preferentially indoors (endophagic), the deployment of insecticide-treated bed nets led to a marked change in feeding behavior, and these mosquitoes now seek host blood meals outdoors (Fornadel et al. 2010; Russell et al. 2011; Sougoufara et al. 2014).

In Africa, insecticide resistance alleles seem to have appeared independently in multiple locations (Diabate et al. 2004; Pinto et al. 2007) and have rapidly spread (Dabire et al. 2009; Jones et al. 2012) due to gene flow between populations, for example between the sympatric M and S forms of An. gambiae (Clarkson et al. 2014; Diabate et al. 2003; Etang et al. 2009; Weill et al. 2000). Putative introgression of pyrethroid resistance (kdr allele) between An. gambiae and An. arabiensis has also been observed (Mawejje et al. 2013). These results are concerning, since extensive introgression has recently been observed among members of the An. gambiae complex (Fontaine et al. 2015) and it is possible that resistance alleles are being spread among other anopheline species in the complex. Furthermore, given the lack of genetic data available for non-African Anopheles species, it is possible that adaptive introgression is occurring and promoting the rapid spread of insecticide resistance in non-African Anopheles species around the world. Therefore, it is important that genomic studies be conducted for non-African anopheline species to evaluate whether introgression is occurring between sibling species, as these results could enable the implementation of more effective vector control programs.

2. Vector control in Papua New Guinea

Widespread vector control efforts in Papua New Guinea started during the Pacific campaign of World War II, when the Allies realized the substantial impact of malaria on troops (Parkinson 1974; Spencer 1992). The majority of vector control measures were implemented around military installations, and troops were instructed to use mosquito nets and repellents and to wear protective clothing. From 1942 to 1945, the Allies conducted aerial spraying of DDT and oiled and drained larval habitats to decrease disease transmission (Spencer 1992). From the late 1950s through the 1970s, the World Health Organization (WHO) implemented a program for the eradication of malaria. During this time, indoor residual spraying of DDT was conducted across PNG, covering 50% of the population (Avery 1974; Parkinson 1974; Spencer 1992). Despite the program’s success in controlling malaria, lack of funding and the recognition that elimination was not feasible resulted in spraying campaigns being abandoned in 1973 (Parkinson 1974).

Throughout the 1980s, vector control strategies were not implemented in PNG except in a few regions where local governments funded IRS campaigns. However, after a successful test of the effectiveness of insecticide-impregnated nets (ITNs) in decreasing the transmission of malaria in the Madang Province, the national government of PNG promoted the use of ITNs (Genton et al. 1995; Graves et al. 1987; Hii et al. 2001), but widespread distribution of ITNs did not occur. From 2004 to 2009, with the support of the Global Fund, the national malaria control program of PNG conducted its first round of free LLIN distribution across PNG (Hetzel et al. 2012). Despite the increase in LLIN ownership, surveys conducted after the first round of LLIN distribution revealed that, of the households that owned a bednet, only 32.5% slept under their bednets. In addition, the first round of LLIN distribution did not cover all regions of PNG.

Recently, a second round of free LLIN distribution took place between 2009 and 2013, targeting the regions of PNG that were poorly covered during the first round. Surveys conducted to assess ownership and usage of LLINs reported that, after this round, 82% of households owned a bednet and 48% of households slept under a bednet (Hetzel et al. 2014). These results are encouraging, but further public health work is required to increase the number of households that sleep under bednets, as LLINs are the main vector control method implemented in PNG. In addition to the distribution of LLINs, indoor residual spraying campaigns are also being planned in PNG (Kazura et al. 2012). Fortunately, no resistance to the pyrethroids used in LLINs has been detected in PNG yet (Henry-Halldin et al. 2012; Keven et al. 2010). However, with the recent widespread distribution of LLINs and the use of IRS in PNG, studies will need to be conducted to monitor for insecticide resistance in Anopheles populations. It is also important that we develop a better understanding of the biology and ecology of Anopheles populations to understand how insecticide resistance alleles could spread in AP populations.

E. Genomic data available for Anopheles

In 1998, as insecticide resistance threatened the success of vector control programs around the world, the international Anopheles genome project began with the goal of assembling the genome of the main disease vector in Africa, An. gambiae. This project was an international collaboration between Genoscope, Celera Genomics, the Institute for Genomic Research (TIGR) and the National Institutes of Health; funding was provided by the National Institute of Allergy and Infectious Diseases, the WHO, and the French Ministry of Research. By 2002, the first draft genome assembly of An. gambiae, consisting of 278 Mb of assembled DNA sequence, was published (Holt et al. 2002) (Table 2).

When my research project began in the fall of 2010, only the genome of An. gambiae had been sequenced and little was known about the genome sequences of any other Anopheles species. Consequently, novel insights into the biology of Anopheles mosquitoes, including mechanisms of insecticide resistance (Djouaka et al. 2008; Edi et al. 2014; Mitchell et al. 2014; Nikou et al. 2003; Ranson et al. 2004), identification of genes involved in vector immunity (Blandin et al. 2009; Fraiture et al. 2009; Menge et al. 2006; Osta et al. 2004; Riehle et al. 2006) and vector-parasite interactions (Alout et al. 2014; Felix et al. 2010), have predominantly been discovered in An. gambiae. However, as the cost of genome sequencing decreased and high-throughput sequencing platforms were developed, sequencing and assembling the genome of a species was no longer attainable only by large genome consortia but also came within the grasp of independent laboratories. Between 2013 and 2014, the genomes of three Anopheles species were assembled: An. darlingi from South America (Marinotti et al. 2013), An. stephensi from India (Jiang et al. 2014), and An. sinensis from China (Zhou et al. 2014). Recently, in 2015, as part of the Anopheles 16 genomes project, a collaboration between the Broad Institute and the University of Notre Dame, the genomes of 16 Anopheles species from South America (n=1), Africa (n=7), the Middle East (n=1), Europe (n=1), Southeast Asia (n=5) and the Southwest Pacific (n=1) were assembled and putatively annotated (Neafsey et al. 2015) (Table 2).

Table 2. Anopheles genomes assembled

Anopheles species      Publication date   Genome size (Mb)   Geographic distribution     Reference
An. gambiae            2002               278                Africa                      Science. 298 (5591): 129-49
An. darlingi           2013               201                South America               Nucleic Acids Res 41(15):7387-400
An. stephensi          2014               221                India/Middle East           Genome Biol. 15 (9): 459
An. sinensis           2014               221                Southeast Asia              BMC Genomics. 15: 448
Anopheles 16 genomes project                                                             Science. 347 (6217):1258522
An. albimanus          2015               171                Central and South America
An. arabiensis         2015               247                Africa
An. atroparvus         2015               224                Europe
An. christyi           2015               173                Africa
An. coluzzi            2015               225                Africa
An. culicifacies       2015               203                Southeast Asia
An. dirus              2015               216                Southeast Asia
An. epiroticus         2015               224                Southeast Asia
An. farauti            2015               183                Southwest Pacific
An. funestus           2015               225                Africa
An. maculatus          2015               142                Southeast Asia
An. melas              2015               224                Africa
An. merus              2015               288                Africa
An. minimus            2015               202                Southeast Asia
An. quadriannulatus    2015               284                Africa
An. sinensis           2015               376                Southeast Asia
An. stephensi          2015               225                India/Middle East

One member of the An. punctulatus group in PNG, An. farauti s.s., was assembled by the Anopheles 16 genomes project. However, the other main disease vectors of the AP group (An. punctulatus s.s., An. hinesorum, An. farauti 4 and An. koliensis) have not been sequenced. Genomic studies of the An. punctulatus group are pertinent, as widespread vector control measures are being conducted in PNG and the biology of this cryptic species group is poorly understood.

F. Dissertation Aims

Anopheles mosquitoes are the vectors of devastating diseases such as malaria and lymphatic filariasis. Since the 1940s, the main method to control malaria has been to implement vector control measures. Intense vector control campaigns were conducted in the 1940s during World War II, coinciding with the discovery of DDT as an insecticide. Given the success of these vector control measures, the Global Malaria Eradication Program was initiated in 1955 with the goal of eradicating malaria. However, by 1969 this program was abandoned due to growing insecticide resistance in Anopheles vectors and logistical problems (Najera et al. 2011; WHO 1973). Since 1969, malaria control programs have been implemented to decrease the burden of disease by deploying ITNs and conducting IRS campaigns, but they are challenged by the occurrence and spread of insecticide resistance in Anopheles populations.

Today, a new call has been made for the elimination of malaria, resulting in the widespread distribution of ITNs and IRS campaigns (2007; Roberts & Enserink 2007). However, insecticide resistance is still a challenge, and resistance has been reported in Africa (Namountougou et al. 2012; Ranson et al. 2011), Southeast Asia (Somboon et al. 2003; Van Bortel et al. 2008) and South America (Fonseca-Gonzalez et al. 2009; Zamora Perea et al. 2009). Furthermore, insecticide resistance alleles have rapidly spread across Africa as a result of gene flow between closely related, morphologically indistinguishable Anopheles species (Clarkson et al. 2014; Diabate et al. 2003; Etang et al. 2009; Weill et al. 2000). For elimination programs to be successful, it is important that we better understand the biology of Anopheles mosquitoes, including the evolutionary relationships and extent of gene flow between Anopheles sibling species and the host feeding patterns of Anopheles species. This is particularly important for the understudied, non-African Anopheles mosquitoes, for example members of the An. punctulatus group, as very little is known about their biology because few genetic loci are available and because of the difficulties associated with establishing and maintaining An. punctulatus colonies for use in laboratory studies. Additionally, it is important that the feeding patterns of members of the An. punctulatus group are evaluated to understand disease transmission in Papua New Guinea.

The purpose of my dissertation is to better understand the biology of members of the An. punctulatus group by using high-throughput sequencing methods to generate genomic data for the main disease vectors in this group. I used this data to examine the divergence, extent of introgression and blood feeding patterns of AP species and to relate how these results could be used to aid in the implementation of vector control strategies and to further understand disease transmission in PNG. To achieve these goals, my project has the following three aims:

Aim 1 – Sequence the genomes of Anopheles punctulatus disease vectors and examine the divergence between sibling species

1. Sequence and assemble the first mitochondrial and nuclear genomes for the main disease vectors in PNG.
2. Examine the evolutionary history and date the divergence of the AP group, and other Anopheles species, using mitochondrial genome sequences.
3. Reconstruct a robust phylogeny of four AP species using nuclear genome assemblies.

Aim 2 – Investigate whether introgression is occurring between An. punctulatus species

1. Examine genome-wide patterns of divergence and DNA sequence diversity among the assembled nuclear genomes of AP species, generated in Aim 1, to determine if contemporary gene flow is occurring between sibling species.
2. Evaluate the demographic history of An. punctulatus s.s. and An. farauti 4.

Aim 3 – Examine the host feeding patterns of An. punctulatus species

1. Develop a targeted high-throughput sequencing technique to comprehensively examine the blood meals of Anopheles mosquitoes and test the robustness of the method to differentiate amongst mammalian species.
2. Identify mammalian host(s), and number of human individuals, fed on by An. punctulatus species.
3. Evaluate if subpopulations of An. punctulatus s.s. exist that display distinct feeding patterns.

II. Chapter 2

Genome Sequencing of Anopheles punctulatus Sibling Species and Examination of Divergence among Species

Portions of the contents in Chapter 2 are published in the following manuscripts:

Logue K, Chan ER, Phipps T, Small ST, Reimer L, Henry-Halldin C, Sattabongkot J, Siba PM, Zimmerman PA, Serre D. 2013. Mitochondrial genome sequences reveal deep divergences among Anopheles punctulatus sibling species in Papua New Guinea. Malar J. doi: 10.1186/1475-2875-12-64.

Logue K, Small ST, Chan ER, Reimer L, Siba PM, Zimmerman PA, Serre D. 2015. Whole-genome sequencing reveals absence of recent gene flow and separate demographic histories for Anopheles punctulatus mosquitoes in Papua New Guinea. Mol Ecol. 24:1263-1274.

A. Introduction

Little is known about the biology of most Anopheles mosquitoes except for An. gambiae, the primary malaria vector in Africa, for which an extensive amount of genetic information has been generated, including genome sequences (Holt et al. 2002; Lawniczak et al. 2010; Neafsey et al. 2015) and a catalogue of genetic polymorphisms (Lee et al. 2014; Neafsey et al. 2010; Turissini et al. 2014; Wilding et al. 2009). Utilizing this genetic data, a wealth of biological knowledge has been acquired, including insights into the genetic basis of adaptive phenotypes: identification of alleles conferring insecticide resistance (Edi et al. 2014; Mitchell et al. 2012; Riveron et al. 2013; Riveron et al. 2014), detection of gene flow among An. gambiae mosquitoes (Clarkson et al. 2014; Fontaine et al. 2015; Weetman et al. 2014), and identification of two species undergoing sympatric speciation in the An. gambiae complex, An. gambiae s.s. and An. coluzzi (Lawniczak et al. 2010; Lee et al. 2013; Neafsey et al. 2010; Turner et al. 2005). This data has also enabled the generation and analysis of gene-expression data that has provided insights into, for example, host blood preference: RNA-seq revealed that specific olfactory receptors are expressed in mosquitoes that are attracted to humans (Carey et al. 2010; Rinker et al. 2013). Unfortunately, as genome assemblies are not available for other Anopheles mosquitoes, we have limited understanding of the biology of non-African Anopheles mosquitoes.

The lack of genetic information for Anopheles mosquitoes is, at least partially, due to the difficulties associated with genome sequencing and assembly. The genome of An. gambiae was assembled using traditional Sanger sequencing methods that require the cloning and sequencing of large, overlapping DNA fragments. These methods are very expensive and labor intensive, and can only be conducted by large genome consortia. For example, using these techniques it cost 2.7 billion dollars and took 13 years to assemble the first human genome. However, advances in sequencing technologies, particularly high-throughput sequencing of short DNA fragments, and in genome assembly algorithms have enabled the cost-effective generation of genome sequences. For comparison, today the cost of generating the sequence of a human genome is as little as one thousand dollars and sequencing can be completed in ~24 hours (on an Illumina MiSeq; times vary depending on the sequencing platform). Recently, high-throughput sequencing techniques have been used to assemble the genomes of several non-model organisms, including the giant panda (Li et al. 2010), collared flycatcher (Ellegren et al. 2012), bats (Zhang et al. 2013), and others (Guo et al. 2013; Heliconius Genome 2012; Qiu et al. 2012; Yim et al. 2014; You et al. 2013). However, at the start of this study, high-throughput sequencing techniques had not been applied to Anopheles mosquitoes.

Besides the lack of genomic sequences, the current understanding of the Anopheles phylogeny (the relationships among species as well as the times at which they diverged from each other) also remains limited. Studies of this nature are complicated in Anopheles by the existence of species groups and complexes (including morphologically identical sibling species) (Bryan 1974; Proft et al. 1999; Scott et al. 1993; Wilkerson et al. 1995), sympatric speciation and the potential evolution of incipient species (della Torre et al. 2001; Favia et al. 1997; Guelbeogo et al. 2005). However, determining the evolutionary relationships among Anopheles species has important clinical and vector control implications, as it could clarify whether traits required for transmission of human blood-borne pathogens, avoidance of long-lasting insecticide-treated nets or insecticide resistance evolved only once in an ancestral population or, alternatively, whether different species acquired these traits independently.

The focus of this chapter, and dissertation, is on members of the An. punctulatus (AP) group, which are the principal vectors of malaria in PNG. As described in Chapter 1, historically the An. punctulatus group was separated into four species based on morphological differences in the proboscis and wings: An. punctulatus, An. koliensis, An. farauti and a rarely observed species, An. clowi (Rozeboom & Knight 1946). Later studies involving cross-mating (Bryan 1973a, b, c), allozyme analyses (Foley et al. 1993) and DNA sequence analysis (Beebe et al. 1999; Beebe et al. 1994; Beebe & Saul 1995; Cooper et al. 2002; Foley et al. 1995) provided evidence suggesting further differentiation of the AP group into 13 species, most of them morphologically indistinguishable. At least five of these species, An. punctulatus s.s., An. koliensis, An. farauti s.s., An. hinesorum and An. farauti 4, have been described as competent vectors of malaria (Benet et al. 2004; Cooper et al. 2009). Phylogenetic studies of this group have focused on DNA sequences of ribosomal RNAs (Beebe et al. 2000b; Cooper et al. 2000), mitochondrial genes (Ambrose et al. 2012; Foley et al. 1998), ITS2 (Beebe et al. 1999), and the voltage-gated sodium channel gene (Henry-Halldin et al. 2012). However, the genetic information generated in these studies has not allowed robust assessment of the AP group phylogeny and has often yielded conflicting results, which are summarized in Figure 3 (Beebe et al. 1999; Foley et al. 1998). In addition, no study has yet evaluated the relationships between AP sibling species and other Anopheles species from neighboring regions, such as species from Southeast Asia.

Here, I apply advances in high-throughput sequencing technologies to circumvent the paucity of genetic data available for An. punctulatus mosquitoes. I de novo assemble the first genomes of the main disease vectors of the AP group in PNG using wild-caught mosquitoes. Using these genomic data, I show that the divergence among AP mosquitoes is extensive and that the AP group is deeply diverged from An. gambiae and other African Anopheles species.

In addition to whole genome sequencing, I sequenced the mitochondrial genomes of 14 individual mosquitoes from the AP group and from the An. dirus complex of Southeast Asia. I employ the newly constructed nuclear and mitochondrial genome assemblies to reconstruct the evolutionary history of An. punctulatus mosquitoes and, using the mitochondrial genomes, date the divergence of AP mosquitoes.

B. Methods

1. Mosquito collections and species identification

Wild-caught An. punctulatus mosquitoes were collected by the Entomology unit of the Papua New Guinea Institute of Medical Research (PNGIMR) as previously described (Henry-Halldin et al. 2012; Henry-Halldin et al. 2011). An. dirus samples from Thailand were kindly provided by Dr. Jetsumon Sattabongkot (Faculty of Tropical Medicine, Mahidol University). I extracted genomic DNA (gDNA) from individual mosquitoes using DNeasy blood and tissue kits (Qiagen®) according to the supplemental protocol for purification of insect DNA with one modification: each mosquito was placed in a 2 mL tube with a 5 mm steel bead and 180 μL of PBS and then homogenized by high-speed shaking at 15 Hz for 90 seconds using a Qiagen TissueLyser II instrument. The species identity of each sample was determined using a PCR-based assay targeting species-specific polymorphisms in the ribosomal internal transcribed spacer unit 2 (ITS2) (Henry-Halldin et al. 2011). For whole genome sequencing, I selected individual mosquitoes that had a total DNA yield of ~2 µg.

2. Whole genome sequencing and de novo assembly of mitochondrial genomes

a. Whole genome sequencing of five An. punctulatus mosquitoes

I sequenced the whole genome of five mosquitoes (An. punctulatus s.s. (n=1), An. koliensis (n=1), An. farauti s.s. (n=1), and An. farauti 4 (n=2)). I sheared 2 µg of genomic DNA from each individual mosquito into 250 to 300 bp fragments using a Covaris S2 instrument and prepared sequencing libraries using the New England Biolabs (NEB) NEBNext® kit protocol and standard Illumina paired-end adaptors. I sequenced each library on one lane of an Illumina GAIIx or Illumina HiSeq 2000 instrument to generate 80 to 300 million paired-end reads of 51 or 100 bp from each sequencing run (Table 3) (see Appendix A for accession numbers).

Table 3. Sample sequencing information. List of samples sequenced in this study, their sequencing method, read length and the number of paired-end reads generated.

| Sample | Seq Method | Sequencer | Read length | Number of read pairs |
|---|---|---|---|---|
| An. punctulatus s.s. | Whole Genome^a | Illumina GAIIx | 51 | 39,399,308 |
| An. farauti s.s. | Whole Genome | Illumina GAIIx | 51 | 44,281,276 |
| An. farauti 4 | Whole Genome | Illumina HiSeq 2000 | 100 | 122,694,078 |
| An. farauti 4 | Whole Genome | Illumina HiSeq 2000 | 100 | 150,366,415 |
| An. koliensis | Whole Genome | Illumina GAIIx | 58 | 37,073,523 |
| An. punctulatus s.s. | Multiplex^b | Illumina HiSeq 2000 | 100 | 11,764,434 |
| An. punctulatus s.s. | Multiplex | Illumina HiSeq 2000 | 100 | 17,419,685 |
| An. punctulatus s.s. | Multiplex | Illumina HiSeq 2000 | 100 | 13,082,863 |
| An. punctulatus s.s. | Multiplex | Illumina HiSeq 2000 | 100 | 10,495,346 |
| An. hinesorum | Multiplex | Illumina HiSeq 2000 | 100 | 20,253,194 |
| An. koliensis | Multiplex | Illumina HiSeq 2000 | 100 | 15,527,845 |
| An. dirus s.s. | Multiplex | Illumina HiSeq 2000 | 100 | 13,399,633 |
| An. dirus s.s. | Multiplex | Illumina HiSeq 2000 | 100 | 13,431,487 |
| An. cracens | Multiplex | Illumina HiSeq 2000 | 100 | 11,527,398 |

^a Whole genome sequence of one individual mosquito
^b All multiplex sequences were given a unique barcode and pooled together for sequencing on one lane of a flow cell

b. De novo assembly of mitochondrial genomes

To identify reads from the mt genome and separate them from reads originating from the nuclear genome, I aligned all reads generated from a single sample, using Bowtie (Langmead et al. 2009), onto the cytochrome oxidase I (COI), cytochrome oxidase II (COX2), and voltage-gated sodium channel (VGSC) gene sequences previously generated for each species. As expected based on the copy number difference between mt and nuclear genomes, the sequence coverage of the mitochondrial genes, COI and COX2, was 50-60 fold greater than the coverage of the nuclear gene, VGSC (Figure 4). In fact, the ~500X coverage of mtDNA implied that multiple identical reads mapped to the exact same nucleotide position along the entire mt genome sequence. I therefore hypothesized that most reads occurring twice or more were likely to originate from the mt genome and selected these reads (regardless of their sequence) for reconstructing the complete mtDNA sequence using ABySS (Simpson et al. 2009). I performed all assemblies using ABySS with a k-mer size of 29 and C=70. I then aligned the assembled contigs on the mtDNA of An. gambiae with MUMmer (Kurtz et al. 2004) and filled any remaining gaps by PCR and Sanger sequencing (Table 4). To identify possible artifacts or assembly errors, I aligned all reads generated from a given sample to the final mtDNA contig (assembled from a subset of these reads). Where necessary, I replaced any base in the contig that differed from the nucleotide carried by a majority of the sequencing reads. Note that I initially attempted to assemble the mt genomes of AP sibling species by aligning DNA sequencing reads to the mt genome of An. gambiae. However, this was unsuccessful due to the extensive divergence between the mt genomes of AP sibling species and that of An. gambiae.

Figure 4. Coverage of whole genome sequencing reads on mitochondrial and nuclear genes. Coverage per base pair of whole genome sequencing read pairs aligned to two mitochondrial genes (COI and COX2) and one nuclear gene (VGSC). Each color represents one of the aligned read pairs. The numbers in the center of each bar represent the actual coverage per base pair.
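The majority-rule correction of the assembled contig described in this subsection can be illustrated with a minimal Python sketch. This is not the script used in this study: the function name `majority_correct` and the simplified input (ungapped reads given with 0-based start positions on the contig) are assumptions made purely for illustration.

```python
from collections import Counter


def majority_correct(contig, aligned_reads):
    """Replace contig bases that disagree with a strict majority of covering reads.

    aligned_reads: iterable of (start, sequence) pairs; start is 0-based and
    alignments are assumed to be ungapped (a simplification for this sketch).
    """
    columns = [Counter() for _ in contig]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(contig):
                columns[pos][base] += 1

    corrected = []
    for ref_base, counts in zip(contig, columns):
        if counts:
            base, n = counts.most_common(1)[0]
            # Only overwrite when more than half of the covering reads disagree.
            if base != ref_base and 2 * n > sum(counts.values()):
                corrected.append(base)
                continue
        corrected.append(ref_base)
    return "".join(corrected)
```

Positions without any read support are left unchanged in this sketch; a real pipeline would also need to handle indels and base qualities.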

3. Targeted sequencing of mitochondrial genomes of An. punctulatus mosquitoes

a. Long range primer design for targeted sequencing of mitochondrial genomes

I used a multiplex approach to simultaneously sequence the mitochondrial genomes of the remaining nine individual mosquitoes (Figure 5). First, I designed primers to amplify any Anopheles mt genome in seven overlapping long range PCRs. I generated an Anopheles consensus sequence by aligning the mt genomes of An. punctulatus s.s., An. farauti s.s., An. gambiae, An. quadrimaculatus and An. darlingi with ClustalW (Chenna et al. 2003) and masking any variant sites.

Table 4. Primers used to amplify mitochondrial genomes by long range PCR, to fill in gaps between assembled contigs from whole genome sequencing of An. punctulatus s.s. and An. farauti s.s. (51 bp reads), and to amplify the control region (A+T rich region) of several mitochondrial genomes. Nucleotides in red indicate variable sites among Anopheles mosquitoes based on mitochondrial DNA alignments.

| Primer Name | Sequence | Length (bp) |
|---|---|---|
| Long Range PCR of Mitochondrial genomes | | |
| 133_3048F | AAAAAGATAAGCTAATTAAGCTATTGG | 27 |
| 133_3048R | TAAAGGAGAAGAACTATCTTGTAATCC | 27 |
| 3032_5516F | AACATGAGCAAATTTAGGATTACAAG | 26 |
| 3032_5516R | AAATTATTTAGTCCTTGTGATTGGAAG | 27 |
| 5507_7060F | TATATGTGACTTCCAATCACAAGGAC | 26 |
| 5507_7060R | ATATTGATTTGTGGTGTCAATGATATG | 27 |
| 5952_10891F | GAAATTCACCCATATTTTAGGGTAATAG | 28 |
| 5952_10891R | CAAATCCTCCTCAAATTCATTG | 22 |
| 10771_13066F | TTAACAATAGCAACAGGATTTTTAGG | 26 |
| 10771_13066R | AAAATATAATTAAAGGACGAGAAGACC | 27 |
| 12814_14383F | TAAAAATAACTCTTAATCCAACATCGAG | 28 |
| 12814_14383R | CGGTGTTTTAGTCTATTTAGAGGAATC | 27 |
| 14234_745F | TACTTAAATATAAACTGCACCTTGACC | 27 |
| 14234_745R | TTATTGCTAATAAAATTCATCCTAAATG | 28 |
| Amplification of gaps from Illumina GAIIx whole genome sequencing | | |
| mtAF7aF | CGCAGTAGCTGGCACAAAT | 19 |
| mtAF7aR | TCCTTTTTATCAGGCAATTCA | 21 |
| mtAP2aF | TGCCGAATTCTTCATTAAAACC | 22 |
| MtAP2aR | AAAACACCGCCAAATTCTTT | 20 |
| MtAP3aF | TTGTACCTTGTGTATCAGGGTTT | 23 |
| MtAF5aF | GAATTAGAAGACCATCCAGCAA | 22 |
| MtAF6aF | GCGGCCCTTTAAATTTCAGT | 20 |
| Amplification of Control Region | | |
| APsibsCtrlF | CGCAGTAGCTGGCACAAAT | 19 |
| APCtrlR | AACCCTTTTATCAGGCAATTCA | 22 |
| APsibsCtrlR | TCCTTTTTATCAGGCAATTCA | 21 |
| AF1CtrlF | CCATTTGTATAACCGCAGTAGC | 22 |
| AF1CtrlR | TTTCATGATTTACCCTATCAAGG | 23 |
| ADBCtrlR | TCCTTTTTATCAGGCAATCCA | 21 |

[Figure 5 diagram. Workflow steps, in order: Sample 1, Sample 2, ... Sample N; PCR amplification; pooling of PCR products; library preparation with unique barcode; pooling of barcoded libraries from different samples; sequencing; assignment of reads to each sample using barcode sequences; de novo assembly; mapping of reads and base calling.]

Figure 5. Multiplex sequencing method. Diagram of the steps used to amplify and sequence multiple mitochondrial genomes simultaneously on one lane of an Illumina HiSeq 2000 instrument after amplification by long range PCR. First, I performed long-range amplification of the mitochondrial genome of each sample using seven overlapping primer pairs spanning the Anopheles mt genome. After amplification, I pooled the amplicons from each sample in equal concentrations and added Illumina adaptors and a unique 6-nucleotide barcode. All samples were then pooled together and sequenced on an Illumina HiSeq 2000. After sequencing, I independently de novo assembled the mt genome of each sample and corrected any sequencing errors by aligning the reads back to each assembled mt genome and validating the nucleotide determination at each position across the mt genome.

I then designed primers based on this consensus sequence using Primer3 (Rozen & Skaletsky 2000) following the recommended primer specifications described in the Roche Expand Long Range dNTPack® kit protocol. I was able to design primers at overlapping sites with two or fewer variants and without known variants in the last 3’ positions. For each variant site, the allele that had the highest frequency among Anopheles was incorporated into the primer sequence (Table 4).
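The primer-site criteria above (a majority-allele consensus, at most two variant sites per primer, and no variants at the 3' end) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually run with Primer3: the function names are hypothetical, the check applies to forward-strand primers only, and the number of protected 3'-terminal bases (`guard=3`) is an assumed value that the text does not specify.

```python
from collections import Counter


def consensus_and_variants(aligned_seqs):
    """Return the majority-allele consensus and a per-column 'is variable' flag."""
    consensus, variable = [], []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        consensus.append(counts.most_common(1)[0][0])  # most frequent allele
        variable.append(len(counts) > 1)
    return "".join(consensus), variable


def primer_site_ok(variable, start, length, max_variants=2, guard=3):
    """Check a forward-strand primer window against the variant-site criteria.

    guard is the assumed number of 3'-terminal bases that must be invariant.
    """
    window = variable[start:start + length]
    return sum(window) <= max_variants and not any(window[-guard:])
```

For a reverse primer the 3' end lies at the start of the window on the forward strand, so the guard would have to be applied to `window[:guard]` instead.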

b. Long range amplification and multiplex sequencing

I amplified each amplicon using the Roche Expand Long Range dNTPack® kit protocol with 20-40 ng of gDNA per PCR reaction and 3% DMSO (Figure 5). Amplification conditions were as follows: a 3 minute denaturation step at 94°C; 39 cycles of 94°C for 45 seconds, 50°C for 45 seconds and 60°C for 5 minutes; and a 10 minute final elongation at 60°C. Product amplification was verified by electrophoresis on a 1% agarose gel. Following amplification, I pooled all seven PCR products from a given individual and sheared the DNA molecules into 300 bp fragments using a Covaris instrument (http://covarisinc.com). I then prepared a sequencing library for each individual mosquito using Illumina adapters including a unique 6-mer barcode. Finally, I pooled libraries from all mosquitoes in equal concentration and sequenced the resulting pool on one lane of an Illumina HiSeq 2000 instrument, generating an average of 28 million paired-end reads of 100 bp per sample (Table 3) (see Appendix A for accession numbers).

c. De novo assembly of mitochondrial genomes

I assembled the mtDNA sequence of each sample independently. Large variations in coverage among contigs prevent overlapping contigs from being collapsed, so de novo assembly requires that coverage be normalized along the mitochondrial genome. To normalize coverage and to remove reads containing sequencing errors, I first calculated the number of reads carrying each DNA sequence. Given the very high sequencing coverage obtained for each sample, any rarely observed DNA sequence likely corresponds to reads containing sequencing error(s). Therefore, I only used DNA sequences that were represented by >20 reads for de novo assembly. Additionally, to normalize coverage and facilitate assembly computations, I only used one instance of each sequencing read. Using this subset of reads, I de novo assembled each mt genome using ABySS (Simpson et al. 2009) with a k-mer size of 31. The resulting contigs were aligned to the mtDNA of An. gambiae using MUMmer (Kurtz et al. 2004). I collapsed all overlapping contigs using nucleotide ambiguity codes in the SeqMan Pro™ program in DNASTAR’s LASERGENE® Core Suite (www.dnastar.com) to produce a consensus sequence for each sample. I then mapped all reads generated from a given sample (i.e., not only the subset of reads used to generate the assembly) to its consensus sequence using BWA (Li & Durbin 2009) and validated the nucleotide determination at each position of the genome sequence using mpileup in Samtools (Li et al. 2009) and Perl scripts. All the mt genome sequences I assembled have been deposited in GenBank (see Table 5 or Appendix A for accession numbers).
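The read-selection step described above (keep only sequences observed in more than 20 reads, then retain a single instance of each) can be sketched as follows. This is a minimal illustration and not the script used in this study; the function name `select_reads_for_assembly` and the in-memory list of read strings are assumptions.

```python
from collections import Counter


def select_reads_for_assembly(reads, min_reads=20):
    """Return one copy of every read sequence observed more than min_reads times."""
    counts = Counter(reads)
    # Rare sequences are treated as likely sequencing errors and dropped;
    # abundant sequences are collapsed to a single instance to even out coverage.
    return [seq for seq, n in counts.items() if n > min_reads]
```

A real data set of this size would typically be streamed from FASTQ files rather than held in a list, and reverse-complement duplicates are not merged in this sketch.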

Table 5. List of the samples with their collection site or colony ID and corresponding NCBI accession numbers

| Species | Location | Length (bp)* | Reference | GenBank No. |
|---|---|---|---|---|
| An. punctulatus s.s. | Peneng, PNG | 15,200 | This study | JX219738 |
| An. punctulatus s.s. | Dimer, PNG | 15,198 | This study | JX219737 |
| An. punctulatus s.s. | Yagaum, PNG | 15,085 | This study | JX219739 |
| An. punctulatus s.s. | Yagaum, PNG | 14,965 | This study | JX219740 |
| An. punctulatus s.s. | Madang, PNG | 15,045 | This study | JX219744 |
| An. farauti s.s. | Madang, PNG | 15,069 | This study | JX219741 |
| An. hinesorum | Nale, PNG | 15,336 | This study | JX219734 |
| An. farauti 4 | Naru, PNG | 15,358 | This study | JX219735 |
| An. farauti 4 | Naru, PNG | 15,359 | This study | JX219736 |
| An. koliensis | Nale, PNG | 15,113 | This study | JX219743 |
| An. koliensis | Madang, PNG | 15,061 | This study | JX219742 |
| An. dirus s.s. | Thailand | 15,404 | This study | JX219731 |
| An. dirus s.s. | Thailand | 15,126 | This study | JX219732 |
| An. cracens | Thailand | 15,412 | This study | JX219733 |
| An. albitarsis | | 15,413 | Krzywinski et al 2011 | HQ335344.1 |
| An. albitarsis F | Colombia | 15,418 | Krzywinski et al 2011 | HQ335349.1 |
| An. albitarsis G | Brazil | 15,474 | Krzywinski et al 2011 | HQ335346.1 |
| An. deaneorum | Brazil | 15,424 | Krzywinski et al 2011 | HQ335347.1 |
| An. janconnae | Brazil | 15,425 | Krzywinski et al 2011 | HQ335348.1 |
| An. oryzalimnetes | Brazil | 15,422 | Krzywinski et al 2011 | HQ335345.1 |
| An. darlingi North | Belize | 15,386 | Moreno et al 2010 | GQ918272.1 |
| An. darlingi South | Brazil | 15,385 | Moreno et al 2010 | GQ918273.1 |
| An. quadrimaculatus | North America | 15,455 | Cockburn et al 1990 | L04272.1 |
| An. gambiae | G3 strain | 15,363 | Beard et al 1993 | L20934.1 |
| Cx. pipiens | Tunisia | 14,856 | Unpublished | HQ724614.1 |
| Ae. aegypti | unknown | 16,655 | Unpublished | EU352212.1 |
| Ae. albopictus | unknown | 16,665 | Unpublished | AY072044.1 |
| D. melanogaster | United States | 19,517 | Lewis et al 1995 | U37541.1 |
| D. yakuba | Ivory Coast | 16,019 | Clary et al 1985 | X03240.1 |

*The sequence length reflects the number of actual base pairs assembled (not including Ns)

4. De novo assembly of the nuclear genome of four An. punctulatus species

In addition to assembling mt genomes, I also de novo assembled the nuclear genomes of each of the four AP species sequenced above (section II.B.2.a; Table 3) using only sequencing reads originating from the nuclear genome; mt reads were filtered out as described in section II.B.2.b. I corrected the nuclear reads for sequencing errors using the program QUAKE with a k-mer size of 17 and default parameters (Kelley et al. 2010). Finally, each sample was de novo assembled using the program ABySS (version 1.3.6) (Simpson et al. 2009). I determined the optimal k-mer for each sample by testing various k-mer sizes (ranging from 17 to 71) and selecting the k-mer that produced the highest N50 value and an assembly size closest to that of An. gambiae (260 Mb). The best assemblies were obtained with a k-mer of 51 for An. farauti 4 (AF4) and An. punctulatus s.s. (AP s.s.), and 31 for An. koliensis (AK) and An. farauti s.s. (AF s.s.) (see Appendix A for accession numbers).
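As an illustration of the k-mer selection criterion, the following Python sketch computes the N50 of a set of contig lengths and picks a k-mer size from several candidate assemblies. The text does not state how N50 and assembly size were weighed against each other, so the ranking used here (highest N50 first, closeness to the target size only as a tie-breaker) is an assumption, as are the function names.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def pick_kmer(assemblies, target_size=260_000_000):
    """assemblies maps a k-mer size to the list of contig lengths it produced.

    Assumed ranking: highest N50, then total size closest to target_size.
    """
    def score(k):
        lengths = assemblies[k]
        return (n50(lengths), -abs(sum(lengths) - target_size))

    return max(assemblies, key=score)
```

In practice the contig lengths would be read from the FASTA output of each ABySS run.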

To assess coverage, I mapped the original uncorrected reads onto each assembly using Bowtie2 (version 2.1.0) (Langmead & Salzberg 2012). I then calculated the average coverage for each contig greater than 1,000 bp. I collapsed overlapping contigs by first aligning each genome assembly to itself using BLAT (Kent 2002) and then (i) keeping only the larger contig whenever one contig aligned completely to another; (ii) discarding all contigs where at least two contigs aligned to the same end of another; and (iii) merging contigs that overlapped each other by at least 500 bp and were 95% identical. When merging contigs, any nucleotide differences between them were masked. To further optimize my assemblies, I removed putative paralogous sequences by filtering out contigs with unusually high coverage (top 3%) and all contigs with less than half of the expected mean coverage. In each final assembly, I used the base corresponding to the most frequently sequenced nucleotide at each position if the coverage was >10X, and masked any position sequenced by ≤10 reads.
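The coverage-based contig filter described above can be sketched in Python as follows. This is an illustrative reconstruction rather than the code used in this study; the function name, the dictionary input and the rounding used to turn "top 3%" into a contig count are assumptions.

```python
def filter_contigs_by_coverage(coverage, expected_mean, top_fraction=0.03):
    """Return the ids of contigs kept after coverage filtering.

    coverage maps contig id -> mean read coverage. Contigs in the top
    top_fraction by coverage, and contigs below half of expected_mean,
    are removed.
    """
    ranked = sorted(coverage, key=coverage.get, reverse=True)
    n_high = round(len(ranked) * top_fraction)  # assumed rounding rule
    too_high = set(ranked[:n_high])
    return {
        cid for cid in ranked
        if cid not in too_high and coverage[cid] >= expected_mean / 2
    }
```

With 100 contigs whose coverages run from 1X to 100X and an expected mean of 40X, this sketch removes the three highest-coverage contigs and every contig below 20X.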

5. Genome assembly comparisons between Anopheles mosquitoes

a. An. punctulatus species

I aligned the assembled contigs from all four An. punctulatus species using the Threaded Blockset Aligner (TBA) program (Blanchette et al. 2004). Briefly, I used the lastz program in the TBA package to generate pairwise alignments for all assemblies and kept only alignment blocks that were at least 500 bp in length and had at least 80% nucleotide identity. Non-unique alignment blocks were discarded. Finally, a multiple alignment containing all four species assemblies was generated using the TBA program.

b. An. punctulatus species and An. gambiae complex

I conducted pairwise comparisons between the An. punctulatus species and several members of the An. gambiae complex. I aligned the sequencing reads generated for each An. punctulatus species to the genomes of An. gambiae (NCBI Accession AAAB01000001-AAAB01069724), An. arabiensis (https://www.vectorbase.org assembly: AaraD1), An. melas (https://www.vectorbase.org assembly: AmelC2), and An. coluzzii (https://www.vectorbase.org assembly: AcolM1) using the program Bowtie2 (Langmead & Salzberg 2012) with default parameters and treating paired-end reads as unpaired. I also conducted pairwise comparisons among species within the An. gambiae complex using sequencing reads from the Sequence Read Archive (SRA): An. gambiae (NCBI SRA: SRR1509742), An. arabiensis (NCBI SRA: SRR529989; SRR529992; SRR529994; SRR529995; note that all these files were concatenated together), An. melas (NCBI SRA: SRR847597) and An. coluzzii (NCBI SRA: SRR1693253).

c. An. farauti s.s. genome assemblies

To compare my An. farauti s.s. genome assembly to the Broad Institute’s recently published An. farauti s.s. assembly (Neafsey et al. 2015) (GenBank: AXCN00000000.2), I aligned the sequencing reads generated for An. farauti s.s. to the Broad’s AF s.s. assembly using Bowtie2 (Langmead & Salzberg 2012) with default parameters and treating paired-end sequencing reads as unpaired. I also compared my assembled AF s.s. contigs to the Broad assembly using the program QUAST (Gurevich et al. 2013) with default parameters.

6. Phylogenetic analysis and molecular dating of Anopheles mosquitoes using protein-coding genes of the mitochondrial genomes

a. Reconstruction of evolutionary history

Initially, I reconstructed the phylogeny of Anopheles mosquitoes using the mt genomes sequenced above (section II.B.3.c) coupled with all Anopheles mt genome sequences deposited in GenBank [35-37,43] (Table 5) along with the mt genomes of Drosophila [44,45], Aedes [GenBank: EU352212.1 and GenBank: AY072044.1] and Culex [GenBank: HQ724614.1]. Six of the 15 Anopheles mt genomes downloaded for this analysis belong to the An. albitarsis group from South America: An. albitarsis, An. albitarsis F, An. albitarsis G, An. deaneorum, An. janconnae and An. oryzalimnetes.

Since relying on the gene annotations from one of the Anopheles species (i.e., An. gambiae) may introduce systematic biases in the phylogeny, I determined the DNA sequence of each gene for each mt genome (i.e., the sequences generated here as well as those retrieved from NCBI) by comparing the Drosophila melanogaster mtDNA proteins (GenBank: U37541.1) against each sample’s mt genome using tBlastn. Briefly, for each protein-coding sequence, the DNA sequences were translated into amino acid sequences and aligned to each other, and the amino acid alignment was then reverse-translated into nucleotide sequences with TranslatorX (Abascal et al. 2010) using default parameters and the invertebrate mt genetic code. I concatenated the aligned protein-coding sequences from all 13 mt genes (resulting in 10,770 (70%) nucleotides) and determined the best model of nucleotide substitution using the program jModelTest v0.1.1 (Posada 2008). According to the Akaike Information Criterion, the best nucleotide substitution model for this data set was the General Time Reversible model with gamma distribution (GTR+G).
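The reverse-translation step, in which an amino acid alignment is threaded back onto the underlying coding sequences, can be illustrated with a short Python sketch. It shows only the general idea behind codon-aware alignment and is not TranslatorX itself; the function name `thread_codons` is hypothetical, and the sketch assumes the coding sequence is in frame and has exactly one codon per aligned residue.

```python
def thread_codons(aligned_protein, cds):
    """Map a gapped amino acid sequence back onto its ungapped coding sequence.

    Each residue consumes one codon from cds; each '-' becomes a '---' codon gap.
    """
    if len(cds) % 3:
        raise ValueError("coding sequence length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("coding sequence does not match the aligned protein")
    remaining = iter(codons)
    return "".join("---" if aa == "-" else next(remaining) for aa in aligned_protein)
```

For example, the aligned protein `M-K` with coding sequence `ATGAAA` yields `ATG---AAA`, so gaps always fall between codons and the reading frame is preserved.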

Bayesian phylogenies were reconstructed using BEAST v1.7.2 (Drummond & Rambaut 2007) with the following parameters: an uncorrelated log-normal relaxed clock, allowing for rate heterogeneity among species; the GTR+G substitution model; the SRD06 model of partitioning, which allows parameters to be estimated separately for the 1st+2nd and 3rd codon positions; and a randomly generated starting species tree with a Yule prior for tree construction. Using the above parameters, I performed three independent runs of 20 million generations each, with trees sampled every 1,000 generations. I then combined all runs after a burn-in of 10% (i.e., the first 10% of the 60 million generations were discarded to remove iterations of the Markov chain that may not be sampling the stationary distribution) using LogCombiner v1.7.2. I used Tracer v1.5 (http://tree.bio.ed.ac.uk/software/tracer/) to verify adequate mixing of the Markov chains and to ensure that each parameter had been appropriately sampled (i.e., effective sample size >200). I determined the maximum clade credibility tree using TreeAnnotator v1.7.2 and visualized the phylogenetic tree with FigTree v1.3.1 (http://tree.bio.ed.ac.uk/software/figtree/).

b. Molecular dating

I also used the program BEAST to simultaneously estimate the divergence times of Anopheles mosquitoes, using the Drosophila-Anopheles divergence as a prior normally distributed around a mean of 260 million years ago (mya) and ranging from 243 to 276 mya, as suggested by Gaunt & Miles (2002). For comparison, I also estimated divergence times using a mutation rate of 0.0115 mutations per nucleotide per million years, which was estimated from the divergence times and sequence divergence of several insect mt genomes (Brower 1994).
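To make the fixed-rate comparison concrete, the sketch below converts a pairwise sequence divergence into a divergence time under a strict clock. It is only a back-of-the-envelope illustration and not the BEAST analysis: it assumes that the rate of 0.0115 substitutions per site per million years applies to each lineage separately, so that two lineages accumulate differences at twice that rate, and it ignores multiple substitutions at the same site.

```python
def divergence_time_my(pairwise_distance, rate_per_lineage=0.0115):
    """Divergence time in million years under a strict clock: t = d / (2 * rate).

    pairwise_distance: proportion of sites differing between two sequences.
    rate_per_lineage: substitutions per site per million years (assumed per lineage).
    """
    return pairwise_distance / (2 * rate_per_lineage)
```

Under these assumptions, two mt genomes differing at 23% of sites would be dated to roughly 10 million years ago.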

7. Phylogenetic analysis of four AP species using orthologous regions of the nuclear genome

In addition to examining the evolutionary history of Anopheles using mtDNA sequences, I also used whole genome data to reconstruct the phylogeny of four of the AP species: An. punctulatus s.s., An. koliensis, An. farauti s.s., and An. farauti 4. I used three methods to determine species relationships. First, I generated a distance matrix of the total number of pairwise differences using all alignment blocks greater than 500 bp where all four species were represented (I refer to these alignment blocks as ‘loci’) (see section II.B.5.a for alignment) and built a UPGMA tree. Second, I analyzed all alignment blocks >5 kb (accounting for 12.2 Mb) using RAxML (Stamatakis 2014) with the default parameters and reconstructed the most likely phylogeny. Third, I determined the species relationships for each locus independently and reconstructed unrooted trees using the maximum likelihood approach implemented in the program PhyML (Guindon & Gascuel 2003). In PhyML, I used the following parameters: the general time-reversible model of nucleotide substitution with invariant sites, estimated nucleotide frequencies, and the approximate likelihood ratio test (aLRT) with a chi-square distribution to assess the statistical support for each tree. To robustly assess the phylogeny of AP species, I only examined loci that were >1,000 bp and whose trees were supported by an aLRT score greater than 0.9 (on a scale of 0 to 1). Note that the aLRT score reflects the likelihood that an observed branch exists, so a value of 0.9 corresponds to a 90% likelihood that the observed branch exists (see Glossary for how the aLRT test is conducted).
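The first of the three approaches, a matrix of total pairwise differences summed over alignment blocks, can be sketched in Python as follows. The sketch is illustrative only: the function name, the block representation (a dictionary of equal-length aligned strings per block) and the decision to skip any position where either sequence is not an unambiguous uppercase base are assumptions, and the UPGMA clustering step is not shown.

```python
from itertools import combinations


def pairwise_differences(blocks, min_length=500):
    """Sum pairwise nucleotide differences over alignment blocks longer than min_length.

    blocks: list of dicts mapping species name -> aligned sequence.
    Positions with gaps, Ns or other masked characters are ignored.
    """
    totals = {}
    for block in blocks:
        length = len(next(iter(block.values())))
        if length <= min_length:
            continue
        for a, b in combinations(sorted(block), 2):
            diffs = sum(
                1 for x, y in zip(block[a], block[b])
                if x != y and x in "ACGT" and y in "ACGT"
            )
            totals[(a, b)] = totals.get((a, b), 0) + diffs
    return totals
```

The resulting dictionary of species pairs and difference counts is the kind of distance matrix from which a UPGMA tree could then be built.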

C. Results

1. Sequencing and assembly of mitochondrial and nuclear genomes

a. Mosquitoes sequenced

I sequenced the mitochondrial genomes of 14 individual Anopheles mosquitoes representing 7 species. These include 11 individuals from the AP group in Papua New Guinea (An. punctulatus s.s. (n=5), An. koliensis (n=2), An. farauti s.s. (n=1, formerly An. farauti 1), An. hinesorum (n=1, formerly An. farauti 2), An. farauti 4 (n=2)) and 3 samples from the An. dirus complex in Thailand (An. dirus s.s. (n=2, An. dirus species A) and An. cracens (n=1, An. dirus species B)) (Table 5).

b. Assembly of mitochondrial genomes

For five of these samples, representing four AP species, I sequenced the whole genome, generating ~39-150 million paired-end reads of 51 or 100 bp and an average of 500X coverage of the mtDNA (Table 3 and Figure 4). I then de novo assembled each individual mitochondrial genome. For the remaining nine samples, I amplified the entire mt genome in seven overlapping PCR products, pooled and barcoded the PCR products from each sample with a unique 6-nucleotide tag, and finally sequenced the PCR products from all samples simultaneously on one lane of an Illumina HiSeq 2000. I generated 392,983,105 paired-end reads of 100 bp, resulting in an average of 28 million reads per sample and 188,000X coverage of each mitochondrial genome. Finally, I de novo assembled each genome separately.

In all newly sequenced genomes, the genes and gene organization (i.e., orientation and order) were identical to those of previously sequenced Anopheles mt genomes (Beard et al. 1993; Cockburn et al. 1990; Krzywinski et al. 2006; Krzywinski et al. 2011; Moreno et al. 2010).

c. Sequencing and de novo assembly of the nuclear genome of four wild-caught mosquitoes of the An. punctulatus group

I sequenced the genome of four individual mosquitoes, An. punctulatus s.s., An. koliensis, An. farauti s.s. and An. farauti 4, from Papua New Guinea and generated ~37-170 million read pairs, resulting in 34X to 131X coverage (Table 3). Initially, I attempted to assemble the genome of each An. punctulatus species by aligning the sequencing reads generated for each species (Table 3) to the only Anopheles genome available when I initiated the study, that of An. gambiae. I was only able to align an average of 1.25% of the DNA sequencing reads from each species to the genome of An. gambiae, indicating that the divergence between An. gambiae and species within the An. punctulatus group was too extensive to use An. gambiae as a reference for genome assembly. Instead, I de novo assembled each genome independently and initially obtained 389,743 to 927,145 contigs with a total assembly size of 193-293 Mb and an N50 of 3,691-13,469 (Table 6) (for reference, the genome size of An. gambiae is 260 Mb). Analysis of the distribution of the average contig coverage (Appendix B, A-D) revealed a bimodal distribution, with some of the contigs displaying half the expected coverage (based on the overall sequencing effort and the expected size of the Anopheles genomes). I hypothesized that these contigs included redundant sequences misassembled due to high genetic heterozygosity (i.e., each parental chromosome was assembled separately), as has previously been observed for other diploid organisms (Takeuchi et al. 2012; Zhang et al. 2012; Zheng et al. 2013). I subsequently merged these sequences together (see section II.B.4), which eliminated most of the redundancy (Appendix B, E-H). My final assemblies accounted for 62.9% to 74.0% of the reads and contained 14,407-41,925 contigs with an N50 of 4,664 to 16,229 and a final assembly size of 146.2 to 161.6 Mb (Table 7). Thus, from single wild-caught mosquitoes, I was able to de novo assemble DNA sequences representing close to two-thirds of their hypothesized genome size based on An. gambiae (~260 Mb).

Table 6. Summary of initial de novo genome assemblies before filtering out redundant contigs.

| Sample | # ctgs | Size (kb) | N50 | Median | Max |
|---|---|---|---|---|---|
| An. farauti 4 | 451,824 | 261,977 | 13,469 | 105 | 331,681 |
| An. punctulatus s.s. | 927,145 | 293,761 | 8,131 | 117 | 97,012 |
| An. farauti s.s. | 389,743 | 193,639 | 7,880 | 692 | 79,463 |
| An. koliensis | 484,838 | 204,191 | 3,691 | 311 | 76,110 |

The best assembly, based on the N50 statistic, was obtained for AF4, which was also the sample with the highest sequence coverage (Table 7). To evaluate whether the increased sequencing depth was responsible for the better assembly, I randomly subsampled the AF4 reads to a coverage similar to that obtained for AP s.s. I then used this subset of reads to de novo assemble the genome of AF4 independently. The lower sequence coverage did not reduce the assembly quality of AF4 and yielded assembly statistics similar to those obtained when using all reads (Table 8). As de novo assembly algorithms sometimes fail to collapse homologous chromosomes into a single DNA sequence because of heterozygosity, I hypothesized that the difference in assembly quality between AF4 and AP s.s. was, at least partially, due to differences in genetic heterogeneity, and I therefore predict that AF4 has a lower nucleotide diversity than the other AP species assembled here.
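The subsampling experiment can be illustrated with a short Python sketch that keeps each read pair with probability equal to the ratio of target to current coverage (for example 50X/131X for AF4). This is a minimal sketch rather than the procedure actually used; the function name, the fixed random seed and the in-memory list of read pairs are assumptions.

```python
import random


def subsample_pairs(read_pairs, current_cov, target_cov, seed=0):
    """Randomly keep read pairs so that expected coverage drops to target_cov."""
    keep_probability = target_cov / current_cov
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    return [pair for pair in read_pairs if rng.random() < keep_probability]
```

Because both mates of a pair are kept or dropped together, the paired-end structure of the library is preserved in the subsample.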

Table 7. Summary of the Anopheles punctulatus species genome assemblies after filtering out redundant contigs.

| Sample | Expected coverage* | Size (kb) | # of contigs | % assembled* | N50 | Max (bp) | Median (bp) |
|---|---|---|---|---|---|---|---|
| An. farauti 4 | 131X | 146,386 | 14,407 | 63.1 | 16,229 | 331,681 | 6,280 |
| An. punctulatus s.s. | 50X | 146,190 | 20,775 | 62.9 | 10,258 | 97,012 | 5,136 |
| An. koliensis | 39X | 151,327 | 41,925 | 74 | 4,664 | 76,110 | 2,557 |
| An. farauti s.s. | 34X | 161,555 | 27,543 | 72.3 | 8,767 | 79,463 | 3,975 |

*Based on size of An. gambiae genome (~260 Mb)

2. Genome Comparison of Anopheles mosquitoes

a. Comparison of the genome assemblies of An. punctulatus species

I assessed whether any of the AP genomes assembled could be used as a reference genome for future studies aimed at sequencing additional AP species. For this, I aligned the ~37 to 170 million sequencing reads generated from each AP species (Table 3) to the genome assembly of each of the other species. In total, I was able to align 29% to 35% of the sequencing reads generated for one species to the assembly of another (Table 9), showing that the AP species are deeply diverged from one another. These results suggest that none of the AP species genomes could be used as a reference genome for the assembly of additional AP species. Therefore, additional AP species will have to be independently sequenced and de novo assembled.

Table 8. Summary of the An. farauti 4 assembly after sub-sampling sequencing reads to the coverage of An. punctulatus s.s.

| Sample | Expected Coverage* | Size (kb) | # of Contigs | % Assembled | N50 | Max (bp) | Median (bp) |
|---|---|---|---|---|---|---|---|
| An. farauti 4 | 131X | 146,386 | 14,407 | 63.1 | 16,229 | 331,681 | 6,280 |
| An. farauti 4 sub-sampled | 50X | 149,618 | 12,971 | 66.8 | 18,553 | 331,629 | 7,271 |
| An. punctulatus s.s. | 50X | 146,190 | 20,775 | 62.9 | 10,258 | 97,012 | 5,136 |

*Based on genome size of An. gambiae (260 Mb)

In contrast to the AP species, when I aligned sequencing reads generated for several members of the An. gambiae complex to the genome assemblies of the other members of the complex, 52 to 76% of the sequencing reads aligned. For the two species undergoing sympatric speciation, An. gambiae and An. coluzzii (Lawniczak et al. 2010; Lee et al. 2013; Neafsey et al. 2010; Turner et al. 2005), I found that 76% of the sequencing reads aligned between species (Table 9). Compared to the AP sibling species, more than double the proportion of reads can be aligned within the An. gambiae complex. Despite this, the genomes of species within the An. gambiae complex still have to be de novo assembled, as these species, with the exception of An. gambiae and An. coluzzii, are too diverged for reference-guided assembly methods (Neafsey et al. 2013; Neafsey et al. 2015). However, when I aligned sequencing reads generated for the AP species to members of the An. gambiae complex, I was only able to align 1.1-2.2% of the sequencing reads (Table 9). This evidence supports my finding that the AP group is deeply diverged from the An. gambiae complex and that AP genomes therefore have to be de novo assembled.

Table 9. Percentage of DNA sequencing reads that align to the genome of species in the An. punctulatus group and An. gambiae complex. Members of the An. gambiae complex include An. arabiensis, An. gambiae, An. coluzzii and An. melas.

                 AP     AK     AF s.s.   AF4    An. arabiensis   An. gambiae   An. coluzzii   An. melas

AP               63.1

AK               28.9   74.8

AF s.s.          28.8   35.1   73.0

AF4              31.7   30.7   29.5      63.3

An. arabiensis   1.5    2.0    1.9       1.6    62.4

An. gambiae      1.8    2.2    2.2       2.2    60.8             84.1

An. coluzzii     1.4    1.9    1.8       1.6    60.5             75.5          70.2

An. melas        1.1    1.7    1.6       1.1    52.3             62.5          54.0           54.0

AP: An. punctulatus s.s.; AK: An. koliensis; AF s.s.: An. farauti s.s.; AF4: An. farauti 4

Although no benchmark is currently available for the percentage of reads that must map before a species can be used as a reference genome, it appears that at least 70-75% of the sequencing reads need to align (i.e., the percentage of reads that align between the species undergoing sympatric speciation, An. gambiae and An. coluzzii).

b. Comparison of An. farauti s.s. genome assembly to An. farauti s.s. genome assembled by the Broad Institute

I compared my An. farauti s.s. genome assembly to the only other AP genome assembled, An. farauti s.s. (Neafsey et al. 2015). First, I aligned the sequencing reads generated from my wild-caught AF s.s. mosquito to the Broad’s colony-adapted AF s.s. genome assembly (assembled using multiple AF s.s. mosquitoes) and was able to successfully align a total of 70,810,782 (80.2%) of the 88,316,970 sequencing reads. In addition, I was able to align 26,947 (97.8%; accounting for 159 of the 161 Mb assembled) of the 27,543 assembled contigs to the Broad AF s.s. genome assembly. In total, based on the Broad AF s.s. genome assembly, I successfully assembled 86.6% of the An. farauti s.s. genome by sequencing a wild-caught AF s.s. mosquito. This highlights the feasibility of sequencing wild-caught mosquitoes for genome assembly.
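The percentages reported above follow directly from the read and contig counts. The short sketch below reproduces them and applies the informal 70% lower bound suggested earlier for using a genome as a reference. The function names are hypothetical, and the threshold is the rough guideline stated in the text, not an established benchmark.

```python
def percent(part, total):
    """Percentage of `total` represented by `part`."""
    return 100.0 * part / total


def usable_as_reference(pct_reads_aligned, min_pct=70.0):
    """Informal rule of thumb from the text: ~70-75% of reads should align."""
    return pct_reads_aligned >= min_pct


# Wild-caught AF s.s. reads and contigs aligned to the Broad AF s.s. assembly.
reads_aligned_pct = percent(70_810_782, 88_316_970)
contigs_aligned_pct = percent(26_947, 27_543)

# Cross-species examples taken from Table 9.
ak_vs_ap = usable_as_reference(28.9)
coluzzii_vs_gambiae = usable_as_reference(75.5)
```

With this rule of thumb, the same-species comparison and the An. coluzzii / An. gambiae pair pass, whereas every AP sibling-species pair in Table 9 falls well short.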

3. Phylogenetic analysis

a. Phylogeny of Anopheles mosquitoes using the mitochondrial genome

I determined and aligned the protein coding DNA sequences of all Anopheles mt genomes generated here, or previously sequenced, with several outgroups (Table 5). The concatenated protein coding sequence includes 10,770 nucleotides. I reconstructed a phylogenetic tree using the concatenated protein coding sequences and the Bayesian approach implemented in BEAST. Three independent runs of 20 million iterations were combined and adequate sampling of the posterior distribution of each parameter was obtained. All phylogenetic relationships were supported with posterior probabilities greater than 90%, with the exception of the position of An. gambiae (72% support) and an internal node among the AP mosquitoes (85% support) (Figure 6). The resulting phylogenetic tree highlights three monophyletic clades corresponding to the AP, An. dirus and An. albitarsis groups (Figure 6).

[Figure 6: tree image. Taxa shown: An. albitarsis F, An. janconnae, An. albitarsis G, An. deaneorum, An. albitarsis, An. oryzalimentes, An. darlingi (Belize and Brazil), An. quadrimaculatus, An. gambiae, An. cracens, An. dirus s.s., An. farauti 4, An. koliensis, An. hinesorum, An. farauti s.s., An. punctulatus s.s., Aedes albopictus, Aedes aegypti, Culex pipiens, D. melanogaster and D. yakuba. Node labels are posterior probabilities.]

Figure 6. Support of the Anopheles phylogeny using the concatenated DNA sequences of all mitochondrial protein coding genes. The values on the tree correspond to the posterior probabilities of each node.


Figure 7. Phylogenetic tree of Anopheles using the concatenated DNA sequences of all mitochondrial protein coding genes. The bars illustrate the 95% credibility intervals for the divergence times and the numbers in brackets above each node display the actual values in millions of years. The panel on the right indicates the geographic distribution of the samples: the green bar indicates mosquitoes from South America (SA), red from North America (NA), grey from Africa (AF), blue from Southeast Asia (SEA) and orange from Papua New Guinea (PNG).

These results show deep divergence between two main Anopheles lineages. One lineage includes all mosquitoes from South and Central America and is further sub-divided into the An. albitarsis complex and An. darlingi species.

The other lineage, containing all non-Central and South American Anopheles, seems to have radiated to generate first the Anopheles species currently present in North America and Africa and, from there, the Southeast Asia (SEA) and Southwest (SW) Pacific mosquitoes (Figure 7). The AP group from PNG clusters most closely with the An. dirus complex distributed across Southeast Asia (Figure 7). This tree also partially resolves the phylogeny within the AP group, with An. farauti 4 being the most divergent and An. farauti s.s. and An. hinesorum being most closely related (Figure 6).

b. Molecular dating of Anopheles using mitochondrial genome

As a result of the poor fossil record for mosquitoes (Poinar et al. 2000), few reliable calibration points exist for dating anophelines. Therefore, I estimated the divergence times among Anopheles species using the divergence time of Drosophila-Anopheles (260 mya) (Gaunt & Miles 2002). I dated the most recent common ancestor (MRCA) of all Anopheles mosquitoes to 93 mya with a 95% credibility interval ranging from 61 to 126 mya (Figure 7). From this origin, Anopheles mosquitoes seem to have rapidly diverged from each other and spread across the globe to reach SEA and the SW Pacific by ~43-87 mya (Figure 7). For more information on the origin and dispersal of AP mosquitoes see Appendix C.

Within the AP group, I observed a deep divergence among sibling species with the MRCA of the AP group dating back to 25-54 mya, roughly half as old as the MRCA of all Anopheles (Table 10). Importantly, this ancient origin of the AP group does not appear to result from a single highly divergent sibling species: most species analyzed here (An. punctulatus s.s., An. koliensis, and An. farauti 4) seem to have diverged 25-54 mya, and only An. farauti s.s. and An. hinesorum share a recent common ancestor (5-17 mya) (Figure 7). This finding is especially striking when compared with the only other sequenced Anopheles group: An. albitarsis mosquitoes are distributed across South America (from southern Brazil to Colombia) over a much larger geographic range than the AP group in PNG but share an MRCA dating back to only 6-24 mya, significantly younger than the MRCA of the AP group (Figure 7).

Table 10. Mean divergence times and 95% credibility intervals for selected nodes

MRCA                                                 Mean (mya)   95% Credibility (mya)

Drosophila / Anopheles (Calibration - 260 mya) (a)   255          [215.6-293.8]

Anophelinae / Culicinae                              145          [97.7-193.7]

Anopheles genus                                      93           [61.4-126.4]

An. dirus complex / An. punctulatus group            64           [42.7-86.5]

South and Central American Anopheles (b)             47           [26.8-72.9]

Anopheles punctulatus group                          39           [25.4-53.9]

Anopheles albitarsis complex                         14           [6.2-23.6]

An. farauti s.s. / An. hinesorum                     10           [4.7-17.2]

(a) Calibration point. The number in brackets indicates the mean value used for the calibration. The numbers in the table indicate the mean age and 95% credibility interval for this node after analysis.
(b) An. darlingi and An. albitarsis complex

Using the estimated insect mt mutation rate of 0.0115 mutations per nucleotide per million years (Brower 1994) instead of a calibration point led, overall, to more recent divergence times (Table 11).
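The effect of the assumed rate on the dates can be illustrated with a strict-clock back-of-the-envelope calculation. This is only a sketch: it assumes the rate of 0.0115 applies to each lineage, so that two lineages accumulate pairwise divergence at twice that rate, and it ignores the substitution model and priors used in the BEAST analysis. The function names and the example distance are hypothetical.

```python
RATE_PER_LINEAGE = 0.0115  # substitutions per site per million years (Brower 1994, as cited)


def divergence_time_mya(pairwise_distance, rate_per_lineage=RATE_PER_LINEAGE):
    """Strict-clock estimate: T = d / (2 * rate), with d in substitutions per site."""
    return pairwise_distance / (2.0 * rate_per_lineage)


def rescale_time(time_mya, old_rate, new_rate):
    """Re-express a date under a different clock rate (time scales as 1 / rate)."""
    return time_mya * old_rate / new_rate


# Hypothetical example: 2.3% pairwise divergence corresponds to ~1 million years.
example_time = divergence_time_mya(0.023)

# Halving the assumed rate doubles the estimated age of a node.
older_estimate = rescale_time(12.5, RATE_PER_LINEAGE, RATE_PER_LINEAGE / 2)
```

Because time scales inversely with the rate under this simple model, any overestimate of the mutation rate translates directly into divergence dates that are too recent.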

Table 11. Divergence times using the insect mitochondrial DNA mutation rate. Mean divergence times and 95% credibility intervals for selected nodes using insect mitochondrial DNA mutation rate.

MRCA                                        Mean (mya)   95% Credibility (mya)

Drosophila / Anopheles                      80.5         [59.7-104.1]

Anophelinae / Culicinae                     46.1         [37.9-55.4]

Drosophila                                  12.7         [8.2-18.2]

Anopheles genus                             29.3         [24.6-34.9]

An. dirus complex / An. punctulatus group   20.4         [16.4-24.6]

South and Central American Anopheles        14.5         [10.3-19.1]

An. punctulatus group                       12.5         [10.2-15.4]

An. albitarsis complex                      3.8          [3.0-4.8]

An. farauti s.s. / An. hinesorum            3.1          [2.0-4.5]

c. Phylogeny of four AP species using multiple nuclear loci

I compared the genome assemblies of four species, An. punctulatus s.s., An. farauti s.s., An. koliensis, and An. farauti 4, and produced a total of 47,181 four-sequence alignments (or loci) containing 82,651,073 nucleotide positions (or 31.8% of the An. gambiae reference genome) (Figure 8A and 8B). This indicates that roughly 30% of the genome of AP sibling species is conserved and likely to have similar functions across AP species. In this dataset, I identified 11,925,951 variable positions. The number of pairwise differences between species varied from 4,947,209 to 7,053,973 (6% to 8.5% divergence; Table 12), with AF s.s. and AK being most closely related. Maximum-likelihood analysis of all alignment blocks >5 kb (accounting for 12.2 Mb of DNA sequences) yielded a similar phylogeny with AF s.s. and AK grouping together.

Figure 8. Multiple-alignment statistics for An. punctulatus group. The Venn diagrams summarize (A) the number of alignment blocks >500bp generated for all species combinations, and (B) the total number of aligned base pairs represented in these blocks (in millions).

I also reconstructed a phylogenetic tree for each of the 31,312 loci >1,000 bp (see section II.B.7 for filtering criteria) using the maximum likelihood model implemented in the program PhyML (Guindon & Gascuel 2003). 30,907 (98.7%) of the trees, accounting for 70,921,162 bp, were supported with an aLRT score greater than 0.9 (scale is 0 to 1) and were further analyzed. 99.7% of these well-supported trees grouped AF s.s. and AK together (Figure 9), which was not consistent with the mitochondrial topology observed in section II.C.3.a, while 0.17% and 0.08% of the trees supported two different tree topologies, grouping AF s.s. with AP s.s. and AF4, respectively (Figure 9). Note that, for all trees, the internal branch only represented a small proportion of the total branch length, indicating an old and rapid radiation of the An. punctulatus species from a common ancestor (Figure 10), which is consistent with my observations of the mt genome phylogeny.
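With four taxa there are only three possible unrooted topologies, so the per-locus results above reduce to a tally. The sketch below shows one way to do that tally, assuming each tree has already been summarized by the pair of species it groups together and by its aLRT score. The taxon labels, function names and toy input are hypothetical and do not represent the PhyML output format.

```python
from collections import Counter

TAXA = frozenset({"AFss", "AK", "APss", "AF4"})


def quartet_topology(grouped_pair):
    """Canonical label for the unrooted four-taxon topology defined by one grouped pair."""
    pair = frozenset(grouped_pair)
    if len(pair) != 2 or not pair <= TAXA:
        raise ValueError("expected two distinct taxa from TAXA")
    other = TAXA - pair
    return " | ".join(sorted(",".join(sorted(side)) for side in (pair, other)))


def tally_topologies(grouped_pairs, alrt_scores, min_alrt=0.9):
    """Count topologies among trees whose aLRT support exceeds `min_alrt`."""
    return Counter(
        quartet_topology(pair)
        for pair, score in zip(grouped_pairs, alrt_scores)
        if score > min_alrt
    )


# Toy input: three well-supported trees and one poorly supported tree.
toy_pairs = [("AFss", "AK"), ("AF4", "APss"), ("AFss", "APss"), ("AFss", "AF4")]
toy_scores = [0.99, 0.95, 0.93, 0.40]
toy_counts = tally_topologies(toy_pairs, toy_scores)
```

Note that grouping AF s.s. with AK and grouping AP s.s. with AF4 describe the same unrooted tree, which is why the label is built from both sides of the internal branch.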

Figure 9. Consensus tree topology of An. punctulatus sibling species based on analysis of 30,907 nuclear loci, representing 71Mb of genome sequence. A phylogenetic tree was reconstructed for each nuclear locus. The topologies shown here were observed for (A) 99.7%, (B) 0.17% and (C) 0.08% of the loci.

Table 12. Number of pairwise nucleotide differences (lower diagonal) and percent divergence (upper diagonal) among the Anopheles punctulatus mosquito species sequenced.

          AF s.s.     AF4         AK          AP s.s.

AF s.s.   -           8.2%        6.0%        8.3%

AF4       6,758,632   -           8.2%        8.1%

AK        4,947,209   6,808,445   -           8.5%

AP s.s.   6,883,739   6,685,658   7,053,973   -
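The percent divergence values in Table 12 can be reproduced from the pairwise difference counts, assuming the denominator is the total number of aligned nucleotide positions (82,651,073). That denominator is an inference from the numbers rather than a stated method, and the variable names are illustrative.

```python
ALIGNED_POSITIONS = 82_651_073  # total positions in the four-species alignments

# Lower diagonal of Table 12: pairwise nucleotide differences.
PAIRWISE_DIFFERENCES = {
    ("AF s.s.", "AF4"): 6_758_632,
    ("AF s.s.", "AK"): 4_947_209,
    ("AF s.s.", "AP s.s."): 6_883_739,
    ("AF4", "AK"): 6_808_445,
    ("AF4", "AP s.s."): 6_685_658,
    ("AK", "AP s.s."): 7_053_973,
}


def percent_divergence(n_differences, n_aligned=ALIGNED_POSITIONS):
    """Pairwise differences expressed as a percentage of aligned positions."""
    return 100.0 * n_differences / n_aligned


divergence = {
    pair: round(percent_divergence(count), 1)
    for pair, count in PAIRWISE_DIFFERENCES.items()
}
closest_pair = min(PAIRWISE_DIFFERENCES, key=PAIRWISE_DIFFERENCES.get)
```

Under this assumption the computed values match the upper diagonal of the table, and AF s.s. / AK is the least diverged pair.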

Figure 10. Distribution of the internal branch length. The figure shows the distribution of the proportion of the internal branch length for each of the 30,907 trees.

D. Discussion

In this chapter, I generated the first de novo assembly of the nuclear genome of four AP species and the mitochondrial genome of 14 individual mosquitoes (representing 7 species). Note that several Anopheles genomes have recently been assembled (Marinotti et al. 2013; Neafsey et al. 2015; Zhou et al. 2014); however, only one species of the AP group, An. farauti s.s., was among them, and it was assembled from colony-adapted mosquitoes. I used these data to conduct comparative genomic analyses among assembled Anopheles genomes and to evaluate the evolutionary history of the An. punctulatus group.

1. Genome assembly and comparative analyses

There are two methods to assemble the genome of an organism: DNA sequencing reads from an organism can be (i) aligned to an already assembled genome of a closely related species (i.e., a reference genome) or (ii) de novo assembled. Given the deep divergence I observed between An. gambiae and the AP species (i.e., only ~2% of reads from the AP species aligned to the genome of An. gambiae; Table 9), I could not use An. gambiae as a reference genome for assembly. This has also been observed for the recently assembled genomes of An. darlingi from South America (Marinotti et al. 2013) and An. sinensis from China (Zhou et al. 2014), suggesting that the divergence among Anopheles mosquitoes is extensive. In fact, comparative genomic analyses revealed that none of the AP species genomes could be used as a reference genome to assemble another AP species, as only 29 to 35% of sequencing reads aligned among AP species. Therefore, any further studies aiming to generate genome assemblies for additional AP species will have to de novo assemble the genome of each species independently. This finding is particularly striking when compared to the only other Anopheles complex currently sequenced, An. gambiae, as these species are predicted to have diverged only within the past two million years (Fontaine et al. 2015) and 52 to 76% of the sequencing reads generated can be aligned between species (Table 9). In fact, two of the species, An. gambiae s.s. and An. coluzzii, are undergoing sympatric speciation and still have on-going gene flow occurring across all autosomal arms (Lawniczak et al. 2010; Lee et al. 2013; Neafsey et al. 2010).

To evaluate the feasibility of de novo assembling wild-caught mosquitoes, I compared my An. farauti s.s. genome assembly to the Broad’s An. farauti s.s. assembly that was recently published (Neafsey et al. 2015). Unlike my study, the Broad sequenced the DNA of multiple colony-adapted mosquitoes that were bred to remove heterozygosity. They did this because current de novo assembly algorithms have difficulties assembling homologous chromosomes that are highly heterozygous, as these chromosomes are not collapsed but, instead, are assembled into multiple redundant fragments (i.e., each parental chromosome is assembled separately) (Takeuchi et al. 2012; Zhang et al. 2012; Zheng et al. 2013). This is particularly problematic for Anopheles mosquitoes, as the number of heterozygous sites is 10 to 15 times greater than that found for most vertebrate species (Neafsey et al. 2013). I found that, using wild-caught mosquitoes, 86.5% of the An. farauti s.s. genome could be assembled by sequencing a single AF s.s. mosquito to a coverage of 34X (based on the genome size of An. gambiae, 260 Mb). However, I was unable to assemble 13.5% (accounting for 24 Mb) of the Broad’s AF s.s. genome assembly. This is possibly due to the lower sequencing coverage of my sample or could be the result of the difficulties associated with de novo assembly: the algorithms have difficulties collapsing very heterozygous DNA sequences (i.e., collapsing parental chromosomes). In fact, to improve my genome assemblies I had to collapse overlapping assembled contigs together, as there were numerous contigs that had half the expected coverage. Additionally, it is likely that the repetitive regions of the genome were not assembled, as long sequencing reads that span the repetitive sequence are needed to assemble these regions. The Broad used Fosmid libraries to enable contigs to be aligned into scaffolds and generate a more contiguous assembly (Neafsey et al. 2013; Neafsey et al. 2015). The results here show that sufficient DNA sequencing reads can be generated from a single wild-caught mosquito to de novo assemble its genome in the absence of a reference genome. Additionally, my results suggest that expensive and labor-intensive sequencing strategies such as Fosmid libraries may not be needed to produce quality genome assemblies. However, Fosmid libraries do enable more contiguous genome assemblies than the genome assembly techniques I applied in this study.

2. Molecular dating of Anopheles species using mitochondrial genome

The deep divergence observed between AP species during genome assembly is further supported by my analysis of the mitochondrial genome, which suggests that the different AP sibling species diverged from each other 25-54 mya, much earlier than proposed in previous studies of the AP group (Ambrose et al. 2012; Beebe et al. 2000b). In fact, the MRCA of the AP group is estimated to be approximately four times older (25-54 mya) than the only other Anopheles complex for which mitochondrial genome sequences are available, An. albitarsis (6-24 mya), and almost half as old as the MRCA of all Anopheles mosquitoes.

Since my analysis relies on a single non-recombining locus, I cannot rule out the possibility that the estimates are influenced by the action of natural selection. However, when only Anopheles mt genomes are analyzed, there is little evidence for deviation from a clock-like model of evolution, which suggests that nucleotide substitutions occur at a similar rate on each lineage. Therefore, if natural selection is driving the evolution of the mt genome in Anopheles, it is likely to have acted in a similar manner on all lineages and consequently is unlikely to bias the molecular dates significantly. An additional possible complication is that the phylogenetic tree inferred from mt sequences differs from the actual species tree: since this analysis is based on a single locus, I cannot rule out that incomplete lineage sorting and introgression have led to a phylogenetic reconstruction that does not represent the true evolutionary pathways of the species studied (Nichols 2001; Pamilo & Nei 1988). The long internal branches separating species, coupled with the very short branches separating individuals from the same species, indicate that incomplete lineage sorting of ancestral polymorphisms is unlikely to affect this phylogeny (Degnan & Rosenberg 2006). Ruling out introgression will require genetic data on multiple unlinked loci (i.e., the nuclear genome), an analysis that can now be conducted using the genomic data generated here. This analysis will be important, as there is limited evidence that suggests gene flow may be occurring among AP mosquitoes in southern PNG (see Chapter 3 for more details on my examination of introgression among AP species).

Absolute dates estimated from molecular data should be considered cautiously, especially since these estimates rely on a single calibration point. Note, however, that the divergence estimates between An. quadrimaculatus and non-American Anopheles, as well as between An. farauti s.s. and An. hinesorum, are similar to those of previous studies (Ambrose et al. 2012; Krzywinski et al. 2006; Moreno et al. 2010). When the estimated mutation rate for insect mtDNA was used (Brower 1994), the divergence dates obtained were significantly younger than those obtained using a calibration point (see Table 11). This mutation rate, while widely used (see, e.g., Ambrose et al. 2012), was originally calculated using closely related species (the maximum divergence time used in the study was 3 mya) and, Brower noted, probably overestimated the actual mutation rate. A slower mutation rate would push estimated divergence dates back in time, closer to the estimates obtained using calibration points.

3. Evolutionary relationship of An. punctulatus species

Both the mitochondrial and nuclear phylogenies suggest that AP species rapidly diverged from each other and have been independently evolving since (25 to 54 mya based on molecular dating using mt genome sequences; Figure 7). In fact, analysis of 83 Mb of nuclear sequencing data reveals that the AP species are 6 to 8.5% diverged from each other. However, the mt and nuclear tree topologies reveal different species relationships within AP mosquitoes. For example, the mt phylogeny shows that An. farauti s.s. and An. punctulatus s.s. are most closely related, while the nuclear phylogeny shows that An. farauti s.s. and An. koliensis are most closely related. One possible explanation for this difference is that the mitochondrial genome is a single, non-recombining locus that represents a single, maternal history, while the nuclear genome consists of thousands of independent loci, each representing a different evolutionary history. I therefore suggest that the species relationships observed using the nuclear genome, which are consistent with previous, though not statistically well supported, phylogenies (Beebe et al. 2000b; Beebe et al. 1999; Foley et al. 1998), recapitulate the true relationships of AP mosquitoes. This is further supported by my phylogenetic analysis of 30,907 loci, where 30,814 loci displayed the same phylogenetic relationships among AP sibling species.

4. Implications for vector control initiatives

Given the old divergence times among most AP sibling species using the mt genome and the deep divergence observed in both the mt and nuclear phylogenies, I would hypothesize that, today, AP species in the Madang province are reproductively isolated and that hybridization is unlikely to occur in nature (with the possible exception of AF s.s. and An. hinesorum (AH), which only diverged 3.7 to 11.6 mya). This potential reproductive isolation among AP sibling species is supported by cross-mating experiments suggesting that F1 hybrids between any combination of An. farauti s.s., An. hinesorum, An. koliensis and An. punctulatus s.s. are non-viable or sterile (Bryan 1973a). These results have several implications for vector control in the Southwest Pacific. First, recent results have shown the possibility of releasing genetically engineered, sterile, male mosquitoes into the environment to decrease mosquito populations (Harris et al. 2011). Based on my results, successful implementation of a similar strategy in Papua New Guinea would require independent engineering of mosquitoes from at least five highly divergent species to significantly impact the populations of the major malaria and filariasis vectors. Second, if AP mosquitoes are reproductively isolated from each other, insecticide resistance may be unlikely to spread quickly across all AP mosquitoes, as resistance would have to arise independently in each species. Note, however, that further investigations utilizing the genomic data generated here are required to definitively assess the existence of gene flow among AP sibling species (see Chapter 3).

Lastly, the ancient divergence among AP sibling species also raises important questions for the evolution of malaria transmission. AP sibling species diverged long before humans arrived in Papua New Guinea (~50,000 years ago; Lilley 1992), but several (but not all) of the sibling species are able to transmit malaria. Traits such as human blood feeding preference or the ability to transmit human malaria parasites must therefore have evolved independently on each of the AP lineages in the last fifty thousand years.

III. Chapter 3

Using Genomic Data to Investigate if Introgression is Occurring Between An. punctulatus Sibling Species

Portions of this chapter are published in the following manuscript:

Logue K, Small ST, Chan ER, Reimer L, Siba PM, Zimmerman PA, Serre D. 2015. Whole-genome sequencing reveals absence of recent gene flow and separate demographic histories for Anopheles punctulatus mosquitoes in Papua New Guinea. Mol Ecol. 24:1263-1274.

A. Introduction

There are more than 500 Anopheles species worldwide that are often organized in sibling species complexes. The African An. gambiae complex has been extensively studied and revealed a complex history of ancient introgression, sympatric speciation and on-going gene flow among sibling species (Clarkson et al. 2014; Crawford et al. 2012; Fontaine et al. 2015; Lawniczak et al. 2010; Lee et al. 2013; Neafsey et al. 2010; Neafsey et al. 2015; Riehle et al. 2011; Weetman et al. 2014). Unfortunately, most non-African Anopheles species have been less well studied; however, genome assemblies are now available and will enable genetic studies to be conducted on Anopheles mosquitoes in the Southwest Pacific (see Chapter 2).

Papua New Guinea (PNG) has some of the highest rates of malaria (WHO 2013) and lymphatic filariasis (Bockarie & Kazura 2003) in the world. There are at least 38 species of Anopheles recognized in PNG and the Southwest Pacific, classified in 5 complexes or groups (Beebe 2013). Among them, the 13 species of the An. punctulatus group account for the majority of the Anopheles mosquitoes and include the main vectors of malaria and lymphatic filariasis. Many of these species are morphologically indistinguishable (Bryan 1973c; Cooper et al. 2002; Foley et al. 1993) and have large overlaps in their geographical distributions. While these species differ in their larval habitat preferences, with, for example, An. farauti s.s. typically using natural habitats near the coast while An. punctulatus s.s. prefers human-made inland habitats, larvae from multiple species are often found in the same habitat (Beebe & Cooper 2002). Similarly, these species show differences in feeding preferences (e.g., An. punctulatus s.s. being more anthropophilic than An. farauti s.s.), but adult Anopheles are still frequently captured together using human landing catches, light traps and barrier screens (Benet et al. 2004; Burkot et al. 2013).

Analyses of complete mitochondrial genome sequences revealed that species of the Anopheles punctulatus group diverged from each other several million years ago (Logue et al. 2013) and suggested that these species are much more distantly related (see Chapter 2) than, for example, species of the Anopheles gambiae complex in Africa (Fontaine et al. 2015), for which instances of gene flow between species have been reported (Clarkson et al. 2014; Lee et al. 2013; Weetman et al. 2014). Interestingly, recent studies have suggested that introgression may be occurring between two Anopheles punctulatus (AP) species in southern New Guinea (Ambrose et al. 2012). Analysis of multiple genetic loci would make it possible to rigorously assess contemporary gene flow among the members of the An. punctulatus group, but these analyses are problematic in most Anopheles species due to the lack of sufficient genetic data and the divergence among Anopheles species (see Chapter 2).

Understanding the amount of gene flow among sympatric Anopheles species, as well as the structure and diversity of their populations, is critical for malaria control as these demographic parameters may influence whether insecticide resistance alleles could spread between species or if they would have to arise multiple times independently. This is particularly important as a new call for the eradication of malaria (2007; Roberts & Enserink 2007) has been made, resulting in the widespread distribution of ITNs and use of IRS (Grabowsky 2008; Kelly-Hope et al. 2008; Roberts & Enserink 2007). However, insecticide resistance still poses a formidable challenge to malaria control as resistance alleles have emerged in Anopheles populations in Africa (Namountougou et al. 2012; Ranson et al. 2011), Southeast Asia (Somboon et al. 2003; Van Bortel et al. 2008) and South America (Fonseca-Gonzalez et al. 2009; Zamora Perea et al. 2009). These resistance alleles have rapidly spread across Africa (Dabire et al. 2009; Jones et al. 2012) due to gene flow between populations, for example, between the sympatric M and S forms of An. gambiae (Clarkson et al. 2014; Diabate et al. 2003; Etang et al. 2009; Weill et al. 2000). Therefore, it is crucial that we better understand the biology of Anopheles mosquitoes as renewed vector control strategies are implemented.

In Chapter 3, I will use the genomic data generated for the AP species in Chapter 2 to evaluate the possibility of contemporary introgression among four AP species, An. punctulatus s.s. (AP s.s.), An. farauti s.s. (AF s.s.), An. koliensis (AK) and An. farauti 4 (AF4), using phylogenetic and population genetic approaches. I also examine the genome-wide distribution of heterozygosity for AP s.s. and AF4, the two species sequenced at the highest coverage, to reconstruct the demographic history of their respective species.

B. Methods

1. Identification of putative introgression candidates using sequence divergence between orthologous regions of the nuclear genome

In Chapter 2, I assembled the genomes of four AP species (AP s.s., AK, AF s.s. and AF4) using high-throughput sequencing techniques. These genomic data enable detailed analyses of whether introgression has occurred between species. Since I only have genome assemblies from a single individual mosquito per species (see section II.B.4), I am unable to employ population genetic analyses that rely on population data to detect introgression. However, I can still detect loci indicative of introgression by examining the divergence between species across their genomes. To identify these loci, I searched for trees (using the trees reconstructed in section II.B.7) in which very short branch lengths separated two of the four species. For each tree, I calculated the ratio of the two shortest adjacent external branches to the sum of all four external branches. I then considered, as candidate loci for introgression, any trees for which the ratio was in the 0.01 quantile (i.e., the lowest 1%) of the cumulative distribution and identified a total of 306 trees (Figure 11). I determined the gene annotation for each locus by aligning it against the NCBI nucleotide database. I annotated loci where the best alignment was to an An. gambiae transcript and was at least 70 percent identical.

Figure 11. Identification of putative introgression candidates based on sequence divergence. The figure shows the distribution of the proportion of the total external branch length accounted for by the two neighboring external branches for each of the 30,907 trees. The red line represents my cut-off at the 0.01 quantile of the cumulative distribution.
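The branch-length screen described in this section can be sketched as follows. The sketch assumes that "adjacent" external branches are the two branches leading to species grouped together in the four-taxon tree, and that the candidate set is the lowest 1% of ratios. The data structures, function names and toy values are hypothetical.

```python
def external_branch_ratio(external_lengths, grouped_pairs):
    """Shortest pair of adjacent external branches divided by the sum of all four.

    external_lengths: dict mapping each species to its external branch length.
    grouped_pairs: the two pairs of species grouped together in the tree.
    """
    total = sum(external_lengths.values())
    shortest_pair = min(
        external_lengths[a] + external_lengths[b] for a, b in grouped_pairs
    )
    return shortest_pair / total


def lowest_quantile_indices(ratios, quantile=0.01):
    """Indices of the trees whose ratio falls in the lowest `quantile` fraction."""
    n_keep = max(1, int(len(ratios) * quantile))
    order = sorted(range(len(ratios)), key=ratios.__getitem__)
    return sorted(order[:n_keep])


# Toy tree in which AF s.s. and AK are separated by very short branches.
toy_lengths = {"AFss": 0.01, "AK": 0.01, "APss": 0.04, "AF4": 0.04}
toy_ratio = external_branch_ratio(toy_lengths, [("AFss", "AK"), ("APss", "AF4")])
```

A low ratio flags loci where two species are unusually similar relative to the rest of the tree, which is the signature expected if a segment was recently exchanged between them.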

2. Identification of Single Nucleotide Polymorphisms (SNPs) in orthologous regions of the An. punctulatus s.s. and An. farauti 4 Genomes

To obtain a first genome-wide perspective on the genetic diversity of AP s.s. and AF4, I mapped all sequencing reads from AP s.s. and AF4 onto their corresponding contig assembly using Bowtie2 (Langmead & Salzberg 2012) (treating paired-end reads as unpaired). I restricted my analyses to loci ≥ 1,000 bp (see section II.B.7 for information on loci) and only called SNPs at positions sequenced at high quality (Phred quality score, Q ≥ 20) and high coverage (80-200X in AF4 and 25-80X in AP s.s.; Figure 12). I considered a site to be variable if >20% of the reads carried the minor allele (Figure 13). I defined as a shared polymorphism any position that was variable in both AP s.s. and AF4 and with the same two alleles (see Appendix A for location of SNP data). Overall, I identified 318,375 SNPs in AP s.s. and 164,081 SNPs in AF4.
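The filtering rules above can be summarized in a short sketch. It assumes that the quality filter is applied to individual base calls before coverage is counted and that the minor allele is the second most frequent base at the site; neither detail is specified in the text. The function names and inputs are hypothetical, and the default coverage window is the one given for AP s.s.

```python
from collections import Counter


def call_site(bases, quals, min_q=20, cov_range=(25, 80), min_minor_frac=0.20):
    """Return the sorted pair of alleles if the site is called variable, else None."""
    kept = [base for base, qual in zip(bases, quals) if qual >= min_q]
    low, high = cov_range
    if not low <= len(kept) <= high:
        return None  # coverage outside the accepted window
    top_two = Counter(kept).most_common(2)
    if len(top_two) < 2:
        return None  # only one allele observed
    (major, _), (minor, n_minor) = top_two
    if n_minor / len(kept) > min_minor_frac:
        return tuple(sorted((major, minor)))
    return None


def is_shared_polymorphism(alleles_a, alleles_b):
    """Shared polymorphism: variable in both species with the same two alleles."""
    return alleles_a is not None and alleles_a == alleles_b


# Toy pileups: 30 reads at Q30, with 10 reads carrying the minor allele.
het_site = call_site("A" * 20 + "G" * 10, [30] * 30)
hom_site = call_site("A" * 29 + "G", [30] * 30)
```

For AF4, the same function would be called with `cov_range=(80, 200)` to match the coverage window used for that species.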


Figure 12. Distribution of the sequence coverage per locus. The histograms show the average coverage (x-axis, in reads per base pair) of each locus assembled for (A) An. farauti 4 and (B) An. punctulatus s.s.

3. Number of shared polymorphisms expected under different population histories.

I calculated the expected number of shared polymorphisms between AP s.s. and AF4 under different demographic models using the coalescent simulation program ms (Hudson 2002) and the orthologous regions aligned between AF4 and AP s.s. (see section II.B.5.a). I estimated the scaled population mutation rate, θ, from the observed heterozygosity and the population recombination rate, ρ, using mlRho (Haubold et al. 2010). I simulated 22,394 loci (i.e., the number of loci analyzed for AF4 and AP s.s. after filtering for loci with > 1,000 bp and aLRT > 0.9, see section II.B.7) and assigned to each locus a θ value corresponding to the value observed in one of my alignment blocks (see section II.B.5.a). I used two different models in my simulations, one without gene flow and one with varying amounts of gene flow. The first model represented two species that diverged in the past from a common ancestor and afterwards evolved independently (i.e., the isolation model with no gene flow between diverged species; see Sousa & Hey 2013 for a review). For this isolation model, I performed 1,000 simulations for each locus.

Figure 13. Distribution of reference allele frequencies. The histograms show the proportion of reads carrying the reference allele (x-axis, in percent) at each nucleotide position for (A) An. farauti 4 and (B) An. punctulatus s.s. The red box indicates the cut-off values used to call a nucleotide position heterozygous.

The second model investigated the consequences of gene flow on the number of shared polymorphisms between the two species. For this model, I simulated each locus 10,000 times with different values of gene flow (4Nem) and time since the most recent gene flow (in 4Ne generations). For each simulation, I randomly assigned a number of lineages originating from the introgressing population (i.e., 4Nem) by drawing from a uniform distribution between 1 and 5; the time since the most recent gene flow was randomly drawn from a uniform distribution between 0 and 0.5 (scaled in 4Ne generations). In all simulations, I modeled gene flow as continuous for either 100 or 1,000 generations and at a constant rate. For simplicity in presentation I chose to scale the results using a mutation rate of 2.8 × 10⁻⁹ per base pair per generation (Ambrose et al. 2012; Dixit et al. 2014; Keightley et al. 2014) and assuming 12-20 generations per year as found for Anopheles gambiae and Aedes aegypti (Barbosa et al. 2011; Beserra et al. 2006; Chandre et al. 2000).

For each simulation, I used a modified version of the software msstats (github.com/rossibarra/msstats) to calculate the total number of shared polymorphisms and the total number of unique polymorphisms within each species across all loci. I then grouped simulations into bins according to the random values of the parameters used for 4Nem (binned in 1.0 increments) and most recent gene flow (binned in 0.1 increments). Each bin typically contained ~500-1,000 simulations. I then calculated p-values by evaluating the number of simulations producing more shared polymorphisms than observed in the actual data.
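The binning and counting step can be sketched as below. This is an illustrative reading of the procedure, not the script used here: the tuple layout of each simulation, the function names and the toy values are assumptions, and the sketch only reports, per bin, the fraction of simulations with more shared polymorphisms than observed.

```python
import math
from collections import defaultdict

def bin_key(m, t, m_width=1.0, t_width=0.1):
    """Bin a simulation by its 4Nem value and its time since last gene flow."""
    return (math.floor(m / m_width), math.floor(t / t_width))

def fraction_exceeding(sims, observed):
    """For each (4Nem, time) bin, return the fraction of simulations whose
    number of shared polymorphisms is greater than the observed number."""
    tallies = defaultdict(lambda: [0, 0])  # bin -> [n exceeding, n total]
    for m, t, n_shared in sims:
        tally = tallies[bin_key(m, t)]
        if n_shared > observed:
            tally[0] += 1
        tally[1] += 1
    return {key: exceeding / total for key, (exceeding, total) in tallies.items()}

# Hypothetical simulations: (4Nem, time in 4Ne generations, shared polymorphisms).
sims = [(1.5, 0.13, 900), (1.2, 0.17, 300), (1.9, 0.11, 500), (4.2, 0.45, 100)]
print(fraction_exceeding(sims, observed=467))
```

A bin in which nearly all simulations exceed the observed count corresponds to a parameter combination that the data argue against.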

4. Characterization of the historical demography of An. farauti 4 and An. punctulatus s.s.

To reconstruct the demographic history of AF4 and AP s.s. from whole genome data, I used the pairwise sequential Markovian coalescent (PSMC) model (Li & Durbin 2011). I divided the consensus DNA sequences of each species into 10 bp non-overlapping bins and marked each bin as homozygous or heterozygous based on whether the 10 bp bin contained a SNP. I used bins of 10 bp to account for the high genetic diversity observed in Anopheles, although increasing the size of the bins to 50 and 100 bp did not qualitatively change the results. To improve the accuracy of inferring historical recombination events, I excluded all contig sequences shorter than 5,000 bp (41-49% of all contigs, 11-19% of all bases), resulting in a total of 117-130 Mb of DNA sequence analyzed. Note that in this analysis, each contig is treated as a separate accession (i.e. contigs are not artificially concatenated). I used the recommended parameters, only adjusting for the higher heterozygosity and recombination rate in Anopheles compared to humans. The settings of the piece-wise constant and maximum time (-p and -t options) were adjusted to maintain >20 ancestral recombination events in each time epoch (Li & Durbin 2011). Thus, max time was set to 15 coalescent units for AP s.s. and 20 coalescent units for AF4 and the number of piece-wise parameters was left as default. The ratio of θ to ρ was set to 1, based on preliminary estimation using mlRho (Haubold et al. 2010). To estimate variance, I applied a bootstrapping approach by splitting genomic sequence into smaller segments and then randomly sampling these segments with replacement (Li & Durbin 2011). I performed a total of 100 bootstrap tests for each of the samples run in PSMC.
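The input preparation described above (dropping contigs shorter than 5,000 bp and flagging each 10 bp bin as homozygous or heterozygous) can be sketched as follows. This is a hedged illustration rather than the code used for the analyses: the function names are hypothetical, SNP positions are assumed to be 0-based, and the letters "K" (heterozygous bin) and "T" (homozygous bin) are used as placeholders in the style of PSMC input files.

```python
def keep_long_contigs(contig_lengths, min_len=5000):
    """Exclude contigs shorter than min_len base pairs."""
    return {name: n for name, n in contig_lengths.items() if n >= min_len}

def bin_heterozygosity(contig_len, snp_positions, bin_size=10):
    """Return one letter per non-overlapping bin: 'K' if the bin contains
    at least one SNP, 'T' otherwise. A trailing partial bin is dropped."""
    n_bins = contig_len // bin_size
    flags = ["T"] * n_bins
    for pos in snp_positions:
        idx = pos // bin_size
        if idx < n_bins:
            flags[idx] = "K"
    return "".join(flags)

# Hypothetical 50 bp stretch with SNPs at four positions.
print(bin_heterozygosity(50, [3, 27, 29, 49]))
```

Two SNPs falling in the same bin (positions 27 and 29 here) produce a single heterozygous bin, which is why smaller bins are preferable when diversity is high.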

As the PSMC method relies on the genomic distribution of heterozygous sites, it can only be used when both alleles are called with high confidence. I assessed the influence of sequence coverage on the PSMC results by performing two separate runs for the AF4 mosquito: one with my original coverage (131X) and a second sequence that was down-sampled to the same coverage as AP s.s. (51X).

All PSMC results were plotted in R statistical software using the R package ggplot2. Unscaled results in θ and pairwise sequence divergence were calculated according to the PSMC manual (github.com/lh3/psmc). Results were scaled using a mutation rate of 2.8 × 10⁻⁹ per base pair per generation (Ambrose et al. 2012; Dixit et al. 2014; Keightley et al. 2014) and assuming 12-20 generations per year (Beserra et al. 2006; Chandre et al. 2000).

C. Results

1. Phylogenetic analysis does not provide any evidence of introgression

I used the phylogenetic trees constructed in section II.C.3.C to determine whether introgression is occurring between AP species. I examined the branching patterns of 30,907 phylogenies, constructed using the program PhyML (Guindon & Gascuel 2003), that were >1,000 bp in length and were supported by an aLRT score of >0.9. The trees meeting these criteria accounted for 70,921,162 bp (see section II.C.3.C) (Figure 14). The majority of these trees had the expected branching pattern that represented the genome average of the species (Figure 14A). Recent (i.e. contemporary) introgression of a given locus from one species to another would result in a tree with unusually short branches between these two species (Figure 14B). I therefore calculated, for each tree, the proportion of the total branch length accounted for by the two most closely related species. I then selected, as putative introgression candidates, the 1% of loci with the shortest distances (proportionally) between the two closest tips (see section III.B.1) and identified 306 loci (Figure 14B). These 306 loci accounted for 490,052 bp (or 0.19% of the genome) and contained 172 predicted protein-coding genes (Appendix D). Most of these introgression candidates had short branches separating AF s.s. and AK (302 of 306 trees). This observation could indicate that introgression preferentially occurred between these two species which, based on my previous phylogenetic analysis (see section II.C.3.c), are the most closely related. However, it is also possible that my analysis identified loci at the tail of the distribution of the tree topologies rather than true biological outliers (Figure 15) (e.g. loci that had slower mutation rates than the genome average that affected all four species similarly). Further investigations of these candidates revealed that the number of nucleotide differences between AF s.s. and AK was proportional to the number of differences between AF4 and AP s.s. (Figure 16), indicating that these trees with short branches separating the two closest species were likely the results of unusual substitution rates affecting the entire tree, for example selection or low mutation rate, rather than only two of the four branches as I would expect with recent gene flow. Overall, my phylogenetic analyses of a third of the genome of four An. punctulatus species did not reveal any evidence of recent introgression among them.

Figure 14. Representation of branching patterns observed for 30,907 phylogenies reconstructed using PhyML (Guindon & Gascuel 2003). (A) Branching patterns that are consistent with the evolutionary history, or genome average, observed for the majority of trees and (B) for putative introgression candidates where two adjacent branches are shorter than expected. The number of trees observed for each branching pattern is displayed below each tree.

Figure 15. Distribution of the external branch lengths. The density plot shows the proportion of each tree’s total length accounted for by the length of the external branches separating AK and AF s.s. (y-axis, in percent) and by the length of the external branches separating AF4 and AP s.s.

2. Analyses of shared polymorphisms between An. farauti 4 and An. punctulatus s.s. do not support recent gene flow between these species

In addition to divergence, gene flow also influences the patterns of diversity and, in particular, can lead to shared DNA polymorphisms between species. Here, I restricted my analysis of genetic diversity to the AF4 and AP s.s. mosquitoes as the lower sequence coverage in the other species hampered my ability to reliably call SNPs. Out of 51,610,847 orthologous nucleotides sequenced at > 20X in both AF4 and AP s.s., I identified 164,081 (0.32%) heterozygous sites in the AF4 mosquito and 318,375 (0.62%) in the AP s.s. mosquito (Table 13). Subsampling of the AF4 data to the same coverage as the AP s.s. mosquito did not qualitatively change my findings (Table 14) suggesting that the higher heterozygosity in AP s.s. was not caused by differences in sequence coverage but reflected genuine biological differences. These results confirmed my hypothesis that AF4 is less diverse than AP s.s., as I suggested would explain the higher N50 value obtained for AF4 compared to the AP s.s. assembly. Overall, across more than 50 Mb of DNA sequences, only 467 nucleotide positions were polymorphic for the same alleles in both AP s.s. and AF4 (Table 13).

Figure 16. Distribution of nucleotide differences between species. The plot shows the number of nucleotide differences of each introgression candidate locus between An. farauti 4 and An. punctulatus s.s. (y-axis) and An. farauti s.s. and An. koliensis (x-axis). Note that the three outliers in the bottom right corner are driven by one disproportionally long branch length in either AF s.s. or AK (not by two short branches) and are not indicative of introgression (but may indicate positive selection on one of the four lineages).

Table 13. Summary of the genetic diversity in Anopheles farauti 4 and Anopheles punctulatus s.s.

           Length                 Heterozygous sites   Heterozygous sites   Divergence          # of shared
                                  in AF4               in AP s.s.                               polymorphisms
Total      51,610,847             164,081 (0.32%)      318,375 (0.62%)      4,056,848 (7.86%)   467
Per locus  2,305 [1,000-17,080]   7.32 [0-86]          14.22 [0-140]        181.2 [22-1,551]    0.02 [0-3]
Per Kb     NA                     0.007 [0-0.086]      0.014 [0-0.14]       0.18 [0.02-1.55]    2.1E-5 [0-0.003]

[] represent the minimum and maximum values observed

At least three non-exclusive mechanisms could generate shared polymorphisms: incomplete lineage sorting, gene flow, or sequencing errors. To determine how many shared alleles would be expected due to incomplete lineage sorting, I simulated the evolutionary history between two deeply diverged species, AF4 and AP s.s., under a coalescent model and calculated the resulting number of shared polymorphic sites. In all simulations, the sum of shared sites was always 0, indicating that, under a constant population size model without gene flow, I would not expect any shared sites. To discriminate between gene flow and sequencing errors, I first compared the distribution of these shared polymorphisms throughout the genome: if the shared polymorphisms resulted from recent gene flow, I would expect them to be clustered in specific loci (i.e. those that have been recently exchanged). Instead, shared polymorphisms were distributed throughout the entire genome sequence, with an average of 0.02 shared polymorphisms per locus and no locus carried more than three shared polymorphisms (Table 13). This observation suggested that many of the shared polymorphisms could be due to sequencing errors.

Table 14. Summary of the genetic diversity in An. farauti 4 estimated after sub-sampling An. farauti 4 reads to the coverage of An. punctulatus s.s.

           Length                 Heterozygous sites   Heterozygous sites   Divergence          # of shared
                                  in AF4               in AP s.s.                               polymorphisms
Total      50,812,770             196,731 (0.38%)      312,996 (0.62%)      3,969,250 (7.81%)   492
Per locus  2,301 [1,000-17,020]   8.91 [0-112]         14.17 [0-138]        179.7 [14-1,543]    0.02 [0-2]
Per Kb     NA                     0.009 [0.0-0.1]      0.014 [0.0-0.14]     0.18 [0.01-1.5]     2.2E-05 [0-2.0E-3]

[] represent the minimum and maximum values observed

To estimate the maximum amount of gene flow consistent with the number of shared polymorphisms observed (under the conservative assumption that all of these resulted from gene flow), I determined the number of shared polymorphisms that would be expected with various amounts of gene flow between AF4 and AP s.s. I simulated under the coalescent two populations of constant size that, at a given time and continuing for 1,000 generations, have a specific amount of gene flow (see section III.B.3). I varied both the amount of gene flow and the most recent time of gene flow and determined which parameter values were consistent with the number of shared sites observed in my data. I binned the simulations based on the coalescent time since the last gene flow event (in 4Ne generations) and the rate of gene flow per generation (in 4Nem). This analysis enabled me to identify introgression parameters that can be excluded given the observed number of shared polymorphisms (Figure 17).

Figure 17. Amount and age of gene flow that can be excluded given the number of shared polymorphisms observed. The figure shows the probability that introgression would lead to significantly more shared polymorphisms than observed based on the amount of gene flow (y-axis, in 4Nem) and time since the last gene flow (x-axis, in 4Ne generations). The white surface in the graph represents combinations of parameters that are incompatible with the data observed. Note that this analysis assumes that all shared polymorphisms were genuine (i.e. not sequencing errors) and therefore likely overestimates possible gene flow.

For example, I could exclude gene flow involving more than one individual per generation (4Nem=2) if it occurred more recently than 0.13 × 4Ne generations ago (~11,000 years ago, assuming 12 generations per year), as these parameters produced significantly more shared polymorphisms than I observed in the data (in white). Overall, my analyses suggested that there was no significant contemporary gene flow between AF4 and AP s.s. and that the small number of shared polymorphisms observed likely resulted from ancient gene flow or, perhaps more likely, from sequencing errors.

3. Characterizing the demographic history of An. punctulatus s.s. and An. farauti 4

I used the pairwise sequential Markovian coalescent (PSMC) model described in Li & Durbin (2011) to estimate the demographic history of AP s.s. and AF4. Due to inherent uncertainty in estimating mutation rates and generation time for Anopheles mosquitoes, I present the results as θ and pairwise sequence divergence (Figure 18A; the results scaled in effective population size (Ne) and generations are shown in Figure 18B). My analyses showed that the two species had similar population sizes from 15,000,000 until about 600,000 generations in the past (1.3 mya and 50,000 years ago, respectively, assuming 12 generations per year), after which time their population histories diverged (Figure 18B). At this point, the AP s.s. population increased in size while the AF4 population initially declined before a short period of expansion followed by a final and rapid decline in population size (Figure 18A).
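The conversion from generations to calendar years is a simple division by the assumed number of generations per year, and the sketch below makes the dependence on that assumption explicit. The function name is hypothetical; the 12 and 20 generations per year are the bounds quoted in the Methods.

```python
def generations_to_years(generations, gens_per_year):
    """Convert a time in generations to years for an assumed generation rate."""
    return generations / gens_per_year

# Time points taken from the text; 20 and 12 generations per year bracket the estimate.
for gens in (15_000_000, 600_000):
    low, high = (generations_to_years(gens, g) for g in (20, 12))
    print(f"{gens:,} generations: {low:,.0f} to {high:,.0f} years")
```

With 12 generations per year the two time points correspond to 1.25 million and 50,000 years, in line with the values quoted above; with 20 generations per year they shrink to 750,000 and 30,000 years.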

Bootstrap analysis showed little variance associated with all population size estimates except for the most recent time periods, as expected due to the limited number of recent coalescent events that can be inferred from a single genome sequence (Li & Durbin 2011). Note also that estimates of population sizes in the very recent past (<30,000 generations ago) might be influenced by the fragmentation of my genome assembly (see Chapter 2): a small proportion of the contigs analyzed might be shorter than the maximum identical by descent (IBD) tracts expected, increasing the statistical uncertainty of very recent estimates. More importantly, PSMC has been shown to be sensitive to false heterozygosities identified in sequence data. I thus down-sampled AF4 to the same coverage as AP s.s. to test whether the differences in population history between AP s.s. and AF4 could be artificially caused by the differences in sequence coverage and my ability to correctly identify SNPs. The down-sampled AF4 sequence contained more heterozygous sites than the original sequence, reflecting the difficulty in accurately identifying polymorphisms with lower coverage, but the overall pattern of the AF4 population history remained similar and distinct from that of AP s.s. (Figure 19).

Figure 18. Demographic history of An. farauti 4 and An. punctulatus s.s. The figure shows the estimated historical variations in effective population size for AF4 (green) and AP (red) based on PSMC analyses. The composite green (AF4) and red (AP) lines represent 100 bootstrap replicates. (A) Unscaled, the y-axis represents the population mutation rate (in log10 of θ), and the x-axis represents the log10 pairwise sequence divergence. (B) Scaled, the y-axis represents the effective population size (in log10 of Ne) and the x-axis represents the pairwise sequence divergence (in log10 generations), assuming a mutation rate of 2.8 × 10⁻⁹ per base pair per generation (Keightley et al. 2014). Note that this mutation rate is from Drosophila and may not accurately reflect the life history of Anopheles mosquitoes.

Figure 19. Influence of sequence coverage on the estimates of the demographic history of An. farauti 4. The figure shows the estimated historical variations of the effective population size of AF4 based on the entire dataset (131X coverage, in green) or the subsampled dataset (51X coverage, in red). The composite green and red lines represent 100 bootstrap replicates.

D. Discussion

1. No evidence of contemporary gene flow among An. punctulatus sibling species

In this chapter, I use the previously de novo assembled genomes of four important malaria vectors from the Anopheles punctulatus group of PNG (see Chapter 2) to rigorously test for recent (i.e. contemporary) introgression among these species. Determining whether gene flow is currently occurring between species is critical for designing efficient vector control strategies as any gene flow may, for example, increase the spread of insecticide resistance alleles across mosquito populations. Population genetic studies provide an opportunity to test whether gene flow has occurred in the history of the studied populations and can reveal even low amounts of gene flow between two populations. For example, a recent study reported evidence of gene flow between two AP mosquito species, An. farauti s.s. and An. hinesorum, in Southern New Guinea (Ambrose et al. 2012). However, as this introgression was detected using only a mitochondrial locus, it is impossible to exclude the possibility that the mosquitoes studied were sterile hybrids that would not contribute to the genetic diversity of either species. My analyses of thousands of independent loci in four of the main AP species did not reveal any evidence of recent gene flow. In particular, the analyses of genetic diversity throughout a large fraction of the genome of an AP s.s. and an AF4 mosquito indicate that, if any gene flow occurred between these two species, it was extremely rare or happened in the distant past. In fact, my simulations of the number of shared polymorphisms expected under different population histories showed that migration of more than one individual per generation during the last 0.13 × 4Ne generations (11,000 years assuming 12 generations per year) could be statistically excluded given my data (Figure 17). While estimates of mutation rates and generation times are still very crude, my study suggests that there is no contemporary gene flow occurring among these An. punctulatus species in this location of PNG and that, from a vector control perspective, these species can be considered to be reproductively isolated.

It is, however, important to note the limitations of my population-genetic analyses. First, my models assume a constant population size, no population structure and no selection. In this regard, the few genetic studies conducted in An. punctulatus mosquitoes have shown limited fine-scale population stratification in mainland PNG (Henry-Halldin et al. 2012; Logue et al. 2013; Seah et al. 2013). In addition, I would expect positive selection (e.g. for insecticide resistance) to increase the signal of gene flow, as the introgressed alleles would provide higher fitness in the presence of insecticides and therefore would be more likely to spread through the populations. Overall, the data suggest that violations of my model assumptions are unlikely to qualitatively change my findings. Second, my analyses are based on a single mosquito per species. In this regard, it is important to emphasize that my analyses, as they are based on numerous independent loci, investigate possible gene flow throughout the entire genealogy and therefore capture a large part of the history of the population. In addition, since the different mosquitoes were collected in the same geographic area, the probability of detecting any gene flow among them should be increased. However, it is possible that, if there are important genetic differences between populations from various locations of PNG or the Southwest Pacific (i.e. if subpopulations of AP species exist), gene flow might occur in other regions of PNG or the Southwest Pacific. This could, for example, explain the differences between my findings and the observations of putative introgression between An. hinesorum and An. farauti s.s. in Southern New Guinea (Ambrose et al. 2012). Analysis of additional samples from across Papua New Guinea and involving multiple individuals per species would enable this hypothesis to be tested rigorously.

2. Discordant demographic histories of An. farauti 4 and An. punctulatus s.s.

Very little is known about the population size, organization and history of the different Anopheles species in PNG. Since I sequenced wild-caught mosquitoes (as opposed to colony mosquitoes that are bred to reduce their genetic heterozygosity; see Chapter 2), I was able to characterize the genome-wide patterns of genetic diversity in AP s.s. and AF4 by cataloguing differences between the two homologous chromosomes carried by each individual. While it may be assumed that both species shared a similar demographic history until ~52,000 years ago (630,957 generations assuming 12 generations per year), they then diverged and underwent independent variations in population sizes: the AP s.s. population continuously expanded in size, while the AF4 population fluctuated before a recent and dramatic decrease in size. One possible explanation for these observations is the arrival of humans in PNG ~50,000 years ago (Lilley 1992), which might have favored AP s.s. mosquitoes that can breed more readily in transient bodies of water (e.g. water containers, bowls) than AF4 (Charlwood et al. 1986; Cooper et al. 2002). Alternatively, I can speculate that AP s.s. may be more suited to feed on humans than AF4 and that, after the arrival of humans, AP s.s. would have had more hosts available for blood meals (Beebe et al. 2013 and references within). If this were the case, I would also predict that AP s.s. would be more impacted by bed nets and insecticides. However, it is also possible that the apparent differences in population size are in fact caused by population structure rather than genuine changes in the effective population size of AF4. For example, in humans, population divergence can cause an increase in effective population size (Li & Durbin 2011). The final population “crash” in AF4 could therefore possibly reflect the population size of the small subpopulation analyzed. Additional sampling from other areas of PNG would enable me to differentiate between these scenarios.

One limitation of my analysis is that the time estimates presented here are scaled by the Drosophila mutation rate (Keightley et al. 2014), and assume 12-20 generations per year as shown for An. gambiae and Aedes aegypti, respectively (Beserra et al. 2006; Chandre et al. 2000). This may not be an accurate representation of the actual life history of AP mosquitoes in PNG given the divergence between An. gambiae and the AP group (Logue et al. 2013) (Chapter 2), as the length of each group’s life cycle (e.g. An. gambiae larvae may mature quicker than AP larvae) and gonotrophic cycle (i.e. how often the mosquitoes feed and lay eggs) may differ. However, regardless of scaling, the results of my analyses reveal that, historically, these species have responded differently to environmental changes.

3. Implications for disease transmission and implementation of vector control strategies in Papua New Guinea

My genomic analyses have important implications for ongoing and future vector control strategies implemented in PNG. The lack of detectable gene flow suggests that insecticide resistance alleles are unlikely to spread across species. This could be encouraging for vector control in this region, as it may considerably slow down the spread of insecticide resistance: the resistance alleles will need to arise independently in each species. Currently, there are no standing variants for insecticide resistance reported among AP mosquitoes (Henry-Halldin et al. 2012; Keven et al. 2010) but the widespread deployment of long lasting insecticide treated bed nets and indoor residual spraying campaigns (Hetzel et al. 2014; Kazura et al. 2012) in PNG may rapidly change this situation.

My analyses of the population sizes and their dynamics in AF4 and AP s.s. are also interesting from a vector control perspective. The distinct demographic histories observed between AP s.s. and AF4 reveal that, historically, these species have responded differently to environmental changes. This suggests that these species may respond differently to current environmental perturbations such as climate change or, more importantly, the deployment of vector control measures. Additionally, since the effective population size of a species directly influences the probability of an advantageous mutation being swept to fixation, the larger effective population size in AP s.s. suggests that any insecticide resistance or other advantageous allele would reach fixation faster than in AF4.

IV. Chapter 4

Unbiased Characterization of Host Feeding Patterns of Anopheles punctulatus Species by Targeted High-Throughput Sequencing of the Mammalian Mitochondrial 16S rRNA

Portions of this chapter are in press at PLoS Neglected Tropical Diseases:

Logue K, Keven JB, Cannon MV, Reimer L, Siba P, Walker ED, Zimmerman PA, Serre D. 2015. Unbiased Characterization of Anopheles Mosquito Blood Meals by Targeted High- Throughput Sequencing.

A. Introduction: Current and historical techniques used to evaluate host blood meals

Many insects require blood meals to complete their gonotrophic cycles. By feeding successively on different hosts, these insects can transmit blood borne pathogens that cause diseases responsible for significant burden on global health (Gratz 1999; Lounibos 2002). Insects that seek human blood meals are vectors of devastating diseases such as malaria, dengue fever, sleeping sickness, filariasis, leishmaniasis, typhus and plague. Understanding the complex blood feeding behaviors of the insect vectors of these human diseases is crucial for developing and prioritizing disease control program activities and identifying potential unrecognized disease reservoirs with the aim to reduce disease transmission.

The blood meals of blood-feeding arthropods have traditionally been analyzed using serological techniques such as ELISA or precipitin tests (Beier et al. 1988; Tempelis 1975; Washino & Tempelis 1983). While these methods have provided useful information, they have limited resolution as they are generally only able to delineate hosts within blood meals at order or family taxonomic levels (Lardeux et al. 2007). In addition, since these approaches test for the presence of a protein from a specific organism, they only test for absence/presence of organisms that are a priori believed to be blood meal hosts. More recently, a number of PCR-based molecular techniques have been developed to characterize host blood meals (Kent 2009 and references within) and determine the blood feeding behavior of mosquitoes (Crabtree et al. 2013; Hamer et al. 2008; Kent & Norris 2005; Ngo & Kramer 2003), ticks (Allan et al. 2010; Che Lah et al. 2015; Pichon et al. 2003), sand flies (Haouas et al. 2007; Soares et al. 2014; Valinsky et al. 2014) and tsetse flies (Muturi et al. 2011; Steuber et al. 2005). While these PCR-based approaches enable rigorous identification of the host species, they typically focus on species-specific amplification of putative hosts and therefore are not designed to identify novel, unanticipated host blood sources. In addition, the detection of mixed blood meals (i.e., when an insect feeds on more than one host) by these approaches suffers from low discriminatory power as the presence of multiple host sequences can inhibit species identification using traditional sequencing techniques. These limitations bias our understanding of the transmission of many vector-borne diseases and may prevent identifying potential disease reservoirs.

Beyond the identification of the host species, it may also be important to decipher which individuals of a given species are being fed upon: for example, knowing whether an insect preferentially bites specific individuals or, in contrast, feeds on multiple individuals per night, could dramatically influence our understanding of disease transmission potential. To this end, a number of studies have used microsatellites or other polymorphic genetic markers to generate individual DNA fingerprints from human blood meals of mosquitoes (Ansell et al. 2002; Chow-Shaffer et al. 2000; De Benedictis et al. 2003; Michael et al. 2001; Norris et al. 2010; Scott et al. 2006) and lice (Mumcuoglu et al. 2004; Replogle et al. 1994). However, interpretation of these data can become complicated when DNA from more than one individual is present in a single blood meal.

The blood feeding patterns of species within the Anopheles punctulatus group have been poorly characterized as only two studies, conducted in the 1980s, have examined the blood hosts fed on (Burkot et al. 1988; Charlwood et al. 1985). When these studies were conducted there were only 5 known species within the AP group; since then, a total of 8 additional morphologically identical species have been detected using allozyme (Foley et al. 1993) and DNA sequencing techniques (Beebe et al. 1999; Beebe et al. 1994; Beebe & Saul 1995; Cooper et al. 2002; Foley et al. 1995). This makes it difficult to interpret the previous results, as the blood feeding patterns observed could be representative of multiple morphologically identical species. Based on these studies it appears that the AP species are generalists with regard to their feeding patterns; however, An. punctulatus and An. koliensis were observed to be the most anthropophilic while An. farauti preferred non-human hosts (Burkot et al. 1988; Charlwood et al. 1985). Given the important caveats of these studies, it is important that the blood feeding patterns of AP species be examined as this information may aid in understanding disease transmission in Papua New Guinea.

In this chapter, I describe a novel approach employing high-throughput sequencing technology to analyze the blood meal composition of individual mosquitoes in an unbiased manner. I first amplify DNA extracted from a single female mosquito using universal primers targeting the mammalian mitochondrial (mt) 16S rRNA genes. Following primer-based individual barcoding, PCR products from up to 96 mosquitoes are pooled and then simultaneously sequenced using Illumina high-throughput sequencing methods. I also used the same approach to interrogate whether individual mosquitoes had fed on more than one person by sequencing the human mt hypervariable region I. I applied this approach to 442 mosquitoes captured in five villages of Madang Province of Papua New Guinea and provide evidence that Anopheles punctulatus (AP) mosquitoes (i) feed on a variety of mammalian species including several previously unanticipated hosts and (ii) frequently feed on multiple mammalian hosts. I also show how my assay can be easily customized to examine the number of human individuals fed on by a single mosquito. The results of this chapter show the potential of my approach to identify blood meal hosts from a broad range of blood feeding arthropods, enabling the identification of potential disease reservoirs, and how this approach can be used to understand disease transmission by medically important vector species.

B. Methods

1. Ethics

Approval for this study was obtained from the Papua New Guinea Institute of Medical Research Institutional Review Board (0801) and PNG Medical Research Advisory Board (07.28).

2. Sample collections

Field technicians collected mosquitoes from the villages of Dimer, Wasab, Kokofine, Mirap and Matukar in the Madang Province of Papua New Guinea (PNG) in June and August 2012. At each village, mosquitoes were collected between the hours of 7 pm and 4 am from an erected barrier screen as described by Burkot et al. (2013). The net was manually searched every hour and resting mosquitoes were collected from the net surface using an aspiration device. After collection, the sex and species of each mosquito were determined by morphology as previously described (Rozeboom & Knight 1946) (section I.B.2.a). All male mosquitoes and non-Anopheles mosquitoes were discarded. Each female Anopheles mosquito was visually classified as not-fed, partially-fed or fully-fed by examining the size and coloration of its abdomen and was individually stored in an Eppendorf tube containing silica gel.

3. DNA isolation and molecular species identification

I extracted DNA from individual mosquitoes using a 96-well Qiagen DNeasy® blood and tissue kit as previously described (see section II.B.1) (Logue et al. 2015). Mosquito species were identified using a PCR-based assay that evaluates species-specific polymorphisms in the ribosomal RNA internal transcribed spacer unit 2 (ITS2) (Henry-Halldin et al. 2011) (see section II.B.1).

4. In silico assessment of mammalian mt 16S rRNA primers

a. Amplification range of mt 16S rRNA primers

To test the range of mammals that the mt 16S rRNA primers (Taylor 1996) were able to amplify, I conducted an in silico analysis using the primerTree package (Cannon submitted). Briefly, I performed primer-BLAST searches using the mammalian mt 16S rRNA primers against the NCBI nucleotide database using default parameters, except that I allowed up to three mismatches in the primer sequences and set the maximum number of alignments to 10,000. I then retrieved the taxonomic information for each sequence returned. As this search can be biased by the recent release of many sequences from a specific taxon, I performed this search separately for each mammalian order. I then calculated how many different species were obtained from each order to estimate how many species would likely be amplified by this primer pair. Note that, when conducting the search without any taxonomic restrictions, these mammalian primers were also predicted to amplify amphibian and fish 16S rRNA genes.

b. Number of mammalian species amplified by primers

To estimate the total number of species, from each mammalian order, that have been sequenced for the targeted locus and deposited in NCBI, I randomly selected one DNA sequence from each taxonomic family and compared this DNA sequence against the NCBI nucleotide database. I filtered out any alignment for which the DNA sequence did not contain the primer sequences. I then counted how many different species were observed by merging the searches performed using the DNA sequences randomly selected from each family (and only counting species belonging to the order investigated). These analyses provided me with the total number of mammalian species that should be amplified if the primers were truly universal.

c. Ability of primers to differentiate among mammalian species

I also evaluated the ability of the 16S rRNA primers to differentiate species by examining the average proportion of informative sites that differentiate between species. I retrieved the DNA sequence alignment from my primerTree analysis and calculated the number of nucleotide differences, including pairwise deletions, between every pair of DNA sequences using the dist. program within the Ape package (Paradis et al. 2004). The proportion of informative sites between species only reflects how different, on average, sequences from different orders are. I also calculated specifically how often two different species (or genera) have a different DNA sequence for the locus of interest (i.e., at least one nucleotide difference).
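The pairwise comparison described above can be illustrated with a minimal Python sketch. It is not the analysis that was run (which used the Ape package in R); the function names are hypothetical, and the handling of gaps (a gap opposite a base counts as one difference, columns with a gap in both sequences are skipped) is an assumption made for illustration.

```python
from itertools import combinations

GAP = "-"


def proportion_different(seq_a, seq_b):
    """Proportion of aligned columns at which two sequences differ.

    Assumption: a gap opposite a base counts as a difference, and columns
    where both sequences carry a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    different = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == GAP and b == GAP:
            continue  # column carries no information for this pair
        compared += 1
        if a != b:
            different += 1
    return different / compared if compared else 0.0


def mean_pairwise_difference(alignment):
    """Average proportion of differing sites over every pair of sequences."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        return 0.0
    return sum(proportion_different(a, b) for a, b in pairs) / len(pairs)


def fraction_distinguishable(alignment):
    """Fraction of sequence pairs separated by at least one difference."""
    pairs = list(combinations(alignment.values(), 2))
    if not pairs:
        return 0.0
    return sum(proportion_different(a, b) > 0 for a, b in pairs) / len(pairs)
```

Here `alignment` is assumed to be a dictionary mapping a species name to its aligned sequence; the first function gives the per-pair distance, and the last one corresponds to asking how often two species differ by at least one nucleotide.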

5. Targeted high-throughput sequencing of mosquito host blood meals

a. Primer design of mammalian mt 16S ribosomal RNA genes and human mt genome hypervariable region I

I analyzed the blood meals of individual mosquitoes by amplifying 140 bp of the mammalian mt 16S rRNA gene using universal mammalian primers (Taylor 1996) modified to include a 5’-end tail complementary to the Illumina sequencing primers (Table 15). To identify individual differences in human blood meals, I designed PCR primers to amplify 300 bp of the human mt hypervariable region I. I designed these primers by aligning 795 whole mt genomes of individuals from Oceania (Duggan et al. 2014) using MAFFT version 7 (Katoh & Standley 2013). Conserved sequences flanking the most variable portions of the mt hypervariable region I for these individuals were identified and used to design primers with Primer3 (Untergasser et al. 2012). As described above, a 5’ tail was added to each primer for sample barcoding and high-throughput sequencing (Table 15).

b. Primer design for genotyping polymorphisms within the nuclear genome of An. punctulatus s.s.

To determine if population structure exists between An. punctulatus s.s. mosquitoes that fed on human and non-human hosts, I designed primers to amplify 30 high-quality (Phred quality score, Q>20) polymorphic sites randomly distributed across the genome of An. punctulatus s.s. (assembled in section II.B.4) (Appendix E.1). These polymorphic sites had a reference allele frequency of 40 to 60% and coverage of 28 to 76X. Each polymorphism was located on a unique contig to decrease the likelihood of the polymorphic sites being linked. Additionally, I removed any possible paralogous regions by filtering out sequences that had more than 5 polymorphisms within a 100 bp window and validated, in silico, that these primers did not amplify other regions within the genome of An. punctulatus s.s. I designed primers to amplify a 200 to 300 bp region (locus) around each polymorphism using Primer3 (Rozen & Skaletsky 2000) with default settings except for the following change: the primer product size range was set to 200 to 300 bp. As described above, a 5’-end tail complementary to the Illumina sequencing primers was also added (Appendix E.1).
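The site-selection criteria listed above (reference allele frequency of 40 to 60%, coverage of 28 to 76X, no more than 5 polymorphisms within a 100 bp window, one site per contig) can be sketched as a simple filter. This is an illustration only: the `Snp` record, the function names and the rule of keeping the first passing site on each contig are assumptions, whereas the sites in this chapter were chosen at random across the genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    contig: str
    position: int
    ref_allele_freq: float  # fraction of reads carrying the reference allele
    coverage: int           # read depth at the site


def in_dense_window(position, contig_positions, window=100, max_snps=5):
    """True if a window of `window` bp containing `position` holds more
    than `max_snps` polymorphic sites (a sign of possible paralogy)."""
    for start in contig_positions:
        if position - window < start <= position:
            count = sum(1 for p in contig_positions if start <= p < start + window)
            if count > max_snps:
                return True
    return False


def select_candidates(snps, freq_range=(0.40, 0.60), cov_range=(28, 76)):
    """Keep at most one passing site per contig (assumption: the first one)."""
    positions = {}
    for snp in snps:
        positions.setdefault(snp.contig, []).append(snp.position)
    chosen = {}
    for snp in sorted(snps, key=lambda s: (s.contig, s.position)):
        if not freq_range[0] <= snp.ref_allele_freq <= freq_range[1]:
            continue
        if not cov_range[0] <= snp.coverage <= cov_range[1]:
            continue
        if in_dense_window(snp.position, positions[snp.contig]):
            continue
        chosen.setdefault(snp.contig, snp)
    return list(chosen.values())
```

Restricting the result to one site per contig mirrors the stated goal of reducing the chance that two genotyped sites are linked.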

Table 15. Primers used to amplify mammalian host blood meals and the human mitochondrial hypervariable region I.

Primer Name | Primer
16SmamF | CTTTCCCTACACGACGCTCTTCCGATCTCGGTTGGGGTGACCTCGGA
16SmamR | GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTGTTATCCCTAGGGTAACT
huPNG_103F | CTTTCCCTACACGACGCTCTTCCGATCTTACTGCCAGCCACCATGAAT
huPNG_402r | GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGTCAAGGGACCCCTATCTG

Sequences in red and blue are the 5’ end tails that are complementary to the forward and reverse Illumina sequencing primers, respectively.
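As a reading aid for Table 15, the short sketch below separates each primer into its 5’ tail and its locus-specific portion. The two tail strings are an assumption: they are taken as the 5’ portion shared by the two forward primers and by the two reverse primers in the table, which is consistent with tails complementary to Illumina sequencing primers. The function name is hypothetical.

```python
# Assumed 5' tails, inferred from the shared 5' portion of the Table 15 primers.
FORWARD_TAIL = "CTTTCCCTACACGACGCTCTTCCGATCT"
REVERSE_TAIL = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

PRIMERS = {
    "16SmamF": "CTTTCCCTACACGACGCTCTTCCGATCTCGGTTGGGGTGACCTCGGA",
    "16SmamR": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTGTTATCCCTAGGGTAACT",
    "huPNG_103F": "CTTTCCCTACACGACGCTCTTCCGATCTTACTGCCAGCCACCATGAAT",
    "huPNG_402r": "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGTCAAGGGACCCCTATCTG",
}


def split_primer(sequence):
    """Return (orientation, locus_specific_part) for a tailed primer."""
    for orientation, tail in (("forward", FORWARD_TAIL), ("reverse", REVERSE_TAIL)):
        if sequence.startswith(tail):
            return orientation, sequence[len(tail):]
    raise ValueError("primer does not start with a known tail")


if __name__ == "__main__":
    for name, sequence in PRIMERS.items():
        orientation, locus_part = split_primer(sequence)
        print(name, orientation, locus_part)
```

Under this assumption, the locus-specific part of each primer is what remains after the tail is removed.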

c. Targeted high-throughput sequencing of amplicons

For each sample and amplicon, I conducted two amplification reactions to prepare PCR products for Illumina-based sequencing (Figure 20). First, I performed a locus-specific amplification (e.g., of the mammalian mt 16S rRNA or the human mt hypervariable region I) using the Promega GoTaq PCR kit protocol with 1 μL of genomic DNA and with magnesium and primers added to final concentrations of 4 mM and 0.2 μM, respectively. PCR amplification was carried out under the following conditions: 3 minutes at 94°C followed by 30 cycles of 94°C for 45 seconds, 50°C for 45 seconds and 72°C for 30 seconds, and a final elongation step at 72°C for 3 minutes. After the first PCR, I purified the PCR products using the QIAquick 96 PCR purification kit protocol (QIAGEN). Second, I incorporated the Illumina sequencing primers, including unique 6-nucleotide barcode sequences (Appendix E.2), at the ends of the individual blood meal PCR products by performing 10 additional PCR cycles using barcoding primers complementary to the 5’-end tail (Figure 20). For these reactions, I used the Promega GoTaq protocol as described above with 1 μL of PCR product added to each reaction. The same thermocycling conditions as described above were used, except that the annealing temperature was 56°C. The final amplicon size after both amplification reactions was between 265 and 343 bp for the mammalian mt 16S rRNA, between 440 and 444 bp for the human mt hypervariable region and between 343 and 443 bp for the 30 An. punctulatus s.s. SNP loci amplified (including Illumina adaptors, unique barcode sequence and sequencing primers, Figure 20). Finally, I pooled the barcoded amplification products from 96 individual mosquitoes together and simultaneously sequenced them on an Illumina MiSeq instrument. Overall, 442 mosquitoes were analyzed with an average coverage of 82,528 sequencing reads (150 bp paired-end) per sample for the 16S rRNA amplicons, 26,721 sequencing reads (250 bp paired-end) for the human mt hypervariable region amplicons and 42,190 sequencing reads (250 bp paired-end) for the 30 loci amplified to assess population structure within An. punctulatus s.s. (see Appendix A for all NCBI SRA accession information).

Figure 20. Overview of the sequencing assay used to characterize blood meal composition of individual mosquitoes. (i) A first PCR is performed on DNA extracted from each mosquito using primers targeting ~140 bp of the mammalian mt 16S rRNA (in gray) and a 5’-end tail complementary to the Illumina sequencing primers (in red). (ii) A second PCR incorporates the Illumina sequencing primers and a 6-nucleotide barcode unique to each mosquito at the ends of the individual blood meal PCR products. (iii) After pooling amplification products from 96 samples together, the PCR products are simultaneously sequenced on an Illumina MiSeq to (iv) generate paired-end reads (in gray) and barcode sequences (grey box). (v) Paired-end reads are then merged to provide error-corrected consensus sequence reads.
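Because each mosquito receives its own 6-nucleotide barcode, reads from a pooled run can be assigned back to individual samples. The sketch below illustrates one way to do this under stated assumptions: the barcode is available separately from the read sequence, only exact barcode and primer matches are accepted, the primer is trimmed, and inserts shorter than 50 bp are dropped. The function name, the barcode and the sample label are hypothetical.

```python
def assign_read(barcode, sequence, barcode_to_sample, primer, min_length=50):
    """Return (sample, trimmed_insert) or None if the read is rejected."""
    sample = barcode_to_sample.get(barcode)  # exact barcode match only
    if sample is None:
        return None
    if not sequence.startswith(primer):      # exact primer match only
        return None
    insert = sequence[len(primer):]          # keep only the amplified sequence
    if len(insert) < min_length:             # short inserts are likely primer dimers
        return None
    return sample, insert


if __name__ == "__main__":
    barcodes = {"ACGTAC": "mosquito_01"}     # hypothetical barcode assignment
    primer = "CGGTTGGGGTGACCTCGGA"           # locus-specific part of 16SmamF
    read = primer + "ACGT" * 20
    print(assign_read("ACGTAC", read, barcodes, primer))
```

Requiring exact matches trades a small loss of reads for a lower risk of assigning a read to the wrong mosquito.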

6. Bioinformatic assessment of blood meal composition and population structure from individual mosquitoes

a. Filtering sequencing data for analysis

I filtered out any read that did not carry exact barcode and primer sequences and removed the primer sequences from all sequencing reads to only keep the amplified sequences. I further removed any resulting read smaller than 50 bp as these likely represent primer dimers. I then merged paired-end sequencing reads together using pandaseq (Masella et al. 2012) (Figure 20) and analyzed the 16S rRNA sequences, the human mtDNA sequences and the 30 polymorphic loci amplified from An. punctulatus s.s. separately.

b. Identification of host blood meals from Anopheles mosquitoes

Using all 43,743,363 16S rRNA sequences generated from the 442 mosquitoes, I identified all unique DNA sequences using Mothur (Schloss et al. 2009) and recorded the number of reads carrying each unique sequence. I removed any DNA sequence that was observed fewer than 10 times across all samples, as these likely resulted from sequencing errors. I compared the remaining unique DNA sequences against all DNA sequences present in the NCBI nucleotide database using blastn. For each DNA sequence, I recorded the best match(es), only considering sequences with > 90% identity over the entire sequence length. I then retrieved the taxonomic information for each best match from NCBI using the get_taxonomy function in PrimerTree (Cannon submitted). When an amplified sequence matched multiple species equally well, I recorded all species names associated with that sequence. I then summarized the blood meal of each mosquito by calculating the proportion of reads matching each species. I only analyzed mosquito samples with at least 1,000 reads (Figure 21). I considered a mosquito to have fed on a single mammalian host if >90% of the sequencing reads aligned to the 16S rRNA of that species. Alternatively, if >10% of the sequencing reads aligned to a second species, I considered the mosquito to have fed on multiple mammalian hosts.
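The classification thresholds above can be written out as a short sketch. It assumes that the reads of one mosquito have already been tallied per host species; the function name and the "unresolved" label (for a sample in which neither rule applies) are hypothetical additions for illustration.

```python
def classify_blood_meal(read_counts, min_reads=1000, single_host=0.90, minor_host=0.10):
    """Classify one mosquito from its per-host read counts.

    Returns (label, hosts) where label is "excluded", "single",
    "multiple" or "unresolved" (the last is an assumption of this sketch).
    """
    total = sum(read_counts.values())
    if total < min_reads:
        return "excluded", []                 # too few reads to analyze
    proportions = {host: n / total for host, n in read_counts.items()}
    ranked = sorted(proportions, key=proportions.get, reverse=True)
    if proportions[ranked[0]] > single_host:
        return "single", [ranked[0]]          # >90% of reads from one host
    hosts = [host for host in ranked if proportions[host] > minor_host]
    if len(hosts) > 1:
        return "multiple", hosts              # a second host exceeds 10%
    return "unresolved", hosts


if __name__ == "__main__":
    print(classify_blood_meal({"human": 9500, "dog": 500}))
    print(classify_blood_meal({"human": 6000, "pig": 4000}))
```

A sample with, for example, 88% of reads from one host and the remainder spread thinly over several others satisfies neither rule, which is why the sketch reports it separately rather than forcing a call.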

Figure 21. Summary of the sequencing depth for each mosquito sample for the mammalian blood meals amplified using the 16S rRNA primers. Each vertical bar represents a single mosquito ranked along the x-axis according to the number of reads obtained to characterize its blood meal (y-axis, log scaled). The panel underneath the plot indicates whether the mosquitoes were visually classified as fed (fully-fed and partially-fed, green horizontal bar) or not-fed (blue bar). Extraction controls (water) are represented by the black horizontal bar. The horizontal red bar at 1,000 indicates the cut-off used for analysis inclusion.

c. Identification of human individuals fed on by mosquitoes

For the human mt hypervariable region, I aligned merged reads to the human mitochondrial reference genome sequence (NCBI RefSeq: NC_012920.1) using Bowtie2 with default parameters (Langmead & Salzberg 2012). For each sample, I calculated the number of reads supporting each haplotype and reconstructed phylogenetic trees using MEGA version 6 (Tamura et al. 2013). Note that I considered each unique mtDNA sequence a haplotype as both merged sequencing reads originated from the same DNA molecule. Only haplotypes supported by more than 500 reads were included for this analysis to avoid incorporating sequencing or PCR errors (i.e., rare haplotypes that differed from an abundant haplotype by one nucleotide difference) in the analyses. A mosquito was considered to have fed on more than one human individual if more than one mtDNA haplotype was present.

d. Examination of population structure within An. punctulatus s.s. mosquitoes

To determine if any population structure exists within An. punctulatus s.s. mosquitoes, I aligned the sequencing reads of each sample against the An. punctulatus s.s. genome assembly (see section II.B.4) using Bowtie2 (Langmead & Salzberg 2012) with default parameters. I called heterozygous sites at positions sequenced at high quality (Phred quality score, Q≥20) and high coverage (≥100X) and only considered a site to be variable if >20% of the sequencing reads supported the minor allele. Note that I examined how coverage influenced heterozygous sites by examining the number of heterozygous sites called at various coverage levels ranging from 20 to 1000X and found that the number of heterozygous sites called was similar at all coverage levels (Figure 22). For the original 30 heterozygous sites genotyped, I examined the number of alleles observed at each site, across all samples, and removed any site that was triallelic (as these loci are possibly paralogous) or not in Hardy-Weinberg equilibrium. I also removed any locus that had <1,000 reads aligned across all samples, was not successfully amplified in >7% of the samples or was not polymorphic in any of the samples. In addition, I removed any sample where at least 26% of the loci were not successfully amplified. Overall, I was able to successfully amplify 12 loci from 52 An. punctulatus s.s. mosquitoes from the villages of Dimer and Wasab.
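The genotype-calling rule described above (coverage of at least 100X and more than 20% of reads supporting the minor allele) can be sketched as follows. The sketch assumes that the allele counts passed in already exclude bases below the quality threshold; the function name and the return labels are hypothetical.

```python
def call_site(allele_counts, min_coverage=100, min_minor_fraction=0.20):
    """Call one site in one sample from its per-allele read counts.

    Assumption: `allele_counts` only includes bases that passed the
    quality filter (Q>=20).
    """
    coverage = sum(allele_counts.values())
    if coverage < min_coverage:
        return "no_call"                      # insufficient coverage
    ranked = sorted(allele_counts.values(), reverse=True)
    minor = ranked[1] if len(ranked) > 1 else 0
    if minor / coverage > min_minor_fraction:
        return "heterozygous"
    return "homozygous"


if __name__ == "__main__":
    print(call_site({"A": 70, "G": 50}))
    print(call_site({"A": 190, "G": 10}))
```

Raising `min_coverage` in this function is the kind of change examined in Figure 22, where the number of heterozygous calls was compared across coverage thresholds.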

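[Figure 22 plot: grouped bars showing the number of heterozygous sites (y-axis, 0 to 400) for each locus (x-axis: 1090, 1142, 10427, 10624, 10742, 11191, 11263, 11513, 11571, 12142, 13374, 10286, 10366, 11776, 12988) at minimum coverage levels of 20X, 40X, 100X, 500X and 1000X.]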

Figure 22. Number of heterozygous sites called at various levels of coverage. Each bar cluster represents a single locus amplified from all samples (x-axis) and the number of heterozygous sites called (y-axis). The color of each bar represents the minimum coverage level for calling a site heterozygous: 20X (blue), 40X (red), 100X (green), 500X (purple) and 1,000X (light blue).

Using these 12 loci (SNPs), I determined if there was any population structure within An. punctulatus s.s. mosquitoes using a model-based clustering algorithm implemented in the program STRUCTURE (version 2.3.4) (Pritchard et al. 2000) to determine the number of clusters (K), or populations, that exists within my samples. I used default parameters and set the number of iterations to 1 million, after a 10% burn-in, and ran three iterations for each of the different values of K tested (K=1 to 4).

To increase the resolution of my analysis, I reconstructed haplotypes consisting of all SNPs (i.e., not just the SNPs targeted) across each sequencing read and removed any haplotype that was not carried by at least 100 sequencing reads. Additionally, at least 20% of the reads, across all samples, from each locus had to support each haplotype. I considered a unique sequencing read that met the above criteria to be a haplotype, as both merged sequencing reads originated from the same DNA molecule. I ran STRUCTURE (version 2.3.4) (Pritchard et al. 2000) using these haplotypes to evaluate population structure within An. punctulatus s.s. using the same parameters as described above. Additionally, I conducted an analysis of molecular variance (AMOVA) in Arlequin (version 3.01) (Excoffier et al. 2005) using default parameters and stratified my analysis by village and blood host fed on.
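The two haplotype-retention criteria described above (at least 100 supporting reads, and at least 20% of all reads from the locus) can be sketched for a single locus as follows. The input is assumed to be the list of merged read sequences for that locus pooled across all samples; the function name is hypothetical.

```python
from collections import Counter


def retain_haplotypes(read_sequences, min_reads=100, min_fraction=0.20):
    """Keep haplotypes supported by enough reads at one locus.

    `read_sequences` is assumed to hold every merged read for the locus,
    pooled across samples; each distinct sequence is treated as a haplotype.
    """
    counts = Counter(read_sequences)
    total = sum(counts.values())
    return {
        haplotype: n
        for haplotype, n in counts.items()
        if n >= min_reads and n / total >= min_fraction
    }


if __name__ == "__main__":
    reads = ["AAC"] * 600 + ["AAT"] * 350 + ["ACT"] * 50
    print(retain_haplotypes(reads))
```

Applying both thresholds together removes sequences that are numerous in absolute terms but rare relative to the depth of the locus, which are more likely to reflect PCR or sequencing errors.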

C. Results

1. In silico assessment of mammalian mt 16S rRNA primers

a. Amplification range of universal mammalian 16S rRNA primers

I describe a molecular approach to study any insect’s blood meal composition in a comprehensive and unbiased manner using deep sequencing of the vertebrate mt 16S rRNA gene amplified directly from fed insects. I first conducted extensive in silico analyses to confirm that the primer pair (Taylor 1996) used could amplify DNA sequences from the majority of mammalian orders including Primates, Rodentia, Artiodactyla (even-toed ungulates), Carnivora, Chiroptera (bats), Cetacea (marine mammals), Insectivora (insectivores) and Diprotodontia (). Overall, in silico analysis predicted that these primers should amplify 1,752 of the 1,779 mammalian species (98.5%) sequenced at this locus and present in the NCBI nucleotide database (Table 16). Besides mammals, these primers also seemed to amplify Actinopteri (bony fishes) and Amphibia (amphibians) (Figure 23A).

[Figure 23 plot: panel A, tree colored by class (Mammalia, Amphibia, Actinopteri); panel B, tree restricted to mammals and colored by order (Microbiotheria, Macroscelidea, Paucituberculata, Proboscidea, Perissodactyla, Dasyuromorphia, Peramelemorphia, Didelphimorphia, Insectivora, Suina, Rodentia, Pholidota, Lagomorpha, Sirenia, Carnivora, Pilosa, Ruminantia, Primates, Chiroptera, Cingulata, Tylopoda, Monotremata, Dermoptera, Hyracoidea, Tubulidentata, Notoryctemorphia, Cetacea); scale mark: 10.]

Figure 23. Neighbor-joining tree reconstructed using the DNA sequences predicted to be amplified by the mammalian mt 16S rRNA primers. Each dot represents a different DNA sequence. (A) Tree showing the entire range of species amplified and colored by classes (Blue, mammals; Red, bony-fish; Green, amphibians). (B) Tree showing the results restricted to mammals and colored by orders. The scale mark represents 10 nucleotide differences.

b. Specificity of the universal mammalian 16S rRNA gene primer pairs

In addition to their wide range of amplification targets, these primers need, for my purpose, to amplify DNA sequences containing enough information to identify each species specifically. I tested this parameter by comparing the DNA sequences predicted to be amplified by this primer pair. Despite the short amplified DNA sequence (~140 bp), these primers enabled rigorous differentiation of most mammalian species as illustrated by the nucleotide differences (including deletions) between sequences of species belonging to the same or different order (Table 17). For example, 27% of the nucleotide positions differ on average between one Carnivora and one Primate species and 17% between two Carnivora species (Table 17). This high discriminating ability is also shown by the long-branch lengths displayed by a phylogenetic tree reconstructed using these sequences (Figure 23B). In fact, I found that one DNA sequence amplified by these primers typically matches a single genus and, in 86% of the cases, a single species (Table 16).

Table 16. Summary of the amplification range and discriminatory power predicted for the mammalian 16S rRNA primers. The table indicates, for each mammalian order, the number of species deposited in NCBI for the 16S rRNA genes, the number of species predicted to be amplified by the universal primers as well as the percentage of genera and species that would carry a unique sequence for this locus (enabling their rigorous identification).

Orders* | # of species | # of species amplified | Genus (% unique) | Species (% unique)
Placental
Artiodactyla (even-toed ungulates) | 209 | 206 | 93.6 | 76.1
Carnivora (carnivores) | 141 | 138 | 97 | 86.1
Cetacea (whales) | 73 | 73 | 58.1 | 44.2
Chiroptera (bats) | 423 | 421 | 97.7 | 89.8
Insectivora (insectivores) | 174 | 169 | 100 | 84.8
Lagomorpha (rabbits and hares) | 24 | 21 | 100 | 68.4
Macroscelidea (elephant shrews) | 12 | 12 | 100 | 100
Perissodactyla (odd-toed ungulates) | 22 | 21 | 100 | 68.4
Primates (primates) | 195 | 192 | 100 | 94.6
Rodentia (rodents) | 404 | 400 | 100 | 91.2
Scandentia (tree shrews) | 19 | 17 | 100 | 88.2
Marsupial
Dasyuromorphia (quolls, dunnarts, and numbats) | 66 | 67 | 100 | 90.9
Didelphimorphia (opossums) | 19 | 19 | 89.5 | 89.5
Diprotodontia (possums, , and ) | 63 | 62 | 100 | 100
Peramelemorphia (bandicoots and bilbies) | 14 | 14 | 100 | 85.7

* This table does not include 7 orders for which fewer than 10 sequences were available in NCBI for this locus.

Table 17. Proportion of nucleotide differences, including deletions, between sequences of species in the same or different mammalian order.

2. Application of assay to field-caught female Anopheles mosquitoes

a. Mosquitoes collected and sequencing reads generated

I analyzed mosquitoes collected from five villages in the Madang Province in PNG: Dimer (n=45), Wasab (n=81), Kokofine (n=83), Mirap (n=171) and Matukar (n=62). These mosquitoes included several species of the Anopheles punctulatus group: An. punctulatus s.s., An. koliensis, An. farauti 4 and An. farauti s.s. (Table 18). I characterized the blood meal composition of a total of 442 female Anopheles by amplifying the mammalian mt 16S rRNA genes from DNA extracted from these mosquitoes, pooled the PCR products of 96 samples after individual barcoding, and simultaneously sequenced the samples on an Illumina MiSeq instrument (Figure 20). I generated a total of 43,743,363 paired-end reads of 150 bp. For 42,198,573 DNA sequences (96.5%), I was able to collapse the overlapping paired-ends (Figure 20) and correct the majority of the sequencing errors. After combining the reads generated from all samples together, I identified 2,436,277 unique DNA sequences (accounting for 42,198,573 reads or 100%). I discarded from further analyses 2,404,684 unique DNA sequences that were carried by fewer than 10 reads across all samples (as these likely represent DNA sequences with rare sequencing errors).
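The dereplication step described above, in which identical reads are collapsed and unique sequences carried by fewer than 10 reads across all samples are discarded, can be sketched as follows. This illustrates the idea only and is not the Mothur workflow used here; the function name and the input layout (a dictionary mapping each sample to its list of merged reads) are assumptions.

```python
from collections import Counter


def dereplicate(reads_by_sample, min_total=10):
    """Collapse identical reads across samples and drop rare sequences.

    Returns (kept, retained_fraction): the unique sequences observed at
    least `min_total` times with their read counts, and the fraction of
    all reads they account for.
    """
    totals = Counter()
    for reads in reads_by_sample.values():
        totals.update(reads)
    kept = {sequence: n for sequence, n in totals.items() if n >= min_total}
    n_reads = sum(totals.values())
    retained_fraction = sum(kept.values()) / n_reads if n_reads else 0.0
    return kept, retained_fraction


if __name__ == "__main__":
    toy = {"m1": ["AAA"] * 8 + ["AAC"], "m2": ["AAA"] * 4 + ["AAG"] * 2}
    print(dereplicate(toy))
```

Counting across all samples, rather than per sample, keeps a genuine host sequence that is rare in one mosquito but well supported elsewhere in the run.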

Table 18. Number of mosquitoes that fed, at least partially, on each identified mammalian host according to their species. Numbers in parentheses represent the number of samples that had > 90% of reads align to the mammalian species.

Species | Human | Dog | Pig | Mouse | Bat | Common spotted cuscus
An. punctulatus s.s. (n=71) | 52 (44) | 20 (8) | 12 (4) | 3 (0) | 0 | 0
An. koliensis (n=16) | 15 (13) | 3 (1) | 1 (0) | 0 (0) | 0 | 0
An. farauti 4 (n=60) | 54 (51) | 4 (3) | 5 (3) | 0 (0) | 0 | 0
An. farauti s.s. (n=172) | 80 (55) | 33 (17) | 93 (65) | 2 (0) | 1 (0) | 1 (0)

I then compared the remaining 31,593 unique DNA sequences, accounting for 39,310,579 (or 89.9%) of the reads initially generated, to all DNA sequences deposited in the NCBI database. 28,999 of these DNA sequences (representing 38,375,616 (or 97.6%) of the reads analyzed) had > 90% nucleotide identity to at least one mammalian DNA sequence present in NCBI: 18,814 DNA sequences best matched a single mammalian species sequence while 10,185 DNA sequences matched equally well to multiple mammalian species sequences (Table 19).

Overall, I generated an average of 82,528 DNA sequences per mosquito. The number of reads generated from each mosquito varied considerably (Figure 21), as it depends on several factors including the amount of starting template (i.e., quantity of mammalian DNA present in the mosquito), the amplification efficiency, uneven pooling and variation in sequencing output. For further analyses, I only considered mosquito samples with more than 1,000 reads. None of the 30 extraction controls (water samples) reached this cutoff, illustrating the low level of cross-contamination or read mis-assignment due to errors in the barcode sequence (if any). In contrast, I successfully retrieved mammalian DNA from 314 blood fed mosquitoes, including 56 out of the 86 mosquitoes visually classified as partially-fed (65.1%) and 258 out of the 337 mosquitoes characterized as fully-fed (76.6%). Only 5 out of the 19 mosquitoes visually classified as not-fed yielded mammalian 16S rRNA sequences. Interestingly, the number of sequencing reads generated for mosquitoes visually classified as fully-fed or partially-fed was significantly different from that for not-fed ones (p<0.05, Wilcoxon Rank-Sum test, Figure 24), but there was no difference between fully-fed and partially-fed mosquitoes. In total, I successfully amplified and sequenced mammalian DNA from 319 Anopheles mosquitoes (314 blood fed mosquitoes and 5 not-fed mosquitoes from which a blood meal was nonetheless amplified).

Table 19. Summary of blast results showing, for each species identified, the number of mosquito samples that carried a DNA sequence matching this species, the maximum percent identity between the reads and NCBI sequence, and the average number of corresponding reads per sample. Note that when the sequences generated matched equally well several species, these are all indicated.

General Name | Species | Average # of reads | # of Samples Detected | Percent Identity
Bat | Dobsonia moluccensis | 2,664 | 1 | 100
Bat | Dobsonia praedatrix | 1,916 | 1 | 94.4
Common spotted cuscus | Spilocuscus rufoniger | 7,599 | 1 | 93.5
Dog | Canis lupus/Canis aureus/Canidae | 53,933.9 | 60 | 100
Dog | Canis lupus | 29,706.1 | 59 | 100
Dog | Canis latrans | 135.3 | 54 | 100
Dog | Canis aureus | 75.5 | 51 | 100
Dog | Canis lupus/Canis aureus | 63.7 | 43 | 98.9
Dog | Canis lupus/Canis latrans/Canis aureus/Canidae | 30.8 | 37 | 98.9
Dog | Canis latrans/Canis aureus | 25.6 | 25 | 97.8
Dog | Canis latrans/Canis aureus/Canis lupus | 20.2 | 17 | 95.7
Dog | Canis latrans/Canis lupus | 6.5 | 4 | 98.9
Dog | Canis aureus/Canis lupus | 5 | 1 | 95.7
Human | Homo sapiens | 70,405.7 | 201 | 100
Human | Pan troglodytes/Homo sapiens | 1,386.1 | 185 | 100
Human | Pongo abelii/Homo sapiens | 213.3 | 138 | 100
Human | Pan troglodytes | 182.5 | 136 | 100
Human | Homo sapiens/Pan troglodytes | 34.8 | 110 | 98.9
Human | Homo sapiens/Gorilla gorilla | 11.3 | 7 | 100
Mouse | Mus musculus | 3,217.8 | 5 | 100
Pig | Sus scrofa/Sus barbatus/Scolothrips takahashii/Sus philippensis | 55,523.3 | 111 | 100
Pig | Sus scrofa | 19,298.6 | 109 | 100
Pig | Sus scrofa/Sus barbatus/Scolothrips takahashii/Sus philippensis/Sus cebifrons | 320.2 | 104 | 98.9
Pig | Potamochoerus porcus/Potamochoerus larvatus | 119.7 | 95 | 94.7
Pig | Sus scrofa/Sus celebensis | 110.7 | 104 | 100
Pig | Babyrousa babyrussa | 93.7 | 95 | 92.6
Pig | Sus verrucosus | 62.5 | 101 | 100
Pig | Sus scrofa/Sus verrucosus/Sus barbatus/Scolothrips takahashii/Sus philippensis | 42.2 | 97 | 98.9
Pig | Babyrousa celebensis | 34.5 | 89 | 91.6
Pig | Potamochoerus porcus | 30.3 | 89 | 92.6
Pig | Sus scrofa/Sus verrucosus | 27.7 | 87 | 95.7
Pig | Sus scrofa/Babyrousa babyrussa | 24.3 | 93 | 94.7
Pig | Potamochoerus porcus/Babyrousa babyrussa/Potamochoerus larvatus | 15.2 | 76 | 93.7
Pig | Sus scrofa/Sus celebensis/Sus barbatus/Scolothrips takahashii/Sus philippensis | 14.2 | 70 | 98.9
Pig | Sus scrofa/Sus philippensis | 9.9 | 61 | 90.5
Pig | Sus cebifrons | 5 | 1 | 90.5
Pig | Sus scrofa/Potamochoerus porcus/Potamochoerus larvatus/Sus barbatus/Scolothrips takahashii/Sus philippensis | 5 | 1 | 92.6

Figure 24. Box plot showing the number of sequencing reads generated per mosquito according to the visual blood meal status of mosquitoes (only samples from one sequencing run are displayed).

b. Mammalian blood hosts identified from blood fed Anopheles mosquitoes

Overall, I identified 201 Anopheles mosquitoes that carried human DNA, 60 that carried dog DNA, 111 pig DNA and 5 mouse DNA. In addition to these expected hosts, I identified one mosquito that carried DNA from two different bat species: 7.2% of the reads perfectly matched Dobsonia moluccensis, a fruit bat commonly found in PNG, while 5.1% of the reads were most similar (94.4% identity) to another species, Dobsonia praedatrix, also endemic to PNG. Note that these DNA sequences are clearly distinct (8 nucleotide differences) and unlikely to derive from sequencing errors, indicating that the mosquito fed on two different bats (Table 20 and Figure 25A). This result illustrates the power of this sequencing approach to identify host species even if their DNA sequences are not present in the database (as long as a closely related species has been sequenced). Similarly, one mosquito yielded 7,599 reads (13%) most similar to the common spotted cuscus (Spilocuscus maculatus, 98% similarity), a marsupial found in the forests of PNG (Figure 25B). Note that, as predicted by my in silico analysis, I was not always able to identify the exact species that was fed on. For example, I could not differentiate Canis lupus from Canis aureus (Table 19).

Out of the 319 mosquitoes analyzed here, 52 (16.3%) showed clear evidence of having fed on more than one host species (with >10% of the reads supporting the minor host): 44 mosquitoes carried DNA from two species and eight, DNA from three species (Figure 26).
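The 10% criterion used above to call a mixed blood meal can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the pipeline used in this study: the function names, the input format (a dictionary of read counts per host) and the example counts are all assumptions made for the sketch.

```python
from typing import Dict, List

# Minimum fraction of reads a host must exceed to be counted (">10%" in the text).
MINOR_HOST_THRESHOLD = 0.10


def hosts_in_blood_meal(read_counts: Dict[str, int],
                        threshold: float = MINOR_HOST_THRESHOLD) -> List[str]:
    """Return the hosts whose share of the reads exceeds `threshold`,
    ordered from the most to the least abundant."""
    total = sum(read_counts.values())
    if total == 0:
        return []
    shares = {host: n / total for host, n in read_counts.items()}
    kept = [host for host, share in shares.items() if share > threshold]
    return sorted(kept, key=lambda host: shares[host], reverse=True)


def is_mixed(read_counts: Dict[str, int],
             threshold: float = MINOR_HOST_THRESHOLD) -> bool:
    """A blood meal is called mixed when more than one host passes the threshold."""
    return len(hosts_in_blood_meal(read_counts, threshold)) > 1


# Hypothetical read counts for one mosquito (not data from this study).
example = {"human": 60000, "pig": 9000, "dog": 500}
print(hosts_in_blood_meal(example), is_mixed(example))
```

In this hypothetical example the dog reads fall below the threshold, so the mosquito would be scored as having fed on two hosts rather than three.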

Table 20. Summary of the hosts identified in the mosquito blood meals. For each host, the table indicates the number of Anopheles mosquitoes carrying a corresponding DNA sequence, the highest percent identity between the reads generated and the host DNA sequence in NCBI, and the average number of reads per sample carrying each DNA sequence.

Name | # samples detected | Percent identity | Average # of reads
Human | 201 | 100 | 71,971
Pig | 111 | 100 | 75,273
Dog | 60 | 100 | 83,412
Mouse | 5 | 100 | 3,218
Dobsonia moluccensis | 1 | 100 | 2,664
Dobsonia praedatrix | 1 | 94.4 | 1,916
Spilocuscus maculatus | 1 | 98 | 7,599

c. Blood meal composition of individual An. punctulatus species

Figure 25. Neighbor-joining phylogenetic trees showing the species relationships among (A) bat species and (B) marsupial species based on the DNA sequence targeted with the 16S mt rRNA primers. The red boxes indicate the (A) bat or (B) marsupial DNA sequences amplified from a single mosquito’s blood meal.

At each village, I identified three predominant mammalian hosts from the analyzed Anopheles mosquitoes: humans, dogs and pigs, accounting in total for 37 to 100% of each mosquito blood meal. However, the proportion of mosquitoes that fed on each host varied within and between villages. In Dimer, An. punctulatus s.s. (n=32), An. farauti s.s. (n=1) and An. koliensis (n=5) were collected. Overall, 19 out of the 38 AP mosquitoes (50%) fed on humans while 8 mosquitoes fed on dogs and 11 mosquitoes fed on two or three species including mice (Figure 26A) (see Appendix F.1 for blood meal composition of each AP species). In Wasab, where An. punctulatus s.s. (n=37) and An. koliensis (n=8) were collected, 35 out of 45 mosquitoes fed on humans (77.8%), while there were only 6 non-human and 4 mixed blood meals observed, including two instances of a mosquito feeding on mice (Figure 26B). All An. koliensis mosquitoes collected in Wasab fed exclusively on humans (Appendix F.2). Although mosquitoes fed more often on humans in Wasab than in Dimer, the feeding patterns of An. punctulatus s.s. in these two villages were not statistically different (p=0.05, chi-squared test; note that An. koliensis mosquitoes were removed). In Kokofine, where all mosquitoes analyzed were An. farauti 4 (with the exception of 2 An. koliensis mosquitoes, Appendix F.3), 52 out of the 62 mosquitoes fed on humans (84%) while the remaining 10 mosquitoes fed on dogs (n=3), pigs (n=3) or on multiple species (n=4) (Figure 26C).
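The village comparisons above rely on chi-squared tests of contingency tables. The following Python sketch shows how such a test can be computed for a 2x2 table without continuity correction. It is an illustration only: the counts are the all-species totals quoted in the text (19 of 38 mosquitoes fed on humans only in Dimer, 35 of 45 in Wasab), collapsed into "human only" versus "other". The test reported above was restricted to An. punctulatus s.s. and its exact categories are not restated here, so the result of this sketch is not expected to reproduce the reported p-value. The function name is an assumption.

```python
import math
from typing import Tuple

Table2x2 = Tuple[Tuple[int, int], Tuple[int, int]]


def chi2_2x2(table: Table2x2) -> Tuple[float, float]:
    """Pearson chi-squared statistic and p-value (1 degree of freedom,
    no continuity correction) for a 2x2 table of counts.
    Assumes every row and column total is non-zero."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: Dimer, Wasab. Columns: human-only blood meals, all other blood meals.
# Counts taken from the text for all AP species combined (illustrative use only).
dimer_vs_wasab = ((19, 19), (35, 10))
statistic, p = chi2_2x2(dimer_vs_wasab)
print(f"chi2 = {statistic:.2f}, p = {p:.4f}")
```

A test with more than two feeding categories (for example human, dog, pig and mixed) would need the general chi-squared distribution with the appropriate degrees of freedom, which this sketch does not cover.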

For An. farauti s.s. mosquitoes collected in the village of Matukar, 12 of the 47 mosquitoes fed on multiple mammalian hosts including mice while the remaining 35 fed on humans (n=26), dogs (n=6) and pigs (n=3) (Figure 26D). Note that one An. punctulatus s.s. mosquito was also collected in Matukar and fed on humans (Appendix F.4). In Mirap, where An. farauti s.s. (n=125), An. koliensis (n=1) and An. punctulatus s.s. (n=1) were collected (see Appendix F.5 for blood meal composition of each AP species collected), only 31 of the 127 mosquitoes (24%) fed on humans while 62 (49%) fed on pigs, 11 (9%) fed on dogs and 23 (18%) fed on two or three species, including one mosquito that fed on two bat species and one mosquito that fed on a common spotted cuscus (Figure 26E). I found that An. farauti s.s. mosquitoes from Mirap fed significantly more often on pigs than those from Matukar (p=0.0002, chi-squared test). However, note that as the densities of available hosts are not known for these villages, I was unable to test whether the observed differences in blood meal composition were associated with differences in mosquito feeding behavior among locations.

Figure 26. Composition of the blood meals for mosquitoes collected in (A) Dimer, (B) Wasab, (C) Kokofine, (D) Matukar, and (E) Mirap. Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching this host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.

Since mosquitoes were collected from both sides of a barrier screen, I assessed whether there was any difference in the feeding patterns between mosquitoes collected on the ‘bush’ (forest) side and the village side of the barrier screen. I did not detect any statistical difference in the feeding patterns of mosquitoes from any village except Mirap, where An. farauti s.s. mosquitoes fed more often on non-human hosts (94.8%) on the village side of the barrier screen than on the bush side (p=0.003, chi-squared test). Note, however, that very few samples were collected on the bush side (n=13).

I also evaluated the host feeding patterns of the four AP sibling species collected (Figure 27). To determine if An. farauti s.s. is more zoophilic than the other AP species analyzed here, I compared the number of human and non-human blood meals between An. farauti s.s. and the other AP species. I found that An. farauti s.s. fed statistically more often on non-human mammalian hosts (i.e. pigs and dogs) than the other AP species (p=3.4e-13, chi-squared test) (Figure 27). I conducted the same analysis for all other AP species, but did not find any statistical difference between the species’ feeding patterns.

Figure 27. Composition of the blood meals for each AP sibling species collected: (A) An. farauti s.s., (B) An. punctulatus s.s., (C) An. farauti 4, and (D) An. koliensis. Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching this host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.

3. Examination of population structure within An. punctulatus s.s.

I genotyped 30 SNPs from a total of 71 An. punctulatus s.s. mosquitoes to determine if there was any population structure according to the mammalian blood host fed on. On average, I generated 42,190 sequencing reads for each locus amplified. After removing samples and loci that were not successfully amplified, or did not meet my criteria (see section IV.B.6.d), I was able to examine the population structure of 52 An. punctulatus s.s. mosquitoes from the villages of Dimer and Wasab, 33 that fed exclusively on humans and 19 that fed on non-humans. Using the 12 successfully amplified SNPs, I did not find any evidence of population structure among the An. punctulatus s.s. mosquitoes analyzed (Figure 28A). Instead, my results indicate that the An. punctulatus s.s. population from these two villages is panmictic.

To increase the resolution of my analysis, I reconstructed haplotypes for each sample that contain all of the SNPs detected for each sequencing read. Using these haplotypes, I did not detect any evidence of population structure among An. punctulatus s.s. mosquitoes (Figure 28B). In addition, I conducted an AMOVA using these haplotypes. There was no evidence of genetic differentiation between An. punctulatus s.s. mosquitoes that fed exclusively on humans and those that fed on non-human blood hosts. There was also no genetic differentiation between An. punctulatus s.s. mosquitoes collected in the village of Dimer and those collected in Wasab. In fact, in all of my analyses, >90% of the variance was always within individual samples.

Figure 28. Inference of population structure among An. punctulatus s.s. mosquitoes collected from the villages of Dimer and Wasab using the program Structure, assuming K=2 clusters, using (A) 12 SNPs and (B) reconstructed haplotypes for 12 loci. The y-axis indicates the assignment probability of each individual to specific clusters.

4. Evidence of mosquito blood meals from multiple human hosts

Since I observed that 16.3% of the mosquitoes analyzed had fed on multiple mammalian hosts, I hypothesized that mosquitoes could also be feeding on multiple human individuals. I therefore investigated the number of different human sequences present in 157 human-fed mosquitoes using the same approach to sequence 300 bp of the human mt hypervariable region. I successfully amplified the human mt hypervariable region from 102 of these mosquitoes, yielding a total of 20 different human mtDNA sequences (Figure 29A). While a single DNA sequence was amplified from most human-fed mosquitoes analyzed (78%), 22 mosquitoes yielded two distinct DNA sequences (Figure 29A). One sequence, identified in 14 of these potential mixed human blood meals, was always present at low frequency (<8% of the reads) and was actually more similar to a region of human chromosome 11 (98% similarity) than to the mitochondrial genome (91%). This DNA sequence likely resulted from the amplification of a nuclear insertion of the mitochondrial sequence (numt; Leister 2005) and was excluded from further analyses. Nine mosquitoes, representing two species and collected in three locations, showed the presence of two human mtDNA sequences (Table 21). In four of these mosquitoes, only one substitution (out of the 300 bp amplified) differentiated the two sequences, and these differences could possibly be caused by a PCR error occurring at an early cycle. However, for the remaining five mosquitoes, 5-14 nucleotide substitutions differentiated the two sequences amplified, indicating that the mosquito successively fed on multiple individuals (Figure 29B and Table 21).

Figure 29. Neighbor-joining trees showing the relationships among the human mtDNA haplotypes amplified from mosquitoes, with (A) and without (B) a numt sequence. The shapes indicate the species of each mosquito carrying a specific human DNA sequence (squares represent An. punctulatus, triangles An. farauti 4). The color of each shape indicates the village where the mosquito was collected (green from Dimer, blue from Wasab, and purple from Kokofine). (A) Note the long branch separating the mitochondrial DNA sequences from the nuclear insertion (numt) sequence. (B) Mixed blood meals are highlighted by boxes of the same color: for example, the two red boxes show two human mtDNA haplotypes amplified from a single An. farauti 4 mosquito collected in Kokofine.

Table 21. Mixed human blood meals. The table shows, for each mosquito with multiple human mtDNA sequences, the collection site, the mosquito species, the number of nucleotide differences between the two mtDNA sequences and the proportion of the minor sequence.

Sample ID | Village | Species | Nucleotide differences | Proportion minor haplotype
HAS-3305 | Dimer | AP s.s. | 14 | 0.423
HAS-0247 | Wasab | AP s.s. | 5 | 0.148
HAS-4444 | Wasab | AP s.s. | 13 | 0.447
HAS-0530 | Kokofine | AF4 | 1 | 0.013
HAS-0640 | Kokofine | AF4 | 7 | 0.070
HAS-0807 | Kokofine | AF4 | 1 | 0.162
HAS-1736 | Kokofine | AF4 | 1 | 0.023
HAS-1737 | Kokofine | AF4 | 1 | 0.020
HAS-0485 | Kokofine | AF4 | 5 | 0.130

AP s.s. is An. punctulatus s.s.; AF4 is An. farauti 4.
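The interpretation of Table 21 can be sketched as a simple filter on the number of nucleotide differences between the two haplotypes found in each mosquito. The Python sketch below uses the values from Table 21. The cut-off of two or more differences is an assumption chosen to mirror the reasoning in the text, where a single substitution is treated as a possible PCR error; it is not a parameter stated in the methods, and the class and function names are illustrative.

```python
from typing import List, NamedTuple, Tuple


class MixedMeal(NamedTuple):
    sample_id: str
    village: str
    species: str
    nucleotide_differences: int
    minor_proportion: float


# Values transcribed from Table 21.
TABLE_21 = [
    MixedMeal("HAS-3305", "Dimer", "AP s.s.", 14, 0.423),
    MixedMeal("HAS-0247", "Wasab", "AP s.s.", 5, 0.148),
    MixedMeal("HAS-4444", "Wasab", "AP s.s.", 13, 0.447),
    MixedMeal("HAS-0530", "Kokofine", "AF4", 1, 0.013),
    MixedMeal("HAS-0640", "Kokofine", "AF4", 7, 0.070),
    MixedMeal("HAS-0807", "Kokofine", "AF4", 1, 0.162),
    MixedMeal("HAS-1736", "Kokofine", "AF4", 1, 0.023),
    MixedMeal("HAS-1737", "Kokofine", "AF4", 1, 0.020),
    MixedMeal("HAS-0485", "Kokofine", "AF4", 5, 0.130),
]


def split_by_support(rows: List[MixedMeal],
                     min_differences: int = 2
                     ) -> Tuple[List[MixedMeal], List[MixedMeal]]:
    """Separate mosquitoes whose two haplotypes differ at several sites
    (supported mixed meals) from those differing at fewer than
    `min_differences` sites (possibly a PCR error)."""
    supported = [r for r in rows if r.nucleotide_differences >= min_differences]
    ambiguous = [r for r in rows if r.nucleotide_differences < min_differences]
    return supported, ambiguous


supported, ambiguous = split_by_support(TABLE_21)
print(len(supported), "supported;", len(ambiguous), "possibly PCR errors")
```

With this cut-off the split matches the counts given in the text: five mosquitoes with 5-14 differences and four with a single difference.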

D. Discussion

Vector-borne diseases such as dengue, malaria, Chagas disease and leishmaniasis account for more than 17% of all infectious diseases and cause more than one million deaths annually (WHO March 2014). Characterizing the hosts on which vectors feed can reveal which species contribute to transmission. For example, while birds are well known to be the primary reservoir hosts of Eastern Equine Encephalitis virus (EEEV), a virus transmitted by mosquitoes that can cause zoonotic infections, recent studies have shown that snakes constitute another, previously unsuspected, reservoir of EEEV (Graham et al. 2012).

Most molecular techniques used to investigate insects’ blood meal composition are specifically designed to identify one or a few specific host(s) and cannot characterize blood meal composition in an agnostic manner. Universal primer pairs have been used to circumvent this limitation and amplify any mammalian (Hamer et al. 2008; Molaei et al. 2006) or vertebrate DNA (Kocher et al. 1989; Molaei et al. 2007). However, these studies have relied on cloning the amplified products and sequencing a few clones from each insect and are consequently very expensive and labor intensive. In addition, the presence of multiple species in a blood meal complicates the sequence analysis when the amplification product is sequenced directly (resulting in high background noise) or further increases the cost of the experiment if the PCR products are cloned and several clones are sequenced per mosquito. These challenges have limited the number of studies that rigorously examined mixed blood meals from disease vectors and have provided a potentially incomplete perspective on these vectors’ feeding behavior. Rigorous identification of mixed blood meals is, however, critical to understand disease transmission, as it might reveal higher transmission rates if a blood meal typically consists of the blood from multiple individuals, or lower rates if the insect often feeds on species not susceptible to infection.

1. Strength and utility of targeted sequencing approach to identify mammalian host blood meals from Anopheles mosquitoes

A unique strength of the assay described here is its ability to rigorously detect and quantify mixed blood meals by identifying, in a single mosquito, DNA from multiple species even if a species contributes only a small fraction of the entire blood meal (down to 10% in the current study). I was able to accurately detect and quantify mixed blood meals due to the high sequencing coverage achieved by high-throughput sequencing: on average, the mammalian mt 16S rRNA gene amplified from each mosquito was covered by 82,528 reads (and, therefore, even minor host DNA accounting for 10% of the total mammalian DNA was represented by several thousand reads). Note that the DNA amplification might have different efficiencies for different DNA sequences (e.g. amplifying pig DNA better than dog and human DNA). Consequently, the proportion of reads obtained from each species might not reflect the true proportions of these species in the blood meal (especially since the mtDNA content in blood might also vary among species). However, this possible bias will affect all samples similarly and still enables rigorous comparison of the blood meal composition across samples: detection of 10% of dog DNA sequences in a given sample might not indicate that exactly 10% of the blood meal came from a dog, but will reveal that this sample’s blood meal contained proportionally less dog blood than a sample where 20% of the reads matched dog DNA.

The second key feature of my approach is its ability to detect novel blood hosts that would not have been detected using traditional techniques. For example, here I report the first observation, to the best of my knowledge, that Anopheles mosquitoes can feed on bats or marsupials. Importantly, all hosts identified in my study are endemic to New Guinea, where my samples were collected. For one of the bat sequences, I was not able to identify the exact species (as the most similar sequence in NCBI only had 94.4% identity), but my analyses revealed that it must be closely related to the megabat Dobsonia praedatrix. This illustrates that, even if the actual host has not been sequenced for the locus of interest, my approach can still reveal its presence and guide future studies to obtain more precise taxonomic information.

There are, however, some limitations to this approach. First, the primers may not enable the exact species to be identified: I estimated that 14% of the mammalian species do not have a unique DNA sequence at the locus amplified, and the sequencing may therefore not enable differentiation among several closely related species. However, this limitation could easily be overcome by designing species-specific primers for a more variable region (e.g. the mt hypervariable region). Second, since I am comparing DNA sequences to the NCBI nt database, there is the possibility of matching incorrectly annotated sequences or pseudogenes, which could introduce spurious results. For example, one of the amplified DNA sequences that matched many pig DNA sequences perfectly (Sus scrofa, Sus barbatus, Sus philippensis, Sus celebensis and Sus verrucosus) was also identical to a thrips DNA sequence (Scolothrips takahashii). This instance possibly represents a case of misannotation in NCBI but could be problematic without stringent quality controls. Similarly, several DNA sequences matched human and gorilla, chimpanzee or orangutan DNA sequences equally well and likely represent amplification of nuclear pseudogenes (numts) common to apes. Typically, these sequences were supported by a much lower number of reads (on average, 411) than DNA sequences that perfectly matched Homo sapiens mtDNA (on average represented by 70,405 reads) (Table 19).
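One way to guard against the database issues described above is to collapse each set of equally good matches to a general host name and to ignore taxa suspected of being misannotated. The Python sketch below illustrates this idea using the groupings visible in Table 19 (for example, all Sus, Potamochoerus and Babyrousa matches are listed under "Pig"). The genus-to-host mapping, the list of suspect taxa and the function name are assumptions made for this illustration and do not describe the quality controls actually applied in this study.

```python
from typing import Optional

# Genus (or family) name -> general host name, following the groupings in Table 19.
GENUS_TO_HOST = {
    "Canis": "Dog", "Canidae": "Dog",
    "Homo": "Human", "Pan": "Human", "Pongo": "Human", "Gorilla": "Human",
    "Sus": "Pig", "Potamochoerus": "Pig", "Babyrousa": "Pig",
    "Mus": "Mouse",
    "Dobsonia": "Bat",
}

# Database entries treated as likely misannotations (assumption based on the text).
SUSPECT_TAXA = {"Scolothrips takahashii"}


def general_host(hit: str) -> Optional[str]:
    """Map a slash-separated list of equally good matches, written as in Table 19,
    to a single general host name. Returns None when the matches are unknown
    or point to more than one host group."""
    taxa = [t.strip() for t in hit.split("/") if t.strip()]
    taxa = [t for t in taxa if t not in SUSPECT_TAXA]
    hosts = {GENUS_TO_HOST.get(t.split()[0]) for t in taxa}
    if len(hosts) == 1 and None not in hosts:
        return hosts.pop()
    return None


print(general_host("Sus scrofa/Sus barbatus/Scolothrips takahashii/Sus philippensis"))
print(general_host("Pan troglodytes/Homo sapiens"))
```

Returning None for conflicting or unknown matches keeps ambiguous sequences visible for manual inspection instead of silently assigning them to a host.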

Finally, my approach enables simultaneous processing of batches of 96 samples with minimal hands-on time (7-9 hours for the whole laboratory work). This provides a unique throughput that is essential to analyze several hundred mosquitoes for well-powered comparisons. In addition, the high multiplexing of my approach dramatically reduces the cost of high-throughput sequencing (to less than US$10 per sample), especially when combining the characterization of the blood meal composition with other data such as intra-species host characterization (see below), molecular species determination or genotyping.

2. DNA profiling of human maternal lineages from field collected mosquitoes

Previous studies have used microsatellites to compare the attractiveness of different individuals or groups of individuals (Ansell et al. 2002; De Benedictis et al. 2003; Scott et al. 2006), examine the blood feeding patterns of mosquitoes (Darbro et al. 2007; Koella et al. 1998; Norris et al. 2010) or determine the effectiveness of insecticide treated bed nets (Gokool et al. 1993; Gokool et al. 1992; Norris & Norris 2013; Soremekun et al. 2004). DNA profiling with microsatellites allows for the identification of unique genetic profiles from the human individuals fed on and can be a very powerful method to differentiate DNA from unrelated individuals. However, microsatellites can only detect the simultaneous presence of DNA from multiple individuals (typically two) if their proportions in one sample are relatively similar. Otherwise, the signal from the less abundant DNA is typically swamped and indistinguishable from background noise. Rigorously identifying whether a disease vector feeds on a single or multiple individuals is, however, essential for disease control, as vectors that feed on multiple individuals are more likely to rapidly spread the disease than those that only feed on a single individual.

As an alternative to microsatellites, my approach relies on identifying unique human mitochondrial haplotypes carried by a mosquito by analyzing 300 bp of the mt hypervariable region I. I showed that at least five mosquitoes (out of 102 analyzed) carried human mitochondrial DNA sequences from more than one person. It is important to emphasize here that this number of mixed human blood meals is clearly underestimated, as only maternal lineages can be detected by this approach: all offspring will carry the same DNA sequence as their mother and will be indistinguishable. However, one could, at least partially, circumvent this limitation by including additional polymorphic nuclear loci in the assay and sequencing them together with the mt hypervariable region locus (and the 16S rRNA). Overall, my approach allows for a rapid evaluation of the number of maternal lineages a mosquito has fed on that can be added to the characterization of the blood meal at no additional cost, and could be used to preliminarily identify individuals to whom mosquitoes are more attracted.

3. Host feeding patterns of An. punctulatus mosquitoes in Papua New Guinea

Using my targeted sequencing approach, I was able to preliminarily characterize the host feeding patterns of four vectors of malaria in PNG. I found that An. farauti s.s. was the most zoophilic (p<0.001) of the AP species examined, which is consistent with previous studies (Burkot et al. 1988). However, for the other AP species examined, I did not detect any statistically significant difference in their feeding patterns, suggesting that the majority of the AP species may be generalists; this would be consistent with the findings of both Charlwood’s (Charlwood et al. 1985) and Burkot’s (Burkot et al. 1988) blood meal studies conducted in the 1980s. Note, though, that of the AP species examined in Burkot’s study, An. punctulatus s.s. and An. koliensis were the most anthropophilic, but still fed on multiple blood hosts (Burkot et al. 1988). In my study, An. koliensis and An. farauti 4 fed predominantly on human hosts, but these observations were not statistically significant. This could indicate that these species are more anthropophilic than the other AP species, but further data regarding host availability are needed. Indeed, if the AP species are generalists, the best predictor of AP mosquitoes’ feeding patterns would be the availability and density of mammalian hosts in a village. Under this scenario, in villages where the majority of blood meals were from human hosts, for example in Wasab and Kokofine, I would expect proportionally more human hosts than non-human hosts to be present. In contrast, in the village of Mirap, where 76% of An. farauti s.s. mosquitoes fed on non-human hosts (predominantly pigs), I would expect a higher proportion of pigs compared to human hosts. However, I currently do not have any information as to the availability and density of hosts in any of the villages. Therefore, although An. farauti s.s. was determined to be the most zoophilic of the AP species, it is not possible to determine if this is due to (i) host preference or (ii) host availability/density. Further studies will need to be conducted that not only examine the host blood meals of AP mosquitoes, but also take into account host availability and density to determine if these mosquitoes are generalists. Additionally, these studies will need to examine the feeding patterns of AP mosquitoes over a broader geographic range in PNG to assess whether feeding patterns differ between geographic locations, which could indicate that subpopulations of AP species exist that display distinct feeding patterns.

Another explanation for the host feeding patterns observed could be the existence of subpopulations of mosquitoes that display different feeding patterns. For example, one subpopulation may prefer to feed on humans while another prefers dogs regardless of host availability. However, my analysis of the allele frequencies of 12 polymorphic loci for one member of the AP group, An. punctulatus s.s., collected from the villages of Dimer and Wasab, did not reveal any evidence of population structure, but instead suggested that the An. punctulatus s.s. population was panmictic. This further supports the hypothesis that the host feeding patterns of at least An. punctulatus s.s. may be the result of host availability. However, further, more comprehensive studies will need to be conducted.

There are several limitations to my analysis of population structure in An. punctulatus s.s. mosquitoes. First, I evaluated population structure by genotyping 12 SNPs and the haplotypes associated with them. With so few markers, it is likely that I can only detect structure in a highly diverged population and, consequently, will likely not be able to detect structure within a more recently diverged population where gene flow could still be occurring between subpopulations. Additionally, the filtering criteria I used selected for common polymorphisms within a population and may not have included rare variants that could be more informative, particularly for subpopulations that have recently diverged or are in the process of diverging (within An. punctulatus s.s.). However, the removal of rare variants was necessary, as it is difficult to distinguish rare variants from sequencing artifacts (i.e. sequencing errors), and even without removing these variants I would likely still have too few markers to examine population structure. Lastly, I was only able to identify heterozygous sites for An. punctulatus s.s. and An. farauti 4, as these species were the most extensively sequenced (see Table 7). It is important that further genomic data from wild-caught AP species be generated to achieve the sequencing coverage necessary to generate a comprehensive list of polymorphic sites. These SNP data could be used to better assess if any population structure exists within AP species. In fact, to thoroughly evaluate if subpopulations of AP mosquitoes exist that display different feeding patterns, hundreds of polymorphisms within each AP species may be needed.

V. Chapter 5

Project Summary

A. Summary

Anopheles mosquitoes are distributed worldwide and are the vectors of devastating human diseases including malaria, filariasis and arboviral diseases that result in high mortality and morbidity globally. In the Western Pacific region, 70% of the reported malaria cases occur in Papua New Guinea (PNG) and >95% of the population in PNG lives at high risk of contracting malaria (WHO 2014). Currently, malaria control campaigns are ongoing in PNG and, in combination with the administration of antimalarial drugs, long-lasting insecticide treated bednets (LLINs) are the predominant method employed to decrease malaria transmission in the country. Since a call was made to eliminate malaria (2007; Roberts & Enserink 2007), there has been widespread distribution of LLINs in PNG and, as of 2014, it is estimated that 68% of the households in PNG own at least one LLIN (WHO 2014). Past attempts to eliminate malaria transmission by targeting Anopheles mosquitoes had very limited success and were abandoned due to the widespread occurrence of insecticide resistance and logistical constraints (Najera et al. 2011; WHO 1973). Given the push to eliminate malaria, it is important that a better understanding of the biology of Anopheles mosquitoes is acquired and utilized to promote the effective deployment of vector control measures and to understand disease transmission by the main disease vectors.

In PNG, members of the Anopheles punctulatus (AP) group are the main vectors of malaria and are commonly found throughout mainland PNG. There are currently 13 known species within the AP group, five of which are considered major disease vectors: An. punctulatus s.s., An. koliensis, An. farauti s.s., An. hinesorum and An. farauti 4. These vectors are widely distributed across PNG and are commonly found in the same larval habitats (Cooper et al. 2002). Studies of this group are complicated by the existence of cryptic species, the lack of genetic data (only 7 loci available) and difficulties associated with establishing and maintaining mosquito colonies for laboratory experiments. In addition to these challenges, only two studies have examined the feeding patterns of these species. As malaria elimination campaigns are in progress in PNG, it is important that a better understanding of the species relationships, reliability of species definitions and feeding patterns of these species is acquired.

The purpose of this dissertation is to gain a better understanding of the biology of the main disease vectors of the AP group by circumventing the current limitations and lack of genetic data for this group, using high-throughput sequencing strategies to generate genomic data for AP species. I used these data to examine the divergence between AP sibling species and to investigate whether introgression is occurring between these sibling species. In addition, I utilized high-throughput sequencing strategies to develop an unbiased method to identify any mammalian blood meal from female Anopheles mosquitoes and applied this technique to assess the feeding patterns of AP species. I relate how my findings may aid in the implementation of vector control strategies and the evaluation of disease transmission in Papua New Guinea.

1. Generation of genomic data for four wild-caught An. punctulatus species

In Chapter 2 of this dissertation I sequenced and assembled the first genomes of four wild-caught disease vectors from the Madang Province in PNG: An. punctulatus s.s., An. koliensis, An. farauti s.s. and An. farauti 4. Note that the genomes of several Anopheles mosquitoes were recently assembled; however, only one member of the An. punctulatus group, An. farauti s.s., was included (Neafsey et al. 2015). Initially, I attempted to assemble these genomes by aligning the sequencing reads of each AP species to the An. gambiae genome, which was the only available Anopheles genome at the start of my dissertation research. However, only ~1% of the sequencing reads aligned to An. gambiae, indicating that the AP species are highly diverged from this African mosquito and that de novo assembly methods were required to assemble the genome sequence of the AP species.

Furthermore, I found that the divergence between AP species precluded using any AP species as a reference genome for the assembly of additional AP species, as only 29-35% of DNA sequencing reads aligned between these sibling species. Instead, I had to independently de novo assemble the genome of each AP species. Overall, I assembled between 146 and 161 Mb of DNA sequence for each AP species. Given the divergence between these species, the genomes of additional AP species will also have to be assembled de novo. In addition to generating genome assemblies, I identified thousands of DNA polymorphisms for the two most extensively sequenced AP species (Table 13), An. punctulatus s.s. (318,375 SNPs) and An. farauti 4 (164,081 SNPs).

2. Anopheles punctulatus species are deeply diverged and reproductively isolated

In Chapters 2 and 3 of this dissertation I utilized the genomic data generated for AP species to examine the divergence among species and to investigate whether introgression was occurring. In Chapter 2, I reconstructed the evolutionary history of the AP species using 82 Mb of orthologous nuclear DNA sequence from four AP species (Figure 9) and ~11 kb of mitochondrial protein coding sequence from five AP species (Figure 6). Both nuclear and mitochondrial phylogenies revealed that the AP species rapidly diverged from each other, based on the short internal branches observed between species, and have been evolving independently since. This deep divergence was also supported by comparative genomic analyses between species (Table 9 and Table 12).

I estimated the divergence times of AP species using the previously published divergence time of Drosophila and Anopheles (Gaunt & Miles 2002) and mitochondrial protein coding sequences of species from the Anopheles, Drosophila, Aedes and Culex genera (Table 5). This analysis revealed that AP species diverged from each other 25-54 mya (Table 5 and Figure 7), much earlier than previously reported (Ambrose et al. 2012; Beebe et al. 2000b). The deep divergence of the AP group does not appear to be the result of a single highly divergent species, as most of the species analyzed diverged from an ancient ancestor and only two species (An. farauti s.s. and An. hinesorum) share a more recent common ancestor, 5-17 mya. These findings suggest that AP species inhabited PNG before the arrival of humans. Despite this, AP species have adapted to human habitats, particularly An. punctulatus s.s., which is readily found in temporary, man-made larval habitats (Cooper et al. 2002). The deep divergence between AP species also suggests that they are reproductively isolated. However, they have overlapping geographic distributions and are commonly found in the same larval habitats across PNG (Cooper et al. 2002).

In Chapter 3 of this dissertation I used the genomic data generated for four AP species to investigate whether recent introgression was occurring among any of these sibling species. I examined genome-wide patterns of divergence between AP sibling species and the distribution of genetic heterozygosity within a single diploid individual. Overall, my analysis of 31,312 independent loci, accounting for 71 Mb of DNA sequence, did not reveal any evidence of recent gene flow. Further analyses of the distribution of genetic diversity across 51 Mb of DNA sequence from one An. punctulatus s.s. mosquito and one An. farauti 4 mosquito indicated that, if any gene flow occurred between these two species, it was extremely rare or happened in the very distant past: only 467 shared polymorphisms were observed, compared to the 318,375 and 164,081 private polymorphisms identified within An. punctulatus s.s. and An. farauti 4, respectively (Table 13). In fact, my simulations of two diverged populations with varying amounts of gene flow and time since the last gene flow event revealed that the migration of more than one mosquito per generation (i.e. more than one introgression event per generation) during the last 11,000 years, assuming 12 generations per year (Barbosa et al. 2011; Beserra et al. 2006; Chandre et al. 2000), could be excluded given my data (Figure 17). These findings show that there is no evidence of recent introgression among the An. punctulatus sibling species analyzed here in the Madang Province of PNG, and these species can be considered reproductively isolated.
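The figures quoted in the preceding paragraph can be put on a common scale with a few lines of arithmetic. The sketch below only restates numbers given in the text (11,000 years, 12 generations per year, and the shared and private polymorphism counts); the variable names are illustrative and it does not reproduce the simulations themselves.

```python
# Values taken from the paragraph above; names are illustrative only.
YEARS_EXCLUDED = 11_000      # period over which >1 migrant/generation is excluded
GENERATIONS_PER_YEAR = 12    # assumption stated in the text

shared = 467                 # polymorphisms shared by both species
private_ap = 318_375         # private to An. punctulatus s.s.
private_af4 = 164_081        # private to An. farauti 4

# Convert the time window from years to mosquito generations.
generations_excluded = YEARS_EXCLUDED * GENERATIONS_PER_YEAR

# Fraction of all observed polymorphisms that are shared between species.
total_polymorphisms = shared + private_ap + private_af4
shared_fraction = shared / total_polymorphisms

print(f"Generations covered: {generations_excluded:,}")
print(f"Shared polymorphisms: {shared_fraction:.4%} of {total_polymorphisms:,}")
```

Expressed this way, the 11,000-year window corresponds to 132,000 generations, and shared polymorphisms make up less than 0.1% of all polymorphisms observed in the two individuals.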

Since I sequenced wild-caught mosquitoes, I was able to use the genome-wide diversity data generated for An. farauti 4 and An. punctulatus s.s. mosquitoes to examine their demographic history in Chapter 3 of this dissertation. My analysis revealed that both species shared a similar demographic history until 52,000 years ago, when their histories diverged: the An. punctulatus s.s. population expanded continuously, while the An. farauti 4 population fluctuated until a recent and dramatic decrease in population size occurred (Figure 18). One possible explanation for these observations is the arrival of humans in PNG ~50,000 years ago (Lilley 1992), which could have favored An. punctulatus s.s. as it readily breeds in transient, man-made bodies of water, for example water containers and bowls (Charlwood et al. 1986; Cooper et al. 2002). Alternatively, as An. punctulatus s.s. is considered the most anthropophilic member of the AP group, it is possible that, after humans arrived in PNG, An. punctulatus s.s. had more available blood hosts than An. farauti 4, resulting in the dramatic increase in its population size (Beebe et al. 2013). It is also possible that the differences in population sizes are the result of population structure; however, there does not appear to be any evidence of population structure within An. punctulatus s.s. populations in the Madang Province, based on An. punctulatus s.s. mosquitoes collected in Dimer and Wasab (Figure 28). Note, however, that I examined only 12 SNPs in this analysis and studies using additional SNPs are likely needed to verify this result. The genetic polymorphism data generated in this dissertation could be used to conduct population genetic studies examining whether any evidence of population stratification exists among AP sibling species populations across PNG.

3. Development of an unbiased method to identify mammalian hosts fed on by An. punctulatus mosquitoes

In Chapter 4 of this dissertation I developed a targeted high-throughput sequencing technique to analyze the mammalian blood meal composition of individual mosquitoes in an unbiased manner. I validated in silico that universal primers targeting the mammalian mitochondrial 16S ribosomal RNA (16S rRNA) gene should amplify 98.5% of the mammalian 16S rRNA sequences present in the NCBI nucleotide database. I also examined the specificity of these primers and found that a 16S rRNA sequence amplified by these primers typically matched a single genus and, in 86% of the cases, a single species (Table 16). I successfully applied this targeted high-throughput sequencing technique to field-collected An. punctulatus mosquitoes from five villages in the Madang Province of PNG and demonstrated the ability of this assay to identify both expected (human, dog, pig, mice) and unexpected (bats, common spotted cuscus) blood hosts (Table 19). This assay also enabled the identification of mosquitoes that had fed on multiple mammalian species (i.e. mixed blood meals), even when a host contributed only a small proportion of the entire blood meal (Figure 26A-E). However, as host DNA from a blood meal can be detected up to 24 to 30 hours post-feeding in mosquitoes (Kent & Norris 2005), it is possible that, in some of the mixed blood meals, I detected host DNA from a previous blood meal.
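To illustrate how a mixed blood meal can be summarized once each 16S rRNA read has been assigned to a host, the following minimal sketch tallies host labels per mosquito and reports each host's share of the reads. It is not the pipeline used in Chapter 4: the function name, the example read counts and the 1% reporting threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def blood_meal_composition(read_hosts, min_fraction=0.01):
    """Return {host: fraction of reads} for hosts at or above min_fraction.

    read_hosts is a list with one host label per sequencing read (the label
    is assumed to come from an upstream taxonomic assignment step).
    min_fraction is a hypothetical noise threshold, not a published value.
    """
    counts = Counter(read_hosts)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        host: n / total
        for host, n in counts.items()
        if n / total >= min_fraction
    }


# Hypothetical mosquito: mostly human reads plus a minor pig component.
reads = ["human"] * 940 + ["pig"] * 55 + ["dog"] * 5
composition = blood_meal_composition(reads, min_fraction=0.01)
is_mixed = len(composition) > 1  # more than one host above the threshold

print(composition)
print("mixed blood meal:", is_mixed)
```

In this made-up example the pig component (5.5% of reads) is retained while the dog reads (0.5%) fall below the threshold; the appropriate threshold in practice would depend on sequencing depth and contamination controls.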

In addition to comprehensively evaluating the blood meals of mosquitoes, I also showed how this approach could be adapted to determine the number of human individuals fed on by a single mosquito. Analysis of the human mitochondrial hypervariable region I revealed that five An. punctulatus mosquitoes unambiguously fed on more than one person (Table 21 and Figure 29B). Since the human mitochondrial genome was targeted, it is important to emphasize that the number of mixed human blood meals is likely underestimated, as only maternal lineages were detected and all offspring carry the same DNA sequence as their mother. However, this limitation could be overcome by adding a polymorphic nuclear locus, which could easily be sequenced together with the mitochondrial hypervariable locus and the mammalian mitochondrial 16S rRNA amplicons. Overall, this approach allows for the rapid evaluation of the number of maternal lineages a mosquito has fed on and can be added to the characterization of host blood meals at no additional cost. The results of these types of analyses could be readily used to understand disease transmission by mosquitoes.
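The reasoning above, that distinct mitochondrial haplotypes give a lower bound on the number of people fed on, can be sketched as follows. This is an assumption-laden illustration rather than the analysis used in Chapter 4: the function name, the toy sequences and the 5% minimum read fraction used to discard likely sequencing errors are hypothetical.

```python
from collections import Counter


def min_maternal_lineages(hvr1_reads, min_fraction=0.05):
    """Count distinct HVR1 haplotypes supported by enough reads.

    The result is a lower bound on the number of human individuals:
    maternally related people share a haplotype and are counted once.
    min_fraction is a hypothetical cut-off to ignore error-derived reads.
    """
    counts = Counter(hvr1_reads)
    total = sum(counts.values())
    if total == 0:
        return 0
    return sum(1 for n in counts.values() if n / total >= min_fraction)


# Toy reads: two well-supported haplotypes and one rare, likely erroneous one.
reads = ["ACGTTA"] * 600 + ["ACGCTA"] * 380 + ["ACGTTT"] * 20
print(min_maternal_lineages(reads))

# A mother and child would contribute identical reads and count as one lineage.
print(min_maternal_lineages(["ACGTTA"] * 1000))
```

Because the count can only merge individuals, never split them, adding a polymorphic nuclear locus as suggested above would tighten this lower bound.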

Using my high-throughput sequencing technique, I was able to preliminarily characterize the feeding patterns of An. punctulatus species from five villages in the Madang Province of PNG (n=314). Comparative analyses of the feeding patterns of the AP mosquitoes revealed that An. farauti s.s. was the most zoophilic of the AP species examined here (p<0.001) (Figure 27). For the other AP species, there were no statistically significant differences in feeding patterns, suggesting that these species are unspecialized in their feeding preferences (Table 17). These results suggest that the best predictor of an AP species’ feeding pattern is the availability and density of mammalian hosts at each village. However, as I do not have any data on the availability and density of mammalian hosts in these villages, these results need to be interpreted cautiously and further studies will need to be conducted to verify my findings. Additionally, without host density data, it is not possible to determine whether the apparent preference of An. farauti s.s. for non-human hosts is indeed a preference or the result of host availability. Overall, the blood meal identification method I described in this dissertation will enable further comprehensive examinations of Anopheles feeding patterns and could be employed to assess the feeding patterns of other insect disease vectors.
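As an illustration of how human versus non-human feeding proportions might be compared between two species, the sketch below computes a Pearson chi-square statistic for a 2x2 table. The text does not state which test produced the reported p-value, so the choice of test is an assumption, and the counts are invented for the example rather than taken from Table 17 or Figure 27.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    table [[a, b], [c, d]], e.g. rows = species, columns = host class."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must contain observations")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: (human meals, non-human meals) for two species.
species_a = (20, 40)   # invented values for a more zoophilic species
species_b = (60, 30)   # invented values for a comparison species

stat = chi_square_2x2(*species_a, *species_b)

# Critical value of the chi-square distribution, 1 degree of freedom, p = 0.001.
CRITICAL_P001_DF1 = 10.828
print(f"chi-square = {stat:.3f}, exceeds p=0.001 threshold: {stat > CRITICAL_P001_DF1}")
```

A significant statistic of this kind shows only that the observed host proportions differ; as noted above, it cannot separate true host preference from differences in host availability between villages.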

B. Contribution of results to vector control initiatives and disease transmission

Since mosquitoes were identified as the vectors of human diseases, they have been the primary targets of programs aimed at reducing disease transmission globally. Throughout history various methods have been used to control mosquito populations, including destruction of larval habitats, spraying of insecticides and, more recently, the deployment of insecticide-treated bed-nets. Efforts to eliminate the transmission of malaria have had varied success: malaria was eliminated from North America, the former Soviet Union and Europe during the mid-20th century as part of malaria eradication programs (Mabaso et al. 2004; Najera et al. 2011) but, in other more malarious regions, this program was abandoned due to the occurrence and widespread distribution of insecticide resistance in Anopheles and logistical issues (Najera et al. 2011; WHO 1973). Today, the primary method to control the transmission of malaria, and other vector-borne diseases, is still the deployment of vector control methods. Recently, as a result of a call to eliminate malaria transmission, there has been a widespread distribution of long-lasting insecticide-treated bed-nets (LLINs) and an increase in the indoor spraying of insecticides (IRS) globally (2007; Roberts & Enserink 2007). However, as insecticide resistance is widespread and continues to be a substantial challenge, it is important that a better understanding of the biology of Anopheles mosquitoes, and of the genetic factors associated with insecticide resistance and avoidance of LLINs, is acquired in order to effectively implement vector control strategies. To address these questions, genomic data need to be generated for Anopheles mosquitoes.

In Papua New Guinea, as part of the National Malaria Control Program, LLINs have been distributed across the country. In addition to LLINs, widespread indoor residual spraying (IRS) campaigns are being planned as part of malaria elimination programs in PNG (Kazura et al. 2012). Given the widespread distribution of vector control measures, and the goal to eliminate malaria transmission in PNG, it is important that the species definitions and genetic diversity of members of the An. punctulatus group are examined to promote the successful implementation of vector control strategies and to understand disease transmission in PNG.

1. Vector control

In this dissertation I sought to examine the robustness of species definitions and the genetic diversity of members of the An. punctulatus group by employing high-throughput sequencing techniques to generate genomic data for four of the main disease vectors in PNG. My genomic analyses, in Chapters 2 and 3 of this dissertation, revealed that members of the An. punctulatus group are deeply diverged and that there is no evidence of gene flow occurring between these sibling species. This suggests that insecticide resistance alleles are unlikely to spread across species and, instead, will have to evolve independently in each species, which is encouraging for vector control programs in PNG. Currently, no insecticide resistance alleles have been reported for An. punctulatus mosquitoes (Henry-Halldin et al. 2012), but the widespread deployment of vector control measures (Hetzel et al. 2014; Kazura et al. 2012) could rapidly change this situation. Therefore, it is important that continuous monitoring of insecticide resistance is conducted for each of the disease vectors in PNG.

Additional genomic analyses conducted in Chapter 3, which examined the population sizes and dynamics of An. punctulatus s.s. and An. farauti 4, revealed distinct demographic histories between these species. My findings show that, historically, these species have responded very differently to environmental changes. For example, from around 52,000 years ago (630,957 generations ago) the population size of An. punctulatus s.s. continued to increase while the population of An. farauti 4 fluctuated before suffering a dramatic decrease in size. These findings suggest that these species may respond differently to current environmental perturbations such as the deployment of vector control measures. Additionally, since the population size of a species directly influences the probability of an advantageous mutation being swept to fixation in a population, the larger effective population size of An. punctulatus s.s. could suggest that any insecticide resistance alleles would reach fixation faster than in An. farauti 4. Additional genomic studies of other An. punctulatus species across PNG could be informative for understanding how different populations may respond to the deployment of insecticides.
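One way to make the link between population size and the spread of an advantageous allele concrete is Kimura's diffusion approximation for the fixation probability of a new mutation, combined with the rate at which such mutations arise. The sketch below is an illustration under strong simplifying assumptions that are not part of the dissertation's analyses: a diploid population whose census size equals its effective size, an additive mutation starting at frequency 1/(2N), and hypothetical values for the selection coefficient s and the mutation rate mu.

```python
import math


def fixation_probability(n_e, s):
    """Kimura's approximation for a new additive mutation in a diploid
    population of effective size n_e with selection coefficient s."""
    if s == 0:
        return 1.0 / (2.0 * n_e)  # neutral case: initial frequency
    # (1 - exp(-2s)) / (1 - exp(-4 * n_e * s)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * n_e * s)


def waiting_generations(n_e, s, mu):
    """Expected generations until a mutation destined to fix arises,
    given 2 * n_e * mu new copies of the mutation per generation."""
    rate = 2.0 * n_e * mu * fixation_probability(n_e, s)
    return 1.0 / rate


# Hypothetical parameters chosen only to show the scaling with population size.
S, MU = 0.05, 1e-9
small = waiting_generations(1e4, S, MU)
large = waiting_generations(1e6, S, MU)
print(f"small population: {small:,.0f} generations")
print(f"large population: {large:,.0f} generations")
```

Under these assumptions the per-mutation fixation probability is nearly identical in the two populations, but the larger population produces candidate mutations 100 times more often, so a resistance allele destined to fix is expected to appear correspondingly sooner.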


2. Disease transmission

Anopheles mosquitoes display varying degrees of disease transmission potential depending on a number of factors, including their degree of anthropophily, ability to transmit disease, location and density in relation to humans, longevity, and the incubation period of the parasite. An important component of evaluating the disease transmission potential of mosquitoes is understanding their feeding patterns, as these influence the diseases they transmit, the host(s) they infect, and the rate of disease transmission.

In Chapter 4 of this dissertation I developed an unbiased technique to identify the mammalian blood meals of Anopheles mosquitoes and to detect the number of human individuals fed on by a single mosquito. This technique enables comprehensive analyses of the feeding patterns of Anopheles mosquitoes to be conducted and could provide important insights into the feeding behaviors of mosquitoes. The ability of this technique to detect any mammalian hosts fed on enables the identification of species that are unspecialized (i.e. they feed on a variety of mammalian hosts). These species are less likely to transmit human specific diseases since they feed on a variety of hosts. However, in contrast, these species are more likely to acquire and transmit zoonotic diseases from reservoir hosts. In these instances, my technique enables the identification of possible mammalian reservoir hosts. Furthermore, by adapting this technique to estimate the number of humans fed on, this technique can identify mosquitoes that feed on multiple human individuals and are more likely to rapidly spread disease compared to those that feed on a single individual.


C. Utility of high-throughput sequencing for understudied species

It is difficult to study the biology of non-model organisms as they may not be easily adapted to laboratory conditions, may have limited genetic information available or may not be amenable to manipulation using traditional techniques. These limitations inhibit the acquisition of information that could have important implications for the transmission of human diseases, for understanding the evolution of various traits among lineages (i.e. species or populations) and how the lineages are related, and for identifying adapted phenotypes among lineages. In this dissertation I employed high-throughput sequencing techniques to better understand the biology of an understudied group of mosquitoes, the An. punctulatus group, for which limited genetic data were available and laboratory studies were difficult to conduct. I was able to leverage the genomic and genetic data generated in my dissertation to understand the species relationships, extent of introgression and feeding patterns of the main disease vectors in the An. punctulatus group. The high-throughput and genomic approaches I used here are powerful tools that can be employed to examine the species relationships, demographic histories and adaptive phenotypes of other medically important organisms.

For many understudied organisms there are few genetic loci or genes available. This makes it difficult to determine the species relationships or population history of these species, as a few gene trees may not represent the actual evolutionary history of the species examined. In contrast, the generation of genomic data using high-throughput sequencing methods enables phylogenetic analyses of thousands of independent loci, representing megabases of DNA sequence, to be conducted to robustly determine the evolutionary history of the species examined. This could enable analyses aimed at determining whether certain traits evolved once in an ancestral population or were acquired independently in different species. This is particularly important for understanding the evolution of traits associated with the transmission of disease by insect vectors.

Genomic data generated by high-throughput sequencing techniques can also be used to conduct population genetic analyses. Genome-wide patterns of genetic diversity can be examined to evaluate whether introgression is occurring between species, to examine the demographic history of a species (e.g. the effective population size) and to identify adaptive loci that are under positive selection or have recently undergone a selective sweep in a population. Additionally, the genealogy of a species can be examined from a single individual, since thousands of independent loci are analyzed across the genome. This means that, despite the initial costs of genome sequencing, fewer individuals can be sequenced, preliminarily, to identify populations that may be of interest for future studies.

The generation of genomic data for a species also enables the identification of possibly thousands of SNPs that can be used to conduct future genetic and population genetic studies without the need for generating additional genome sequences. The SNPs generated in these studies could be used to determine whether subpopulations of a species exist that display distinct genetic differences across a geographic region of interest, or as markers for future studies. For example, known polymorphisms in the voltage-gated sodium channel gene (VGSC) and the acetylcholinesterase gene (Alout & Weill 2008; Weetman et al. 2015) that confer resistance to insecticides are routinely used to monitor insecticide resistance in mosquito populations (Henry-Halldin et al. 2012).

Overall, genomic analyses are very powerful methods that enable the biology of understudied organisms to be evaluated and can provide enormous amounts of data that can be used to examine the evolutionary and demographic histories of a species and identify adapted loci. However, genomic techniques do not replace but, instead, complement traditional techniques. As laboratory studies use organisms that have been adapted to laboratory conditions and are typically inbred, it is not possible to determine whether the results of these studies are representative of what is occurring in natural populations of the organism.

However, genomic techniques can be employed to investigate what is occurring in wild populations. For example, in this dissertation I was able to confirm the results of previous laboratory experiments, which found that F1 hybrids of AP sibling species colonies were sterile, by generating genomic data for four wild-caught AP sibling species and examining the divergence and diversity of these species across ~70 Mb of genomic data.

Targeted high-throughput sequencing of the mammalian mitochondrial 16S rRNA, which I used to identify the blood meals of mosquitoes in Chapter 4, can also be used to understand other vector-borne diseases. This technique can readily be employed to examine the blood feeding patterns of other disease vectors such as sand flies that transmit leishmaniasis, tsetse flies that transmit African trypanosomiasis and triatomine bugs, the vectors of Chagas disease. It is also easily adapted to identify non-mammalian hosts of interest, for example birds that are known to be reservoir hosts of West Nile virus, or to detect pathogens of interest. The latter could enable epidemiological surveys to be conducted to rapidly identify the prevalence of a pathogen within vector populations.

D. Future directions

The genomic data generated in this dissertation enables a seemingly endless number of research studies focused on members of the An. punctulatus group. In this dissertation I was able to generate genome sequences for four of the major disease vectors in PNG, but was unable to sequence the genome of the other major vector, An. hinesorum. Additionally, there are no genome assemblies available for AP species that are non-vectors or minor vectors. The generation of genome sequences for these AP species, and for multiple individuals from across PNG, would enable a robust, and comprehensive, examination of the species relationships of AP species.

It would also allow comparative genomic analyses to be conducted to identify gene(s) associated with specific ecological, behavioral and vectorial traits. Examples include identifying the genetic loci associated with the shared salinity tolerance of An. farauti s.s. and An. farauti 7, which has enabled the medically important vector An. farauti s.s. to thrive in the coastal regions of PNG (Foley & Bryan 2000), or genes associated with the ability to transmit malaria or lymphatic filariasis. Comparative analyses could also determine whether a genetic component is driving the changes in biting time of An. farauti s.s. in the Solomon Islands, where An. farauti s.s. has shifted from biting indoors to biting outside after the distribution of LLINs in the region, making it less susceptible to vector control measures (Bugoro et al. 2011).

The genomic analyses I conducted in this dissertation strongly suggest that the four members of the AP group sequenced are deeply diverged, reproductively isolated species that are evolving independently. However, as my analyses focused on mosquitoes collected from only the Madang Province of PNG, it is important that additional genomic studies are conducted to determine whether my findings apply to other AP species populations or, in contrast, whether subpopulations exist in PNG that are more closely related. Also, since I do not have a genome assembly for An. hinesorum, I was unable to determine whether introgression was occurring between An. hinesorum and An. farauti s.s., the two most closely related major vectors within the AP group. It would be advantageous to determine whether introgression is occurring between these species, as they are frequently collected from the same larval habitats and there is evidence, based on a single mitochondrial gene, of introgression occurring in Southern New Guinea (Ambrose et al. 2012). If introgression is occurring, this could complicate vector control programs in regions where these two species are present, as any alleles conferring insecticide resistance could be exchanged and rapidly spread between species.

Since I conducted high-throughput sequencing of AP genomes, I was able to identify thousands of SNPs for the two species that had the highest sequence coverage, An. farauti 4 and An. punctulatus s.s. However, I was not able to identify SNPs in An. farauti s.s. and An. koliensis. By sequencing additional samples, SNPs could be identified for these species and used to conduct population genetic studies aimed at determining whether subpopulations exist that display different traits. Additionally, SNP data could be employed to perform genome-wide scans to identify genetic loci under selection within a species or to examine the genetic differentiation of two species (e.g. inversions).

The targeted high-throughput sequencing technique I developed and applied to An. punctulatus mosquitoes collected from PNG provided a preliminary assessment of the feeding patterns of AP species. However, as samples were collected from a limited geographic range and data regarding host availability and density were not collected, additional studies need to be conducted to determine whether a blood feeding preference exists for AP species across PNG. For example, my analysis in this dissertation suggests that An. farauti s.s. prefers to feed on non-human hosts, but I was not able to determine whether this was due to host availability.

These types of studies are important as the findings have direct implications for understanding disease transmission in PNG. In addition, it would be of interest to use the SNPs generated here to determine whether any genetic differentiation exists among AP populations that display different feeding patterns across PNG.

In conclusion, by using high-throughput sequencing techniques, my dissertation has provided a wealth of genomic data for members of the An. punctulatus group and a novel tool to detect the blood meals of mosquitoes and identify the number of human individuals fed on by a single mosquito. My genomic analyses revealed that the AP species are deeply diverged, reproductively isolated species. The genomic data generated here are an invaluable resource that provides the opportunity to identify the genetic basis of traits associated with disease transmission and, overall, to better understand the biology of this medically important group of mosquitoes. The analyses here have laid a framework for how genomic-based studies can improve our understanding of understudied vector species and how this information can be used to better implement vector control strategies and understand disease transmission.

VI. Appendix A

Accession numbers and location of deposited sequencing data

1. Raw sequencing data for whole genome sequencing of nuclear genomes: NCBI SRA: SRP042363

2. Raw sequencing data for sequencing of mitochondrial genomes: NCBI SRA: SRP013853

3. Raw sequencing data for targeted high-throughput sequencing of 16S rRNA from Anopheles punctulatus blood meals (including human profiling) and genotyping of SNPs from AF4 and AP: NCBI SRA: SRP062959

4. DNA sequence assemblies of nuclear genomes of An. punctulatus s.s., An. koliensis, An. farauti s.s. and An. farauti 4: GenBank Accessions JXWZ00000000-JXXC00000000

5. Assembled mitochondrial genomes: GenBank Accessions JX219731-JX219744

6. SNP information for An. punctulatus s.s. and An. farauti 4: Dryad doi:10.5061/dryad.16hc8

VII. Appendix B

Distribution of the sequence coverage per contig for the initial genome assemblies


Appendix B. Distribution of the sequence coverage per contig for the initial genome assemblies (A-D) and after removing redundant contigs (E-H). The histograms display the data for An. farauti 4 (A and E), An. punctulatus s.s. (B and F), An. koliensis (C and G) and An. farauti s.s. (D and H). The red line represents the cut-off used to filter out putative paralogs from the assembly.

VIII. Appendix C

Origin and dispersal of Anopheles mosquitoes based on analysis of mitochondrial genome sequences

This section is published in the following manuscript:

Logue K, Chan ER, Phipps T, Small ST, Reimer L, Henry-Halldin C, Sattabongkot J, Siba PM, Zimmerman PA, Serre D. 2013. Mitochondrial genome sequences reveal deep divergences among Anopheles punctulatus sibling species in Papua New Guinea. Malar J. doi: 10.1186/1475-2875-12-64.

The predominant hypothesis regarding the origin of Anopheles mosquitoes predicts that they originated in western Gondwana (Krzywinski & Besansky 2003; Krzywinski et al. 2006) and that, by 95 mya, Anopheles had migrated into Africa. Ancestral Anopheles are predicted to have then colonized Europe and North America (via land bridges), and migrated through Asia into the Southwest (SW) Pacific. The topology of the tree in Figure 7 is globally consistent with this hypothesis. However, the position of North American mosquitoes (An. quadrimaculatus) relative to African and other non-American Anopheles remains unclear. In particular, the lack of mt genomes from European Anopheles precludes determining whether African Anopheles are ancestral to European and North American Anopheles or, alternatively, whether North American Anopheles derive directly from South American Anopheles. Additional sampling of mt genomes would provide better resolution of the early dispersal routes of Anopheles.

Regarding Southeast Asia (SEA) and the SW Pacific, it is generally believed that Anopheles from the SW Pacific derive from SEA mosquitoes (Beebe et al. 2000b; Foley et al. 1998; Krzywinski & Besansky 2003). The results of this study suggest that the AP group is most closely related to the An. dirus complex of SEA, consistent with an origin of the AP group in SEA. However, it cannot currently be determined whether SW Pacific and Australian Anopheles originate from an SEA ancestor, as hypothesized, or, alternatively, whether SW Pacific and SEA Anopheles have an Australian origin. The molecular dates in this study suggest that the ancestor of the AP group was present in PNG between 25 and 54 mya but do not allow rejecting either of these scenarios. Plate tectonic models show that the Australia/PNG plate, which separated from Gondwana during the Cretaceous, moved from a southern position in the Eocene to its current position near SEA (Scotese 2004). While the upper limit of the age of the AP ancestor (54 mya) corresponds to a time when the distance between PNG and SEA would not have allowed migration between these regions, the lower limit (~25 mya) corresponds to a time when the Australian plate had moved close enough to the Asian plate to enable possible migration of species between the two regions (Hall 2002). Inclusion of additional mt genome sequences, in particular from Anopheles complexes restricted to Australia (e.g., Anopheles annulipes), may allow better understanding of these early dispersal routes in SEA and the SW Pacific.

The monophyly of AP mosquitoes (Figure 6) suggests that they colonized PNG through a single migration event followed by speciation (as opposed to multiple migrations of pre-established species). This study suggests that the different AP sibling species diverged from each other 25–54 mya, much earlier than proposed in previous studies of the AP group (Ambrose et al. 2012; Beebe et al. 2000b). This deep divergence among AP mosquitoes is unlikely to be caused by a single species that could have diverged from the other sibling species in SEA and colonized PNG later. In fact, most of the AP sibling species are equally divergent from each other, suggesting rapid radiations of AP sibling species upon arrival in the SW Pacific area.

IX. Appendix D

Putative genes that are introgressing in An. punctulatus mosquitoes

Appendix D. Putative genes that are possibly introgressing in Anopheles punctulatus mosquitoes

Gene Locus  % ID  GenBank ID  Chromosome  Gene Start  Gene End
10034_locus_24457  88.39  AgaP_AGAP002648  2R  25,077,385  25,088,084
10150_locus_42780  82.21  AgaP_AGAP000529  X  9,483,283  9,488,015
10160_locus_36389  88.3  AgaP_AGAP012196  3L  38,651,386  38,651,796
10430_locus_22778  82.66  AgaP_AGAP007206  2L  44,286,271  44,287,612
10539_locus_43282  77.03  AgaP_AGAP009041  3R  24,535,055  24,538,821
10567_locus_35994  92  AgaP_AGAP000331  X  5,789,782  5,813,837
10610_locus_45190  88.99  AgaP_AGAP000631  X  11,356,530  11,363,878
10799_locus_25615  84.53  AgaP_AGAP005351  2L  14,446,023  14,450,861
10817_locus_15171  83.51  AgaP_AGAP009032  3R  24,358,504  24,360,683
10919_locus_38814  88.77  AgaP_AGAP009459  3R  33,814,493  33,815,820
1094_locus_40649  77.57  AgaP_AGAP011070  3L  16,241,949  16,263,775
11490_locus_39303  85.39  AgaP_AGAP000602  X  10,709,824  10,717,506
11569_locus_6750  77.67  AgaP_AGAP006214  2L  28,459,338  28,460,276
11652_locus_17773  80.23  AgaP_AGAP012121  3L  37,838,834  37,840,507
11680_locus_10050  90.62  AgaP_AGAP003115  2R  32,677,888  32,742,956
12552_locus_1576  79.45  AgaP_AGAP003560  2R  39,524,664  39,530,336
12624_locus_28594  82.83  AgaP_AGAP005213  2L  12,366,733  12,399,301
12657_locus_837  82.56  AgaP_AGAP011104  3L  16,990,272  16,991,898
12743_locus_5128  83.74  AgaP_AGAP009997  3R  47,430,060  47,430,884
12801_locus_24448  87.04  AgaP_AGAP010162  3R  49,749,722  49,757,997
13101_locus_7743  76.37  AgaP_AGAP004036  2R  48,260,053  48,265,961
13129_locus_10725  86.74  NKD_ANOGA  3R  45,349,133  45,390,739
13180_locus_44028  76.62  AgaP_AGAP003918  2R  46,261,308  46,263,981
13180_locus_44026  76.62  AgaP_AGAP003918  2R  46,261,308  46,263,981
13243_locus_6597  86.19  AgaP_AGAP000805  X  14,756,843  14,758,303
13255_locus_44602  78.88  AgaP_AGAP000453  X  7,904,010  7,913,810
13741_locus_33129  78.97  AgaP_AGAP001316  2R  2,915,593  2,934,805
13775_locus_34870  77.95  CPTC2  2L  19,097,966  19,098,322
14318_locus_15101  79.86  AgaP_AGAP003656  2R  41,180,953  41,215,391
14666_locus_43228  73.42  smurf
14929_locus_21365  87.89  AgaP_AGAP006638  2L  35,406,635  35,411,920
15035_locus_41784  83.61  AgaP_AGAP011189  3L  18,872,452  18,875,154
15118_locus_30356  77.01  AgaP_AGAP002166  2R  16,812,393  16,816,168
15179_locus_3226  83.41  AgaP_AGAP001037  X  19,938,332  19,941,402
15179_locus_16727  83.41  AgaP_AGAP001037  X  19,938,332  19,941,402
15236_locus_38322  87.67  AgaP_AGAP000360  X  6,486,709  6,500,088
15518_locus_6328  86.6  AgaP_AGAP002161  2R  16,611,022  16,635,610
15749_locus_18219  85.26  AgaP_AGAP007592  2L  48,060,644  48,082,917
15749_locus_20977  85.26  AgaP_AGAP007592  2L  48,060,644  48,082,917
15961_locus_26067  87.79  AgaP_AGAP013029  2R  31,316,494  31,321,779
16282_locus_8137  82.42  AgaP_AGAP011373  3L  22,316,104  22,333,447
16333_locus_40173  74.36  AgaP_AGAP011982  3L  35,849,053  35,850,604
16423_locus_23990  92.53  ATC1_ANOGA  2L  27,894,349  27,928,675
16566_locus_5003  82.06  AgaP_AGAP005659  2L  18,351,791  18,353,234
16716_locus_1145  88.48  AgaP_AGAP011315  3L  20,934,915  20,942,050
16762_locus_249  80.38  AgaP_AGAP004765  2L  3,156,536  3,158,788
1692_locus_8396  83.52  AgaP_AGAP003141  2R  33,183,329  33,200,001
1692_locus_1889  83.52  AgaP_AGAP003141  2R  33,183,329  33,200,001
16955_locus_15626  85.93  AgaP_AGAP002889  2R  28,937,967  28,938,827
16988_locus_1610  82.76  AgaP_AGAP003831  2R  44,192,743  44,197,614
17059_locus_17117  85.67  AgaP_AGAP000630  X  11,335,152  11,352,646
17065_locus_16626  85.96  AgaP_AGAP000041  X  512,138  517,675
1712_locus_17800  83.19  AgaP_AGAP002262  2R  18,298,863  18,307,591
17253_locus_16268  75  AgaP_AGAP012000  3L  36,049,525  36,050,591
17439_locus_22821  87.2  AgaP_AGAP007031  2L  41,165,091  41,222,640
177_locus_43314  80.62  AgaP_AGAP009677  3R  38,306,520  38,314,791
177_locus_12591  80.62  AgaP_AGAP009677  3R  38,306,520  38,314,791
17730_locus_6594  78.95  AgaP_AGAP005787  2L  20,649,734  20,650,289
17790_locus_34309  83.58  AgaP_AGAP010251  3R  51,658,337  51,662,103
17835_locus_37547  82.33  AgaP_AGAP003111  2R  32,505,726  32,508,690
18268_locus_11711  74.22  AgaP_AGAP004256  2R  53,330,421  53,368,510
18344_locus_30105  71.29  NADH dehydrogenase subunit 4 (ND4), tRNA-His gene, NADH dehydrogenase subunit 5 (ND5) gene  Mitochondria
18574_locus_11199  84.82  AgaP_AGAP002991  2R  30,635,986  30,637,455
18700_locus_2337  88.02  AgaP_AGAP010750  3L  9,789,482  9,818,643
18808_locus_5084  83.32  AgaP_AGAP009087  3R  25,493,915  25,496,935
1892_locus_28638  76.95  AgaP_AGAP013494  2R  37,973,160  37,976,060
1915_locus_37269  87.9  AgaP_AGAP003468  2R  38,042,427  38,056,441
19156_locus_522  80.93  AgaP_AGAP004854  2L  4,483,668  4,490,466
19162_locus_31460  90.22  AgaP_AGAP000184  X  3,012,698  3,021,784
19205_locus_418  79.4  AgaP_AGAP004090  2R  49,541,940  49,551,051
19205_locus_17447  79.4  AgaP_AGAP004090  2R  49,541,940  49,551,051
19388_locus_42649  90.33  AgaP_AGAP001633  2R  7,152,914  7,188,656
19524_locus_32633  81.46  AgaP_AGAP000215  X  3,572,299  3,581,714
19605_locus_18247  92.07  AgaP_AGAP002076  2R  14,826,967  14,831,432
1985_locus_24990  78.13  AgaP_AGAP006327  2L  29,733,363  29,734,698
19875_locus_15255  77.37  AgaP_AGAP004560  2R  57,603,436  57,608,335
20069_locus_2086  89.37  AgaP_AGAP005467  2L  15,963,587  15,980,283
20151_locus_37228  92.73  AgaP_AGAP010049  3R  48,276,330  48,303,583
20412_locus_36377  84.6  AgaP_AGAP001416  2R  4,586,513  4,587,286
20487_locus_31366  73.64  AgaP_AGAP007935  3R  3,148,064  3,167,010
2051_locus_13137  84.78  AgaP_AGAP011526  3L  26,521,434  26,531,738
20656_locus_34951  87.08  AgaP_AGAP000217  X  3,631,408  3,638,938
20679_locus_2689  92.58  AgaP_AGAP004372  2R  55,372,537  55,394,660
20843_locus_15913  84.58  AgaP_AGAP009954  3R  46,297,393  46,302,538
21054_locus_45864  88.19  ARM_ANOGA  X  20,100,627  20,158,206
21103_locus_15648  72.41  AgaP_AGAP002602  2R  23,823,922  23,825,471
21371_locus_43148  74.8  AgaP_AGAP003618  2R  40,568,743  40,575,981
21515_locus_28312  75.35  AgaP_AGAP010201  3R  50,614,016  50,616,187
21851_locus_41018  75.54  AgaP_AGAP006373  2L  30,566,598  30,573,592
2200_locus_13337  78.82  AgaP_AGAP006702  2L  37,071,278  37,079,246
22008_locus_4203  86.78  AgaP_AGAP003506  2R  38,764,084  38,772,048
22046_locus_17791  88.42  AgaP_AGAP005891  2L  23,191,393  23,209,482
22218_locus_27814  82.58  CUE_ANOGA  2L  41,013,267  41,025,452
2247_locus_1454  81.68  AgaP_AGAP004125  2R  50,398,489  50,399,671
2247_locus_1457  81.68  AgaP_AGAP004125  2R  50,398,489  50,399,671
2247_locus_1453  81.68  AgaP_AGAP004125  2R  50,398,489  50,399,671
22608_locus_45125  75.22  AgaP_AGAP007167  2L  43,756,355  43,757,437
22730_locus_31706  81.57  AgaP_AGAP005897  2L  23,352,311  23,357,446
23109_locus_36486  95.21  AgaP_AGAP009995  3R  47,378,262  47,382,240
2404_locus_25119  92.61  AgaP_AGAP007131  2L  43,139,088  43,142,599
24668_locus_18804  88.76  AgaP_AGAP009750  3R  42,351,341  42,356,155
24866_locus_1751  75.68  AgaP_AGAP002041  2R  14,312,602  14,358,238
24946_locus_3751  86.64  KIBRLG  X  582  16,387
25003_locus_21234  94.79  AgaP_AGAP006328  2L  29,747,604  29,774,523
25215_locus_37539  85.95  AgaP_AGAP002578  2R  23,163,895  23,198,838
25242_locus_32929  89.38  AgaP_AGAP006824  2L  38,955,405  38,962,892
25933_locus_16326  89.48  AgaP_AGAP005672  2L  18,535,500  18,542,604
2602_locus_23472  77.22  AgaP_AGAP004340  2R  54,835,347  54,837,595
26248_locus_31885  72.39  AgaP_AGAP008668  3R  14,644,361  14,646,193
26261_locus_8457  87.04  AgaP_AGAP010711  3L  8,901,319  8,904,939
26265_locus_24661  88.41  GPROP1  2R  735,912  737,457
26306_locus_8777  72.4  AgaP_AGAP009004  3R  23,176,726  23,183,401
270_locus_14436  76.43  AgaP_AGAP001194  2R  1,215,908  1,232,126
3104_locus_9756  78.83  AgaP_AGAP013524  2R  50,689,842  50,692,141
3341_locus_10484  81.23  AgaP_AGAP004284  2R  53,882,841  53,884,852
3353_locus_15814  81.71  GPRMAC2  2R  61,414,929  61,417,946
3436_locus_1188  83.74  AgaP_AGAP011206  3L  19,377,155  19,379,051
3444_locus_17974  89.92  AgaP_AGAP010093  3R  48,863,994  48,868,239
3600_locus_38955  83.01  AgaP_AGAP001696  2R  8,745,208  8,775,657
3768_locus_17957  90.65  AgaP_AGAP004066  2R  49,296,241  49,374,964
3889_locus_1598  75.85  putative Ropn1l-like protein
4069_locus_26328  86.36  AgaP_AGAP006349  2L  30,343,651  30,358,560
4077_locus_1809  77.2  AgaP_AGAP003151  2R  33,381,051  33,383,831
4093_locus_6494  91.1  AgaP_AGAP003597  2R  40,392,844  40,401,314
4139_locus_877  78.9  pyrokinin-like receptor 1 (APR-1)
4195_locus_3917  91.7  AgaP_AGAP006141  2L  27,094,572  27,101,952
4196_locus_22494  100  AgaP_AGAP005164  2L  11,054,076  11,119,385
4240_locus_4573  73.15  AgaP_AGAP003524  2R  39,125,743  39,129,201
4283_locus_34536  81.95  AgaP_AGAP000534  X  9,555,155  9,565,371
4442_locus_10863  81  AgaP_AGAP007924  3R  2,856,038  2,897,703
4449_locus_6243  80.68  AgaP_AGAP002945  2R  30,000,134  30,006,998
469_locus_17042  89.18  AgaP_AGAP009006  3R  23,234,262  23,241,748
4788_locus_15211  82.3  AgaP_AGAP003858  2R  44,651,654  44,653,292
4862_locus_4346  84.01  AgaP_AGAP006324  2L  29,584,913  29,588,492
4868_locus_33535  80.43  AgaP_AGAP001014  X  19,408,610  19,409,948
4881_locus_14897  75.02  AgaP_AGAP003715  2R  42,403,550  42,408,724
4999_locus_308  86.83  AgaP_AGAP002272  2R  18,416,769  18,456,860
5238_locus_29447  77.02  AgaP_AGAP011138  3L  17,798,459  17,819,055
5309_locus_4754  85.33  AgaP_AGAP007107  2L  42,819,885  42,821,706
5577_locus_24150  82  AgaP_AGAP007904  3R  2,605,246  2,606,918
5619_locus_26405  76.8  AgaP_AGAP003043  2R  31,405,640  31,418,582
5632_locus_28504  93.87  AgaP_AGAP011111  3L  17,034,389  17,057,435
5677_locus_24381  83.56  AgaP_AGAP011084  3L  16,351,133  16,353,867
5677_locus_21489  83.56  AgaP_AGAP011084  3L  16,351,133  16,353,867
5999_locus_36218  83.36  AgaP_AGAP004573  2R  57,711,826  57,725,049
6070_locus_38720  80.65  AgaP_AGAP011661  3L  31,043,003  31,046,325
6116_locus_44440  95.7  AgaP_AGAP007996  3R  3,941,700  3,968,468
6214_locus_33399  96.3  AgaP_AGAP009310  3R  31,160,369  31,163,214
6303_locus_32942  74.78  AgaP_AGAP000488  X  8,579,493  8,602,534
6324_locus_39987  97.37  AgaP_AGAP002971  2R  30,267,141  30,294,999
6455_locus_15954  84.35  AgaP_AGAP007069  2L  42,165,842  42,176,356
6479_locus_1899  81.96  AgaP_AGAP005975  2L  24,348,685  24,349,738
6613_locus_30504  88.59  AgaP_AGAP002325  2R  19,610,696  19,658,751
6826_locus_45851  81.31  phenoloxidase inhibitor protein
7090_locus_28587  79.48  AgaP_AGAP013482  2R  52,002,707  52,006,145
7185_locus_2195  85.28  AgaP_AGAP005901  2L  23,470,505  23,484,056
7354_locus_40387  78.25  AgaP_AGAP013217  2R  40,060,839  40,063,191
7465_locus_18799  74.52  AgaP_AGAP004322  2R  54,458,643  54,461,050
7477_locus_26746  82.41  AgaP_AGAP004772  2L  3,348,417  3,353,159
7583_locus_45003  89.34  AgaP_AGAP010139  3R  49,356,476  49,362,525
7759_locus_35220  91.44  AgaP_AGAP000029  X  379,162  393,442
7759_locus_46363  91.44  AgaP_AGAP000029  X  379,162  393,442
7808_locus_15118  81.64  AgaP_AGAP008848  3R  19,395,968  19,401,441
8065_locus_15544  78.28  AgaP_AGAP004782  2L  3,469,348  3,473,011
8404_locus_20915  73.26  AgaP_AGAP006092  2L  26,586,883  26,589,045
8407_locus_42079  86.75  AgaP_AGAP008783  3R  17,809,578  17,810,630
8436_locus_33737  78.61  AgaP_AGAP009688  3R  38,705,707  38,708,285
8647_locus_19615  82.07  AgaP_AGAP006042  2L  25,557,922  25,567,942
8656_locus_1331  85.31  AgaP_AGAP000686  X  12,174,016  12,182,557
8687_locus_27012  77.07  AgaP_AGAP008770  3R  17,382,453  17,386,996
875_locus_12606  88.85  AgaP_AGAP011134  3L  17,742,761  17,775,344
8779_locus_41283  91.78  AgaP_AGAP000687  X  12,281,540  12,300,807
8953_locus_45814  75.23  AgaP_AGAP008779  3R  17,727,806  17,728,715
908_locus_6264  77.74  AgaP_AGAP006624  2L  34,768,062  34,771,115
9167_locus_25992  83.54  AgaP_AGAP008353  3R  9,738,104  9,740,908
9360_locus_38667  85.59  AgaP_AGAP013283  X  9,538,068  9,538,788
9385_locus_8313  83.2  AgaP_AGAP002752  2R  26,670,074  26,672,699
9411_locus_6073  81.61  AgaP_AGAP004408  2R  55,813,788  55,816,172
9440_locus_42025  82.26  AgaP_AGAP002250  2R  18,103,826  18,178,600
9586_locus_10797  86.13  AgaP_AGAP001267  2R  2,125,310  2,132,140
9638_locus_9171  81.89  AgaP_AGAP004608  2R  58,346,783  58,354,002
9780_locus_8912  72  putative odorant receptor Or2

X. Appendix E

Primers used for SNP amplification and barcoding amplicons for high-throughput sequencing

Appendix E.1. Primers used to amplify polymorphisms in An. punctulatus s.s.

Primer name (contig, SNP location)  Primer sequence
AP_10286_4120F  CTTTCCCTACACGACGCTCTTCCGATCTAACTTGGCGCAGGACCTC
AP_10286_4120R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTACAAGGGCGCAACGAGAA
AP_10537_795F  CTTTCCCTACACGACGCTCTTCCGATCTCTCACGATGACTGGGGCG
AP_10537_795R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCTCCAAAGCCAGTGCGG
AP_10624_4229F  CTTTCCCTACACGACGCTCTTCCGATCTTTGCTAGTCGGCGTGGTG
AP_10624_4229R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCAACACAAGCGCCCACAC
AP_11191_5165F  CTTTCCCTACACGACGCTCTTCCGATCTAACCAGCGCGACCAACTT
AP_11191_5165R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGCGCGCAGGATAGCATT
AP_11545_3603F  CTTTCCCTACACGACGCTCTTCCGATCTAAACCAGCCCGGCATGAA
AP_11545_3603R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGTGGCGGCTGAGATTGA
AP_11607_4282F  CTTTCCCTACACGACGCTCTTCCGATCTCCTGCCACGGGAATACGG
AP_11607_4282R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCGCTCGCCAGTTCAAAG
AP_12969_4657F  CTTTCCCTACACGACGCTCTTCCGATCTTTCCGATGGGTTTGGGCC
AP_12969_4657R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGTGCGACCAACCGAGTG
AP_12988_3197F  CTTTCCCTACACGACGCTCTTCCGATCTGTTTGTGTCCCGCGATGC
AP_12988_3197R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGTCGGGCTGTTCATGGT
AP_13254_3960F  CTTTCCCTACACGACGCTCTTCCGATCTAGCTGGCATGATGGACGC
AP_13254_3960R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGCACGTTTTAGCCGACG
AP_13374_637F  CTTTCCCTACACGACGCTCTTCCGATCTCGCGCAGTAGGCAGATGA
AP_13374_637R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCTTCAGAACCGGCCCAGG
AP_13457_8350F  CTTTCCCTACACGACGCTCTTCCGATCTGCTTTCATGGCACGGCAG
AP_13457_8350R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGTGTTTGCAGCGTGCTT
AP_10419_4955F  CTTTCCCTACACGACGCTCTTCCGATCTGAGTTCCACGATCCAGCGT
AP_10419_4955R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCAAAAGGTCCCGCGTTC
AP_10708_1741F  CTTTCCCTACACGACGCTCTTCCGATCTCGCACTTTTACACCGCCG
AP_10708_1741R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCCAGTCCTGCAACGGGTT
AP_10742_1311F  CTTTCCCTACACGACGCTCTTCCGATCTAACACCTCGTACGCCGTG
AP_10742_1311R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGGTGGGATAGCAGGCAG
AP_11015_12327F  CTTTCCCTACACGACGCTCTTCCGATCTCGATGGGATTGAAGCACATGT
AP_11015_12327R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCACGTGTTGGAAAAGGTTGGT
AP_11207_2362F  CTTTCCCTACACGACGCTCTTCCGATCTTCATGCCCCAACCGGAAC
AP_11207_2362R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTTGCATACCTCTCGGGCG
AP_11263_11245F  CTTTCCCTACACGACGCTCTTCCGATCTGGCGGTGTGTGACTTCGA
AP_11263_11245R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGGTTGGCTTCGTGGTCGA
AP_1142_11555F  CTTTCCCTACACGACGCTCTTCCGATCTGGGAACCGGTCGCGTTAA
AP_1142_11555R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCACTCGATCACGTGCGG
AP_11465_4040F  CTTTCCCTACACGACGCTCTTCCGATCTCAGTGTTGTCGCATGCGG
AP_11465_4040R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGCCCGGCCAAAAAGGAT
AP_11571_9122F  CTTTCCCTACACGACGCTCTTCCGATCTGTGAGAAGCTCGTGCGGT
AP_11571_9122R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGGGAACGGTCGCGTTTT
AP_11776_5771F  CTTTCCCTACACGACGCTCTTCCGATCTTGGTCCAAGGGCAGTGGA
AP_11776_5771R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCAGCCTCTTCACGCCACA
AP_11783_1975F  CTTTCCCTACACGACGCTCTTCCGATCTGGGCGCTACGGAGATGTC
AP_11783_1975R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGTTTGTCCAGCGAACGTG
AP_10062_3285F  CTTTCCCTACACGACGCTCTTCCGATCTCCGGCATCTGAAGGCGAA
AP_10062_3285R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGTTTGACGTACTGGTGCGG
AP_10363_6785F  CTTTCCCTACACGACGCTCTTCCGATCTCTGTGCAGCTTGTAATGGAAA
AP_10363_6785R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCTTGTTCTAATTGTTGGGCA
AP_10366_2456F  CTTTCCCTACACGACGCTCTTCCGATCTGGGACCAATAAGGCCGGG
AP_10366_2456R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCAGGTGCTAGCCAAAACC
AP_10427_2945F  CTTTCCCTACACGACGCTCTTCCGATCTGATCCCGGCACTCAGAGC
AP_10427_2945R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCCTCTGCCACCGCAAGAA
AP_1090_13920F  CTTTCCCTACACGACGCTCTTCCGATCTTAGCACCAACACCGGCAG
AP_1090_13920R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTACCGACGGTCAGCTGTTG
AP_1134_5902F  CTTTCCCTACACGACGCTCTTCCGATCTATCATTTCCGGCCGCCAT
AP_1134_5902R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGTCGTAACACTGCGTACCG
AP_11513_5724F  CTTTCCCTACACGACGCTCTTCCGATCTGTCTTCAACCGCCCACGA
AP_11513_5724R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGCCGCATTCCATTTCCG
AP_12142_4198F  CTTTCCCTACACGACGCTCTTCCGATCTCCCGTAGAACCGCCGAAT
AP_12142_4198R  GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCGGTTCCACGCTTGCAT
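Every forward primer in Appendix E.1 begins with the same 28-nt sequence (CTTTCCCTACACGACGCTCTTCCGATCT) and every reverse primer with the same 34-nt sequence (GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT). The following sketch separates that shared 5' tail from the remainder of a primer. It assumes that the remainder is the contig-specific portion and that the F or R suffix of the primer name gives the orientation; the function name is hypothetical and is not part of the original analysis.

```python
FORWARD_TAIL = "CTTTCCCTACACGACGCTCTTCCGATCT"
REVERSE_TAIL = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"


def split_primer(name, sequence):
    """Return (shared_tail, remainder) for a primer listed in Appendix E.1."""
    # Assumption: names ending in "F" are forward primers, "R" reverse primers.
    if name.endswith("F"):
        tail = FORWARD_TAIL
    elif name.endswith("R"):
        tail = REVERSE_TAIL
    else:
        raise ValueError(f"cannot infer orientation from name: {name}")
    if not sequence.startswith(tail):
        raise ValueError(f"{name} does not start with the expected tail")
    return tail, sequence[len(tail):]


tail, specific = split_primer(
    "AP_10286_4120F",
    "CTTTCCCTACACGACGCTCTTCCGATCTAACTTGGCGCAGGACCTC",
)
```

A check of this kind is mainly useful for catching transcription errors when a primer list is copied between documents.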

Appendix E.2. Primers used to add barcode and Illumina adaptor sequences for high-throughput sequencing. Primer PCR_Pr1 is a universal reverse primer; all other primers contain a specific barcode sequence.

Name  Primer Sequence
PCR_Pr1  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGAC
PCR_DS1  CAAGCAGAAGACGGCATACGAGATAACCGCGTGACTGGAGTTC
PCR_DS2  CAAGCAGAAGACGGCATACGAGATAACGCCGTGACTGGAGTTC
PCR_DS3  CAAGCAGAAGACGGCATACGAGATAAGCGGGTGACTGGAGTTC
PCR_DS4  CAAGCAGAAGACGGCATACGAGATAAGGCGGTGACTGGAGTTC
PCR_DS5  CAAGCAGAAGACGGCATACGAGATACACAGGTGACTGGAGTTC
PCR_DS6  CAAGCAGAAGACGGCATACGAGATACACTCGTGACTGGAGTTC
PCR_DS7  CAAGCAGAAGACGGCATACGAGATACAGACGTGACTGGAGTTC
PCR_DS8  CAAGCAGAAGACGGCATACGAGATACAGTGGTGACTGGAGTTC
PCR_DS9  CAAGCAGAAGACGGCATACGAGATACCACTGTGACTGGAGTTC
PCR_DS10  CAAGCAGAAGACGGCATACGAGATACCAGAGTGACTGGAGTTC
PCR_DS11  CAAGCAGAAGACGGCATACGAGATACCTCAGTGACTGGAGTTC
PCR_DS12  CAAGCAGAAGACGGCATACGAGATACCTGTGTGACTGGAGTTC
PCR_DS13  CAAGCAGAAGACGGCATACGAGATACGACAGTGACTGGAGTTC
PCR_DS14  CAAGCAGAAGACGGCATACGAGATACGAGTGTGACTGGAGTTC
PCR_DS15  CAAGCAGAAGACGGCATACGAGATACGTCTGTGACTGGAGTTC
PCR_DS16  CAAGCAGAAGACGGCATACGAGATACGTGAGTGACTGGAGTTC
PCR_DS17  CAAGCAGAAGACGGCATACGAGATACTCACGTGACTGGAGTTC
PCR_DS18  CAAGCAGAAGACGGCATACGAGATACTCTGGTGACTGGAGTTC
PCR_DS19  CAAGCAGAAGACGGCATACGAGATACTGAGGTGACTGGAGTTC
PCR_DS20  CAAGCAGAAGACGGCATACGAGATACTGTCGTGACTGGAGTTC
PCR_DS21  CAAGCAGAAGACGGCATACGAGATAGACACGTGACTGGAGTTC
PCR_DS22  CAAGCAGAAGACGGCATACGAGATAGACTGGTGACTGGAGTTC
PCR_DS23  CAAGCAGAAGACGGCATACGAGATAGAGAGGTGACTGGAGTTC
PCR_DS24  CAAGCAGAAGACGGCATACGAGATAGAGTCGTGACTGGAGTTC
PCR_DS25  CAAGCAGAAGACGGCATACGAGATAGCACAGTGACTGGAGTTC
PCR_DS26  CAAGCAGAAGACGGCATACGAGATAGCAGTGTGACTGGAGTTC
PCR_DS27  CAAGCAGAAGACGGCATACGAGATAGCTCTGTGACTGGAGTTC
PCR_DS28  CAAGCAGAAGACGGCATACGAGATAGCTGAGTGACTGGAGTTC
PCR_DS29  CAAGCAGAAGACGGCATACGAGATAGGACTGTGACTGGAGTTC
PCR_DS30  CAAGCAGAAGACGGCATACGAGATAGGAGAGTGACTGGAGTTC
PCR_DS31  CAAGCAGAAGACGGCATACGAGATAGGTCAGTGACTGGAGTTC
PCR_DS32  CAAGCAGAAGACGGCATACGAGATAGGTGTGTGACTGGAGTTC
PCR_DS33  CAAGCAGAAGACGGCATACGAGATAGTCAGGTGACTGGAGTTC
PCR_DS34  CAAGCAGAAGACGGCATACGAGATAGTCTCGTGACTGGAGTTC
PCR_DS35  CAAGCAGAAGACGGCATACGAGATAGTGACGTGACTGGAGTTC
PCR_DS36  CAAGCAGAAGACGGCATACGAGATAGTGTGGTGACTGGAGTTC
PCR_DS37  CAAGCAGAAGACGGCATACGAGATATCCGGGTGACTGGAGTTC
PCR_DS38  CAAGCAGAAGACGGCATACGAGATATCGCGGTGACTGGAGTTC
PCR_DS39  CAAGCAGAAGACGGCATACGAGATATCGGCGTGACTGGAGTTC
PCR_DS40  CAAGCAGAAGACGGCATACGAGATATGCCGGTGACTGGAGTTC
PCR_DS41  CAAGCAGAAGACGGCATACGAGATATGCGCGTGACTGGAGTTC
PCR_DS42  CAAGCAGAAGACGGCATACGAGATATGGCCGTGACTGGAGTTC
PCR_DS43  CAAGCAGAAGACGGCATACGAGATCAACCTGTGACTGGAGTTC
PCR_DS44  CAAGCAGAAGACGGCATACGAGATCAACGAGTGACTGGAGTTC
PCR_DS45  CAAGCAGAAGACGGCATACGAGATCAAGCAGTGACTGGAGTTC
PCR_DS46  CAAGCAGAAGACGGCATACGAGATCAAGGTGTGACTGGAGTTC
PCR_DS47  CAAGCAGAAGACGGCATACGAGATCACAAGGTGACTGGAGTTC
PCR_DS48  CAAGCAGAAGACGGCATACGAGATCACATCGTGACTGGAGTTC
PCR_DS49  CAAGCAGAAGACGGCATACGAGATCACTACGTGACTGGAGTTC
PCR_DS50  CAAGCAGAAGACGGCATACGAGATCACTTGGTGACTGGAGTTC
PCR_DS51  CAAGCAGAAGACGGCATACGAGATCAGAACGTGACTGGAGTTC
PCR_DS52  CAAGCAGAAGACGGCATACGAGATCAGATGGTGACTGGAGTTC
PCR_DS53  CAAGCAGAAGACGGCATACGAGATCAGTAGGTGACTGGAGTTC
PCR_DS54  CAAGCAGAAGACGGCATACGAGATCAGTTCGTGACTGGAGTTC
PCR_DS55  CAAGCAGAAGACGGCATACGAGATCATCCAGTGACTGGAGTTC
PCR_DS56  CAAGCAGAAGACGGCATACGAGATCATCGTGTGACTGGAGTTC
PCR_DS57  CAAGCAGAAGACGGCATACGAGATCATGCTGTGACTGGAGTTC
PCR_DS58  CAAGCAGAAGACGGCATACGAGATCATGGAGTGACTGGAGTTC
PCR_DS59  CAAGCAGAAGACGGCATACGAGATCCAACGGTGACTGGAGTTC
PCR_DS60  CAAGCAGAAGACGGCATACGAGATCCAAGCGTGACTGGAGTTC
PCR_DS61  CAAGCAGAAGACGGCATACGAGATCCATCCGTGACTGGAGTTC
PCR_DS62  CAAGCAGAAGACGGCATACGAGATCCATGGGTGACTGGAGTTC
PCR_DS63  CAAGCAGAAGACGGCATACGAGATCCGCAAGTGACTGGAGTTC
PCR_DS64  CAAGCAGAAGACGGCATACGAGATCCGCTTGTGACTGGAGTTC
PCR_DS65  CAAGCAGAAGACGGCATACGAGATCCGGATGTGACTGGAGTTC
PCR_DS66  CAAGCAGAAGACGGCATACGAGATCCGGTAGTGACTGGAGTTC
PCR_DS67  CAAGCAGAAGACGGCATACGAGATCCTACCGTGACTGGAGTTC
PCR_DS68  CAAGCAGAAGACGGCATACGAGATCCTAGGGTGACTGGAGTTC
PCR_DS69  CAAGCAGAAGACGGCATACGAGATCCTTCGGTGACTGGAGTTC
PCR_DS70  CAAGCAGAAGACGGCATACGAGATCCTTGCGTGACTGGAGTTC
PCR_DS71  CAAGCAGAAGACGGCATACGAGATCGAACCGTGACTGGAGTTC
PCR_DS72  CAAGCAGAAGACGGCATACGAGATCGAAGGGTGACTGGAGTTC
PCR_DS73  CAAGCAGAAGACGGCATACGAGATCGATCGGTGACTGGAGTTC
PCR_DS74  CAAGCAGAAGACGGCATACGAGATCGATGCGTGACTGGAGTTC
PCR_DS75  CAAGCAGAAGACGGCATACGAGATCGCCAAGTGACTGGAGTTC
PCR_DS76  CAAGCAGAAGACGGCATACGAGATCGCCTTGTGACTGGAGTTC
PCR_DS77  CAAGCAGAAGACGGCATACGAGATCGCGATGTGACTGGAGTTC
PCR_DS78  CAAGCAGAAGACGGCATACGAGATCGCGTAGTGACTGGAGTTC
PCR_DS79  CAAGCAGAAGACGGCATACGAGATCGGCATGTGACTGGAGTTC
PCR_DS80  CAAGCAGAAGACGGCATACGAGATCGGCTAGTGACTGGAGTTC
PCR_DS81  CAAGCAGAAGACGGCATACGAGATCGTACGGTGACTGGAGTTC
PCR_DS82  CAAGCAGAAGACGGCATACGAGATCGTAGCGTGACTGGAGTTC
PCR_DS83  CAAGCAGAAGACGGCATACGAGATCGTTCCGTGACTGGAGTTC
PCR_DS84  CAAGCAGAAGACGGCATACGAGATCGTTGGGTGACTGGAGTTC
PCR_DS85  CAAGCAGAAGACGGCATACGAGATCTACCAGTGACTGGAGTTC
PCR_DS86  CAAGCAGAAGACGGCATACGAGATCTACGTGTGACTGGAGTTC
PCR_DS87  CAAGCAGAAGACGGCATACGAGATCTAGCTGTGACTGGAGTTC
PCR_DS88  CAAGCAGAAGACGGCATACGAGATCTAGGAGTGACTGGAGTTC
PCR_DS89  CAAGCAGAAGACGGCATACGAGATCTCAACGTGACTGGAGTTC
PCR_DS90  CAAGCAGAAGACGGCATACGAGATCTCATGGTGACTGGAGTTC
PCR_DS91  CAAGCAGAAGACGGCATACGAGATCTCTAGGTGACTGGAGTTC
PCR_DS92  CAAGCAGAAGACGGCATACGAGATCTCTTCGTGACTGGAGTTC
PCR_DS93  CAAGCAGAAGACGGCATACGAGATCTGAAGGTGACTGGAGTTC
PCR_DS94  CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTC
PCR_DS95  CAAGCAGAAGACGGCATACGAGATCTGTACGTGACTGGAGTTC
PCR_DS96  CAAGCAGAAGACGGCATACGAGATCTGTTGGTGACTGGAGTTC
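In the PCR_DS primers of Appendix E.2, a 6-nt barcode sits between two constant flanks (CAAGCAGAAGACGGCATACGAGAT upstream and GTGACTGGAGTTC downstream). As a minimal sketch, the code below extracts the barcode and computes the smallest pairwise Hamming distance within a set of barcodes, which indicates how many sequencing errors would be needed to confuse two samples. The function names are hypothetical, and only the first three primers are used as an example.

```python
from itertools import combinations

UPSTREAM = "CAAGCAGAAGACGGCATACGAGAT"
DOWNSTREAM = "GTGACTGGAGTTC"


def extract_barcode(primer):
    """Return the bases between the two constant flanks of a PCR_DS primer."""
    if not (primer.startswith(UPSTREAM) and primer.endswith(DOWNSTREAM)):
        raise ValueError("primer does not carry the expected constant flanks")
    return primer[len(UPSTREAM):len(primer) - len(DOWNSTREAM)]


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance over all pairs of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# PCR_DS1 to PCR_DS3, copied from Appendix E.2.
primers = [
    "CAAGCAGAAGACGGCATACGAGATAACCGCGTGACTGGAGTTC",
    "CAAGCAGAAGACGGCATACGAGATAACGCCGTGACTGGAGTTC",
    "CAAGCAGAAGACGGCATACGAGATAAGCGGGTGACTGGAGTTC",
]
barcodes = [extract_barcode(p) for p in primers]
```

Running `min_pairwise_distance` over all 96 barcodes would show whether the full set tolerates a single substitution error without ambiguity.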

XI. Appendix F

Composition of AP sibling species blood meals by village and AP species

Appendix F.1. Composition of blood meal for mosquitoes collected in the village of Dimer for (A) An. farauti s.s. (n=1), (B) An. koliensis (n=5) and (C) An. punctulatus s.s. (n=32). Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching the host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.
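The bar heights described in this caption are proportions of reads per host. As a minimal sketch of that calculation, the code below converts per-host read counts for one mosquito into proportions. The function name and the read counts are hypothetical and are not taken from the data shown in the figure.

```python
def host_proportions(read_counts):
    """Convert per-host read counts for one mosquito into proportions."""
    total = sum(read_counts.values())
    if total == 0:
        # No reads matched any host: report zero for every host.
        return {host: 0.0 for host in read_counts}
    return {host: count / total for host, count in read_counts.items()}


# Hypothetical counts for a single mixed blood meal.
example_counts = {"human": 750, "pig": 200, "dog": 50}
proportions = host_proportions(example_counts)
```

Each resulting dictionary corresponds to one stacked bar, with one segment per host species.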

Appendix F.2. Composition of blood meal for mosquitoes collected in the village of Wasab for (A) An. koliensis (n=8) and (B) An. punctulatus s.s. (n=37). Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching the host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.

Appendix F.3. Composition of blood meal for mosquitoes collected in the village of Kokofine for (A) An. farauti 4 (n=60) and (B) An. koliensis (n=2). Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching the host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.

Appendix F.4. Composition of blood meal for mosquitoes collected in the village of Matukar for (A) An. farauti s.s. (n=46) and (B) An. punctulatus s.s. (n=1). Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching the host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.

Appendix F.5. Composition of blood meal for mosquitoes collected in the village of Mirap for (A) An. farauti s.s. (n=125), (B) An. koliensis (n=1) and (C) An. punctulatus s.s. (n=1). Each vertical bar shows the composition of the blood meal for one mosquito: each color represents a different host species and the height of each stacked bar corresponds to the proportion of reads matching the host DNA sequence. Grey corresponds to human DNA, turquoise to pig, blue to dog, white to mouse, red to bat and orange to cuscus.

XII. References

(2007) Is malaria eradication possible? Lancet 370, 1459.
Abascal F, Zardoya R, Telford MJ (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res 38, W7-13.
Allan BF, Goessling LS, Storch GA, Thach RE (2010) Blood meal analysis to identify reservoir hosts for Amblyomma americanum ticks. Emerg Infect Dis 16, 433-440.
Alout H, Weill M (2008) Amino-acid substitutions in acetylcholinesterase 1 involved in insecticide resistance in mosquitoes. Chem Biol Interact 175, 138-141.
Alout H, Yameogo B, Djogbenou LS, et al. (2014) Interplay between Plasmodium infection and resistance to insecticides in vector mosquitoes. J Infect Dis 210, 1464-1470.
Ambrose L, Riginos C, Cooper RD, et al. (2012) Population structure, mitochondrial polyphyly and the repeated loss of human biting ability in anopheline mosquitoes from the southwest Pacific. Mol Ecol 21, 4327-4343.
Ansell J, Hamilton KA, Pinder M, Walraven GE, Lindsay SW (2002) Short-range attractiveness of pregnant women to Anopheles gambiae mosquitoes. Trans R Soc Trop Med Hyg 96, 113-116.
Avery J (1974) A Review of the Malaria Eradication Programme in the British Solomon Islands 1970-1972. Papua New Guinea Medical Journal 17, 50-60.
Backhouse T (1934) Anopheles punctulatus as an experimental intermediate host of Wuchereria bancrofti. Trans R Soc Trop Med Hyg 27, 365-370.
Baker RH, French WL, Kitzmiller JB (1962) Mosquito News 22.
Barbosa S, Black WCt, Hastings I (2011) Challenges in estimating insecticide selection pressures from mosquito field data. PLoS Negl Trop Dis 5, e1387.
Beard CB, Hamm DM, Collins FH (1993) The mitochondrial genome of the mosquito Anopheles gambiae: DNA sequence, genome organization, and comparisons with mitochondrial sequences of other insects. Insect Mol Biol 2, 103-124.
Beebe N, Russell T, Burkot T, Lobo N, Cooper R (2013) The Systematics and Bionomics of Malaria Vectors in the Southwest Pacific.
Beebe NW, Bakote'e B, Ellis JT, Cooper RD (2000a) Differential ecology of Anopheles punctulatus and three members of the Anopheles farauti complex of mosquitoes on Guadalcanal, Solomon Islands, identified by PCR-RFLP analysis. Med Vet Entomol 14, 308-312.
Beebe NW, Cooper RD (2002) Distribution and evolution of the Anopheles punctulatus group (Diptera: Culicidae) in Australia and Papua New Guinea. Int J Parasitol 32, 563-574.
Beebe NW, Cooper RD, Morrison DA, Ellis JT (2000b) A phylogenetic study of the Anopheles punctulatus group of malaria vectors comparing rDNA sequence alignments derived from the mitochondrial and nuclear small ribosomal subunits. Mol Phylogenet Evol 17, 430-436.
Beebe NW, Ellis JT, Cooper RD, Saul A (1999) DNA sequence analysis of the ribosomal DNA ITS2 region for the Anopheles punctulatus group of mosquitoes. Insect Mol Biol 8, 381-390.
Beebe NW, Foley DH, Saul A, et al. (1994) DNA probes for identifying the members of the Anopheles punctulatus complex in Papua New Guinea. Am J Trop Med Hyg 50, 229-234.
Beebe NW, Russell T, Burkot TR, Cooper RD (2015) Anopheles punctulatus group: evolution, distribution, and control. Annu Rev Entomol 60, 335-350.
Beebe NW, Russell, Tanya L., Burkot, Thomas R., Lobo, Neil F. and Cooper, Robert D. (2013) The Systematics and Bionomics of Malaria Vectors in the Southwest Pacific. In: Anopheles mosquitoes - New insights into malaria vectors (ed. Manguin PS). InTech.
Beebe NW, Saul A (1995) Discrimination of all members of the Anopheles punctulatus complex by polymerase chain reaction-restriction fragment length polymorphism analysis. Am J Trop Med Hyg 53, 478-481.
Beier JC, Perkins PV, Wirtz RA, et al. (1988) Bloodmeal identification by direct enzyme-linked immunosorbent assay (ELISA), tested on Anopheles (Diptera: Culicidae) in Kenya. J Med Entomol 25, 9-16.
Belkin J (1962) The Mosquitoes of the South Pacific (Diptera: Culicidae). University of California Press, Berkeley, CA.
Benet A, Mai A, Bockarie F, et al. (2004) Polymerase chain reaction diagnosis and the changing pattern of vector ecology and malaria transmission dynamics in Papua New Guinea. Am J Trop Med Hyg 71, 277-284.
Beserra EB, de Castro FP, Jr., dos Santos JW, Santos Tda S, Fernandes CR (2006) [Biology and thermal exigency of Aedes aegypti (L.) (Diptera: Culicidae) from four bioclimatic localities of Paraiba]. Neotrop Entomol 35, 853-860.
Black R (1955) Malaria in the South-west Pacific. South Pacific Commission Technical Paper 81, 1-56.
Blanchette M, Kent WJ, Riemer C, et al. (2004) Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res 14, 708-715.
Blandin SA, Wang-Sattler R, Lamacchia M, et al. (2009) Dissecting the genetic basis of resistance to malaria parasites in Anopheles gambiae. Science 326, 147-150.
Bockarie M, Kazura J, Alexander N, et al. (1996) Transmission dynamics of Wuchereria bancrofti in East Sepik Province, Papua New Guinea. Am J Trop Med Hyg 54, 577-581.
Bockarie MJ, Kazura JW (2003) Lymphatic filariasis in Papua New Guinea: prospects for elimination. Med Microbiol Immunol 192, 9-14.
Bower JE, Dowton M, Cooper RD, Beebe NW (2008) Intraspecific concerted evolution of the rDNA ITS1 in Anopheles farauti sensu stricto (Diptera: Culicidae) reveals recent patterns of population structure. J Mol Evol 67, 397-411.
Brower AV (1994) Rapid morphological radiation and convergence among races of the butterfly Heliconius erato inferred from patterns of mitochondrial DNA evolution. Proc Natl Acad Sci U S A 91, 6491-6495.
Bryan JH (1973a) Studies on the Anopheles punctulatus complex. 3. Mating behaviour of the F1 hybrid adults from crosses between Anopheles farauti no. 1 and Anopheles farauti no. 2. Trans R Soc Trop Med Hyg 67.
Bryan JH (1973b) Studies on the Anopheles punctulatus complex. I. Identification by proboscis morphological criteria and by cross-mating experiments. Trans R Soc Trop Med Hyg 67, 64-69.
Bryan JH (1973c) Studies on the Anopheles punctulatus complex. II. Hybridization of the member species. Trans R Soc Trop Med Hyg 67, 70-84.
Bryan JH (1974) Morphological studies on Anopheles punctulatus Donitz complex. Transactions of the Royal Entomological Society London 125, 413-435.
Bryan JH (1986) Vectors of Wuchereria bancrofti in the Sepik Provinces of Papua New Guinea. Transactions of the Royal Society of Tropical Medicine and Hygiene 80, 123-131.
Bryan JH, Coluzzi M (1971) Cytogenetic observations on Anopheles farauti Laveran. Bull World Health Organ 45, 266-267.
Bugoro H, Cooper RD, Butafa C, et al. (2011) Bionomics of the malaria vector Anopheles farauti in Temotu Province, Solomon Islands: issues for malaria elimination. Malar J 10, 133.
Bugoro H, Hii JL, Butafa C, et al. (2014) The bionomics of the malaria vector Anopheles farauti in Northern Guadalcanal, Solomon Islands: issues for successful vector control. Malar J 13, 56.
Burkot TR, Graves PM, Paru R, Lagog M (1988) Mixed blood feeding by the malaria vectors in the Anopheles punctulatus complex (Diptera: Culicidae). J Med Entomol 25, 205-213.
Burkot TR, Russell TL, Reimer LJ, et al. (2013) Barrier screens: a method to sample blood-fed and host-seeking exophilic mosquitoes. Malar J 12, 49.
Carey AF, Wang G, Su CY, Zwiebel LJ, Carlson JR (2010) Odorant reception in the malaria mosquito Anopheles gambiae. Nature 464, 66-71.
Chandre F, Darriet F, Duchon S, et al. (2000) Modifications of pyrethroid effects associated with kdr mutation in Anopheles gambiae. Med Vet Entomol 14, 81-88.
Charlwood JD, Dagoro H, Paru R (1985) Blood-feeding and resting behaviour in the Anopheles punctulatus Donitz complex (Diptera: Culicidae) from coastal Papua New Guinea. Bulletin of Entomological Research 75, 463-476.
Charlwood JD, Graves PM, Alpers MP (1986) The ecology of the Anopheles punctulatus group of mosquitoes from Papua New Guinea: a review of recent work. P N G Med J 29, 19-26. Che Lah EF, Yaakop S, Ahamad M, Md Nor S (2015) Molecular identification of blood meal sources of ticks (Acari, Ixodidae) using cytochrome b gene as a genetic marker. Zookeys, 27-43. Chenna R, Sugawara H, Koike T, et al. (2003) Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31, 3497-3500. Chow-Shaffer E, Sina B, Hawley WA, De Benedictis J, Scott TW (2000) Laboratory and field evaluation of polymerase chain reaction-based forensic DNA profiling for use in identification of human blood meal sources of Aedes aegypti (Diptera: Culicidae). J Med Entomol 37, 492-502. Clarkson CS, Weetman D, Essandoh J, et al. (2014) Adaptive introgression between Anopheles sibling species eliminates a major genomic island but not reproductive isolation. Nat Commun 5, 4248.

208 Cockburn AF, Mitchell SE, Seawright JA (1990) Cloning of the mitochondrial genome of Anopheles quadrimaculatus. Arch Insect Biochem Physiol 14, 31-36. Cooper RD, Frances SP, Sweeney AW (1995) Distribution of members of the Anopheles farauti complex in the northern territory of Australia. J Am Mosq Control Assoc 11, 66-71. Cooper RD, Frances SP, Waterson DG, Piper RG, Sweeney AW (1996) Distribution of anopheline mosquitoes in northern Australia. J Am Mosq Control Assoc 12, 656-663. Cooper RD, Waterson DG, Bangs MJ, Beebe NW (2000) Rediscovery of Anopheles (Cellia) clowi (Diptera: Culicidae), a rarely recorded member of the Anopheles punctulatus group. J Med Entomol 37, 840-845. Cooper RD, Waterson DG, Frances SP, et al. (2009) Malaria vectors of Papua New Guinea. Int J Parasitol 39, 1495-1501. Cooper RD, Waterson DG, Frances SP, Beebe NW, Sweeney AW (2002) Speciation and distribution of the members of the Anopheles punctulatus (Diptera: Culicidae) group in Papua New Guinea. J Med Entomol 39, 16-27. Corbel V, N'Guessan R (2013) Distribution, mechanisms, impact, and management of insecticide resistance in malaria vectors:A pragmatic review. In: Anopheles mosquitoes - New insights into malaria vectors (ed. Manguin S). InTech. Crabtree MB, Kading RC, Mutebi JP, Lutwama JJ, Miller BR (2013) Identification of host blood from engorged mosquitoes collected in western Uganda using cytochrome oxidase I gene sequences. J Wildl Dis 49, 611-626. Crawford JE, Bischoff E, Garnier T, et al. (2012) Evidence for population-specific positive selection on immune genes of Anopheles gambiae. G3 (Bethesda) 2, 1505-1519. Cupp EW, Zhang D, Yue X, et al. (2004) Identification of reptilian and amphibian blood meals from mosquitoes in an eastern equine encephalomyelitis virus focus in central Alabama. Am J Trop Med Hyg 71, 272-276. Dabire KR, Diabate A, Namountougou M, et al. (2009) Distribution of pyrethroid and DDT resistance and the L1014F kdr mutation in Anopheles gambiae s.l. 
from Burkina Faso (West Africa). Trans R Soc Trop Med Hyg 103, 1113-1120. Darbro JM, Dhondt AA, Vermeylen FM, Harrington LC (2007) Mycoplasma gallisepticum infection in house finches (Carpodacus mexicanus) affects mosquito blood feeding patterns. Am J Trop Med Hyg 77, 488-494. De Benedictis J, Chow-Shaffer E, Costero A, et al. (2003) Identification of the people from whom engorged Aedes aegypti took blood meals in Florida, Puerto Rico, using polymerase chain reaction-based DNA profiling. Am J Trop Med Hyg 68, 437-446. Degnan JH, Rosenberg NA (2006) Discordance of species trees with their most likely gene trees. PLoS Genet 2, e68. della Torre A, Fanello C, Akogbeto M, et al. (2001) Molecular evidence of incipient speciation within Anopheles gambiae s.s. in West Africa. Insect Mol Biol 10, 9- 18. Diabate A, Baldet T, Chandre C, et al. (2003) KDR mutation, a genetic marker to assess events of introgression between the molecular M and S forms of

209 Anopheles gambiae (Diptera: Culicidae) in the tropical savannah area of West Africa. J Med Entomol 40, 195-198. Diabate A, Brengues C, Baldet T, et al. (2004) The spread of the Leu-Phe kdr mutation through Anopheles gambiae complex in Burkina Faso: genetic introgression and de novo phenomena. Trop Med Int Health 9, 1267-1273. Dixit J, Arunyawat U, Huong NT, Das A (2014) Multilocus nuclear DNA markers reveal population structure and demography of Anopheles minimus. Mol Ecol. Djouaka RF, Bakare AA, Coulibaly ON, et al. (2008) Expression of the cytochrome P450s, CYP6P3 and CYP6M2 are significantly elevated in multiple pyrethroid resistant populations of Anopheles gambiae s.s. from Southern Benin and Nigeria. BMC Genomics 9, 538. Dontiz W (1901) Nachrichten aus dem Berliner Entomologischen Verein. Inskten Borse 18, 36-38. Drummond AJ, Rambaut A (2007) BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol Biol 7, 214. Duggan AT, Evans B, Friedlaender FR, et al. (2014) Maternal history of Oceania from complete mtDNA genomes: contrasting ancient diversity with recent homogenization due to the Austronesian expansion. Am J Hum Genet 94, 721- 733. Edi CV, Djogbenou L, Jenkins AM, et al. (2014) CYP6 P450 enzymes and ACE-1 duplication produce extreme and multiple insecticide resistance in the malaria mosquito Anopheles gambiae. PLoS Genet 10, e1004236. Ellegren H, Smeds L, Burri R, et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature 491, 756-760. Erickson SM, Thomsen EK, Keven JB, et al. (2013) Mosquito-parasite interactions can shape filariasis transmission dynamics and impact elimination programs. PLoS Negl Trop Dis 7, e2433. Etang J, Vicente JL, Nwane P, et al. (2009) Polymorphism of intron-1 in the voltage- gated sodium channel gene of Anopheles gambiae s.s. populations from Cameroon with emphasis on insecticide knockdown resistance mutations. Mol Ecol 18, 3076-3086. 
Excoffier L, Laval G, Schneider S (2005) Arlequin (version 3.0): an integrated software package for population genetics data analysis. Evol Bioinform Online 1, 47-50. Favia G, della Torre A, Bagayoko M, et al. (1997) Molecular identification of sympatric chromosomal forms of Anopheles gambiae and further evidence of their reproductive isolation. Insect Mol Biol 6, 377-383. Felix RC, Muller P, Ribeiro V, Ranson H, Silveira H (2010) Plasmodium infection alters Anopheles gambiae detoxification gene expression. BMC Genomics 11, 312. Foley DH, Bryan JH (2000) Shared salinity tolerance invalidates a test for the malaria vector Anopheles farauti s.s. on Guadalcanal, Solomon Islands. Med Vet Entomol 14, 102-104. Foley DH, Bryan JH, Yeates D, Saul A (1998) Evolution and systematics of Anopheles: insights from a molecular phylogeny of Australasian mosquitoes. Mol Phylogenet Evol 9, 262-275.

210 Foley DH, Cooper RD, Bryan JH (1995) A new species within the Anopheles punctulatus complex in Western Province, Papua New Guinea. J Am Mosq Control Assoc 11, 122-127. Foley DH, Meek SR, Bryan JH (1994) The Anopheles punctulatus group of mosquitoes in the Solomon Islands and Vanuatu surveyed by allozyme electrophoresis. Med Vet Entomol 8, 340-350. Foley DH, Paru R, Dagoro H, Bryan JH (1993) Allozyme analysis reveals six species within the Anopheles punctulatus complex of mosquitoes in Papua New Guinea. Med Vet Entomol 7, 37-48. Fonseca-Gonzalez I, Cardenas R, Quinones ML, McAllister J, Brogdon WG (2009) Pyrethroid and organophosphates resistance in Anopheles (N.) nuneztovari Gabaldon populations from malaria endemic areas in Colombia. Parasitol Res 105, 1399-1409. Fontaine MC, Pease JB, Steele A, et al. (2015) Mosquito genomics. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347, 1258524. Fornadel CM, Norris LC, Glass GE, Norris DE (2010) Analysis of Anopheles arabiensis blood feeding behavior in southern Zambia during the two years after introduction of insecticide-treated bed nets. Am J Trop Med Hyg 83, 848-853. Fraiture M, Baxter RH, Steinert S, et al. (2009) Two mosquito LRR proteins function as complement control factors in the TEP1-mediated killing of Plasmodium. Cell Host Microbe 5, 273-284. Gaunt MW, Miles MA (2002) An insect molecular clock dates the origin of the insects and accords with palaeontological and biogeographic landmarks. Mol Biol Evol 19, 748-761. Genton B, al-Yaman F, Beck HP, et al. (1995) The epidemiology of malaria in the Wosera area, East Sepik Province, Papua New Guinea, in preparation for vaccine trials. I. Malariometric indices and immunity. Ann Trop Med Parasitol 89, 359-376. Gething PW, Elyazar IR, Moyes CL, et al. (2012) A long neglected world malaria map: Plasmodium vivax endemicity in 2010. PLoS Negl Trop Dis 6, e1814. Gething PW, Patil AP, Smith DL, et al. 
(2011) A new world malaria map: Plasmodium falciparum endemicity in 2010. Malar J 10, 378. Gokool S, Curtis CF, Smith DF (1993) Analysis of mosquito bloodmeals by DNA profiling. Med Vet Entomol 7, 208-215. Gokool S, Smith DF, Curtis CF (1992) The use of PCR to help quantify the protection provided by impregnated bednets. Parasitol Today 8, 347-350. Grabowsky M (2008) The billion-dollar malaria moment. Nature 451, 1051-1052. Graham SP, Hassan HK, Chapman T, et al. (2012) Serosurveillance of eastern equine encephalitis virus in amphibians and reptiles from Alabama, USA. Am J Trop Med Hyg 86, 540-544. Gratz NG (1999) Emerging and resurging vector-borne diseases. Annu Rev Entomol 44, 51-75. Graves PM, Brabin BJ, Charlwood JD, et al. (1987) Reduction in incidence and prevalence of Plasmodium falciparum in under-5-year-old children by

211 permethrin impregnation of mosquito nets. Bull World Health Organ 65, 869- 877. Guelbeogo WM, Grushko O, Boccolini D, et al. (2005) Chromosomal evidence of incipient speciation in the Afrotropical malaria mosquito Anopheles funestus. Med Vet Entomol 19, 458-469. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52, 696-704. Guo S, Zhang J, Sun H, et al. (2013) The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions. Nat Genet 45, 51-58. Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072-1075. Hall R (2002) Cenozoic geological and plate tectonic evolution of SE Asia and the SW Pacific: computer-based reconstructions, model and animations. Journal Asian Earth Science 20, 353-431. Hamer GL, Kitron UD, Brawn JD, et al. (2008) Culex pipiens (Diptera: Culicidae): a bridge vector of West Nile virus to humans. J Med Entomol 45, 125-128. Haouas N, Pesson B, Boudabous R, et al. (2007) Development of a molecular tool for the identification of reservoir hosts by blood meal analysis in the insect vectors. Am J Trop Med Hyg 77, 1054-1059. Haubold B, Pfaffelhuber P, Lynch M (2010) mlRho - a program for estimating the population mutation and recombination rates from shotgun-sequenced diploid genomes. Mol Ecol 19 Suppl 1, 277-284. Hay SI, Sinka ME, Okara RM, et al. (2010) Developing global maps of the dominant anopheles vectors of human malaria. PLoS Med 7, e1000209. Heliconius Genome C (2012) Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature 487, 94-98. Hemingway J, Ranson H (2000) Insecticide resistance in insect vectors of human disease. Annu Rev Entomol 45, 371-391. Henry-Halldin CN, Nadesakumaran K, Keven JB, et al. 
(2012) Multiplex assay for species identification and monitoring of insecticide resistance in Anopheles punctulatus group populations of Papua New Guinea. Am J Trop Med Hyg 86, 140-151. Henry-Halldin CN, Reimer L, Thomsen E, et al. (2011) High throughput multiplex assay for species identification of Papua New Guinea malaria vectors: members of the Anopheles punctulatus (Diptera: Culicidae) species group. Am J Trop Med Hyg 84, 166-173. Hetzel MW, Choudhury AA, Pulford J, et al. (2014) Progress in mosquito net coverage in Papua New Guinea. Malar J 13, 242. Hetzel MW, Gideon G, Lote N, et al. (2012) Ownership and usage of mosquito nets after four years of large-scale free distribution in Papua New Guinea. Malar J 11, 192. Heydon G (1923) Malaria at Rabaul. The Medical Journal of Australia 24, 626-632. Hii JL, Smith T, Vounatsou P, et al. (2001) Area effects of bednet use in a malaria- endemic area in Papua New Guinea. Trans R Soc Trop Med Hyg 95, 7-13. Holt RA, Subramanian GM, Halpern A, et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science 298, 129-149.

212 Hudson RR (2002) Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18, 337-338. Jiang X, Peery A, Hall AB, et al. (2014) Genome analysis of a major urban malaria vector mosquito, Anopheles stephensi. Genome Biol 15, 459. Jones CM, Liyanapathirana M, Agossa FR, et al. (2012) Footprints of positive selection associated with a mutation (N1575Y) in the voltage-gated sodium channel of Anopheles gambiae. Proc Natl Acad Sci U S A 109, 6614-6619. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30, 772-780. Kazura JW, Siba PM, Betuela I, Mueller I (2012) Research challenges and gaps in malaria knowledge in Papua New Guinea. Acta Trop 121, 274-280. Keightley PD, Ness RW, Halligan DL, Haddrill PR (2014) Estimation of the spontaneous mutation rate per nucleotide site in a Drosophila melanogaster full-sib family. Genetics 196, 313-320. Kelley DR, Schatz MC, Salzberg SL (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biol 11, R116. Kelly-Hope L, Ranson H, Hemingway J (2008) Lessons from the past: managing insecticide resistance in malaria control and eradication programmes. Lancet Infect Dis 8, 387-389. Kent RJ (2009) Molecular methods for bloodmeal identification and applications to ecological and vector-borne disease studies. Mol Ecol Resour 9, 4-18. Kent RJ, Norris DE (2005) Identification of mammalian blood meals in mosquitoes by a multiplexed polymerase chain reaction targeting cytochrome B. Am J Trop Med Hyg 73, 336-342. Kent WJ (2002) BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664. Keven JB, Henry-Halldin CN, Thomsen EK, et al. (2010) Pyrethroid susceptibility in natural populations of the Anopheles punctulatus group (Diptera: Culicidae) in Papua New Guinea. Am J Trop Med Hyg 83, 1259-1261. 
Kirstein F, Gray JS (1996) A molecular marker for the identification of the zoonotic reservoirs of Lyme borreliosis by analysis of the blood meal in its European vector Ixodes ricinus. Appl Environ Microbiol 62, 4060-4065. Kocher TD, Thomas WK, Meyer A, et al. (1989) Dynamics of mitochondrial DNA evolution in : amplification and sequencing with conserved primers. Proc Natl Acad Sci U S A 86, 6196-6200. Koella JC, Sorensen FL, Anderson RA (1998) The malaria parasite, Plasmodium falciparum, increases the frequency of multiple feeding of its mosquito vector, Anopheles gambiae. Proc Biol Sci 265, 763-768. Krzywinski J, Besansky NJ (2003) Molecular systematics of Anopheles: from subgenera to subpopulations. Annu Rev Entomol 48, 111-139. Krzywinski J, Grushko OG, Besansky NJ (2006) Analysis of the complete mitochondrial DNA from Anopheles funestus: an improved dipteran mitochondrial genome annotation and a temporal dimension of mosquito evolution. Mol Phylogenet Evol 39, 417-423.

213 Krzywinski J, Li C, Morris M, et al. (2011) Analysis of the evolutionary forces shaping mitochondrial genomes of a Neotropical malaria vector complex. Mol Phylogenet Evol 58, 469-477. Kurtz S, Phillippy A, Delcher AL, et al. (2004) Versatile and open software for comparing large genomes. Genome Biol 5, R12. Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-359. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10, R25. Lardeux F, Loayza P, Bouchite B, Chavez T (2007) Host choice and human blood index of Anopheles pseudopunctipennis in a village of the Andean valleys of Bolivia. Malar J 6, 8. Laveran N (1902) Sur les Culicides des Nouvelles-Hebrides. CR Soc Biol Paris 54, 908-910. Lawniczak MK, Emrich SJ, Holloway AK, et al. (2010) Widespread divergence between incipient Anopheles gambiae species revealed by whole genome sequences. Science 330, 512-514. Lee Y, Marsden CD, Nieman C, Lanzaro GC (2014) A new multiplex SNP genotyping assay for detecting hybridization and introgression between the M and S molecular forms of Anopheles gambiae. Mol Ecol Resour 14, 297-305. Lee Y, Marsden CD, Norris LC, et al. (2013) Spatiotemporal dynamics of gene flow and hybrid fitness between the M and S forms of the malaria mosquito, Anopheles gambiae. Proc Natl Acad Sci U S A 110, 19854-19859. Leister D (2005) Origin, evolution and genetic effects of nuclear insertions of organelle DNA. Trends Genet 21, 655-663. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics 25, 1754-1760. Li H, Durbin R (2011) Inference of human population history from individual whole- genome sequences. Nature 475, 493-496. Li H, Handsaker B, Wysoker A, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079. Li R, Fan W, Tian G, et al. 
(2010) The sequence and de novo assembly of the giant panda genome. Nature 463, 311-317. Lilley I (1992) Papua New Guinea's human past: the evidence of archaeology. In: Human Biology In Pupua New Guinea: The Small Cosmos (eds. Attenborough R, Alpers M), pp. 150-171. Oxford University Press, Oxford. Logue K, Chan ER, Phipps T, et al. (2013) Mitochondrial genome sequences reveal deep divergences among Anopheles punctulatus sibling species in Papua New Guinea. Malar J 12, 64. Logue K, Small ST, Chan ER, et al. (2015) Whole-genome sequencing reveals absence of recent gene flow and separate demographic histories for Anopheles punctulatus mosquitoes in Papua New Guinea. Mol Ecol 24, 1263-1274. Lounibos LP (2002) Invasions by insect vectors of human disease. Annu Rev Entomol 47, 233-266.

214 Mabaso ML, Sharp B, Lengeler C (2004) Historical review of malarial control in southern African with emphasis on the use of indoor residual house- spraying. Trop Med Int Health 9, 846-856. Maffi M (1973) Morphological observations on a population of the Punctulatus complex of Anopheles (Diptera: Culicidae) from the Rennell Island (Solomon Group). Natural history of Rennell Isalnd, British Solomon Islands 7, 29-40. Mahon R (1983) Identification of the three sibling species of Anopheles farauti laveran by the banding pattern of their polytene chromosomes. Journal of Australian Entomological Society 22, 31-34. Marinotti O, Cerqueira GC, de Almeida LG, et al. (2013) The genome of Anopheles darlingi, the main neotropical malaria vector. Nucleic Acids Res 41, 7387- 7400. Martinez-Torres D, Chandre F, Williamson MS, et al. (1998) Molecular characterization of pyrethroid knockdown resistance (kdr) in the major malaria vector Anopheles gambiae s.s. Insect Mol Biol 7, 179-184. Masella AP, Bartram AK, Truszkowski JM, Brown DG, Neufeld JD (2012) PANDAseq: paired-end assembler for illumina sequences. BMC Bioinformatics 13, 31. Mawejje HD, Wilding CS, Rippon EJ, et al. (2013) Insecticide resistance monitoring of field-collected Anopheles gambiae s.l. populations from Jinja, eastern Uganda, identifies high levels of pyrethroid resistance. Med Vet Entomol 27, 276-283. Menge DM, Zhong D, Guda T, et al. (2006) Quantitative trait loci controlling refractoriness to Plasmodium falciparum in natural Anopheles gambiae mosquitoes from a malaria-endemic region in western Kenya. Genetics 173, 235-241. Michael E, Ramaiah KD, Hoti SL, et al. (2001) Quantifying mosquito biting patterns on humans by DNA fingerprinting of bloodmeals. Am J Trop Med Hyg 65, 722- 728. Mitchell SN, Rigden DJ, Dowd AJ, et al. (2014) Metabolic and target-site mechanisms combine to confer strong DDT resistance in Anopheles gambiae. PLoS One 9, e92662. Mitchell SN, Stevenson BJ, Muller P, et al. 
(2012) Identification and validation of a gene causing cross-resistance between insecticide classes in Anopheles gambiae from Ghana. Proc Natl Acad Sci U S A 109, 6147-6152. Molaei G, Andreadis TG (2006) Identification of avian- and mammalian-derived bloodmeals in Aedes vexans and Culiseta melanura (Diptera: Culicidae) and its implication for West Nile virus transmission in Connecticut, U.S.A. J Med Entomol 43, 1088-1093. Molaei G, Andreadis TG, Armstrong PM, Anderson JF, Vossbrinck CR (2006) Host feeding patterns of Culex mosquitoes and West Nile virus transmission, northeastern United States. Emerg Infect Dis 12, 468-474. Molaei G, Andreadis TG, Armstrong PM, et al. (2007) Host feeding pattern of Culex quinquefasciatus (Diptera: Culicidae) and its role in transmission of West Nile virus in Harris County, Texas. Am J Trop Med Hyg 77, 73-81. Moreno M, Marinotti O, Krzywinski J, et al. (2010) Complete mtDNA genomes of Anopheles darlingi and an approach to anopheline divergence time. Malar J 9, 127.

215 Mumcuoglu KY, Gallili N, Reshef A, Brauner P, Grant H (2004) Use of human lice in forensic entomology. J Med Entomol 41, 803-806. Muturi CN, Ouma JO, Malele, II, et al. (2011) Tracking the feeding patterns of tsetse flies (Glossina genus) by analysis of bloodmeals using mitochondrial cytochromes genes. PLoS One 6, e17284. Najera JA, Gonzalez-Silva M, Alonso PL (2011) Some lessons for the future from the Global Malaria Eradication Programme (1955-1969). PLoS Med 8, e1000412. Namountougou M, Simard F, Baldet T, et al. (2012) Multiple insecticide resistance in Anopheles gambiae s.l. populations from Burkina Faso, West Africa. PLoS One 7, e48412. Neafsey DE, Christophides GK, Collins FH, et al. (2013) The evolution of the Anopheles 16 genomes project. G3 (Bethesda) 3, 1191-1194. Neafsey DE, Lawniczak MK, Park DJ, et al. (2010) SNP genotyping defines complex gene-flow boundaries among African malaria vector mosquitoes. Science 330, 514-517. Neafsey DE, Waterhouse RM, Abai MR, et al. (2015) Mosquito genomics. Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science 347, 1258522. Ngo KA, Kramer LD (2003) Identification of mosquito bloodmeals using polymerase chain reaction (PCR) with order-specific primers. J Med Entomol 40, 215-222. Nichols R (2001) Gene trees and species trees are not the same. Trends Ecol Evol 16, 358-364. Nikou D, Ranson H, Hemingway J (2003) An adult-specific CYP6 P450 gene is overexpressed in a pyrethroid-resistant strain of the malaria vector, Anopheles gambiae. Gene 318, 91-102. Norris LC, Fornadel CM, Hung WC, Pineda FJ, Norris DE (2010) Frequency of multiple blood meals taken in a single gonotrophic cycle by Anopheles arabiensis mosquitoes in Macha, Zambia. Am J Trop Med Hyg 83, 33-37. Norris LC, Norris DE (2013) Heterogeneity and changes in inequality of malaria risk after introduction of insecticide-treated bed nets in Macha, Zambia. Am J Trop Med Hyg 88, 710-717. Oshaghi MA, Chavshin AR, Vatandoost H, et al. 
(2006) Effects of post-ingestion and physical conditions on PCR amplification of host blood meal DNA in mosquitoes. Exp Parasitol 112, 232-236. Osta MA, Christophides GK, Kafatos FC (2004) Effects of mosquito genes on Plasmodium development. Science 303, 2030-2032. Owen W (1945) A new anopheline from the Solomon Islands with notes on its biology. Journal of Parasitology 31, 236-240. Pamilo P, Nei M (1988) Relationships between gene trees and species trees. Mol Biol Evol 5, 568-583. Paradis E, Claude J, Strimmer K (2004) APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289-290. Parkinson AD (1974) Malaria in Papua New Guinea 1973. Papua New Guinea Medical Journal 17, 8-16. Perry W (1950) Principal larval and adult habitats of Anopheles farauti Laveran in the British Solomon Islands. Mosquito News 10, 117-126.

216 Pichon B, Egan D, Rogers M, Gray J (2003) Detection and identification of pathogens and host DNA in unfed host-seeking Ixodes ricinus L. (Acari: Ixodidae). J Med Entomol 40, 723-731. Pinto J, Lynd A, Vicente JL, et al. (2007) Multiple origins of knockdown resistance mutations in the Afrotropical mosquito vector Anopheles gambiae. PLoS One 2, e1243. Poinar G, Zavortink T, Pike T, Johnston P (2000) Paleoculicis minutus (Diptera: Culicidae) n. gen., n. sp., from Cretaceous Canadian amber, with a summary of described fossil mosquitoes. Acta Geologica Hispanica 35, 119-128. Posada D (2008) jModelTest: phylogenetic model averaging. Mol Biol Evol 25, 1253- 1256. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155, 945-959. Proft J, Maier WA, Kampen H (1999) Identification of six sibling species of the Anopheles maculipennis complex (Diptera: Culicidae) by a polymerase chain reaction assay. Parasitol Res 85, 837-843. Qiu Q, Zhang G, Ma T, et al. (2012) The yak genome and adaptation to life at high altitude. Nat Genet 44, 946-949. Ranson H, Jensen B, Vulule JM, et al. (2000) Identification of a point mutation in the voltage-gated sodium channel gene of Kenyan Anopheles gambiae associated with resistance to DDT and pyrethroids. Insect Mol Biol 9, 491-497. Ranson H, N'Guessan R, Lines J, et al. (2011) Pyrethroid resistance in African anopheline mosquitoes: what are the implications for malaria control? Trends Parasitol 27, 91-98. Ranson H, Paton MG, Jensen B, et al. (2004) Genetic mapping of genes conferring permethrin resistance in the malaria vector, Anopheles gambiae. Insect Mol Biol 13, 379-386. Replogle J, Lord WD, Budowle B, Meinking TL, Taplin D (1994) Identification of host DNA by amplified fragment length polymorphism analysis: preliminary analysis of human crab louse (Anoplura: Pediculidae) excreta. J Med Entomol 31, 686-690. 
Rieckmann KH (2006) The chequered history of malaria control: are new and better tools the ultimate answer? Ann Trop Med Parasitol 100, 647-662. Riehle MM, Guelbeogo WM, Gneme A, et al. (2011) A cryptic subgroup of Anopheles gambiae is highly susceptible to human malaria parasites. Science 331, 596- 598. Riehle MM, Markianos K, Niare O, et al. (2006) Natural malaria infection in Anopheles gambiae is regulated by a single genomic control region. Science 312, 577-579. Rinker DC, Pitts RJ, Zhou X, et al. (2013) Blood meal-induced changes to antennal transcriptome profiles reveal shifts in odor sensitivities in Anopheles gambiae. Proc Natl Acad Sci U S A 110, 8260-8265. Riveron JM, Irving H, Ndula M, et al. (2013) Directionally selected cytochrome P450 alleles are driving the spread of pyrethroid resistance in the major malaria vector Anopheles funestus. Proc Natl Acad Sci U S A 110, 252-257.
