Drug Repurposing for Lymphatic Enabled with Multi-Species Genomics Approaches

Item Type dissertation

Authors Chung, Matthew

Publication Date 2019

Abstract Filarial diseases affect >200 million individuals worldwide, with another ~800 million individuals at risk across ~50 countries. The causative agents of filariasis are roundworms of the superfamily , also known as filarial . The m...

Keywords Filariasis--genetics; Fiariasis--prevention & control

Download date 09/10/2021 06:25:01

Link to Item http://hdl.handle.net/10713/10154 Curriculum Vitae

Matthew Chung Ph.D., 2019 [email protected]

University of Maryland Institute for Genome Sciences 670 W Baltimore St. HSFIII, Room 2050E Baltimore, MD 21201

Education 2014 B.S. in Microbiology, Rutgers University, New Brunswick, NJ, USA. Mentor: Dr. Constantino Vetriani: GPA: 3.39 2019 Ph.D. in Microbiology and Immunology, University of Maryland Baltimore, Baltimore, MD, USA. GPA: 3.49 Mentor: Dr. Julie Dunning Hotopp

Professional Experience 2011-2015 Research Assistant and Laboratory Technician. Mentor: Dr. Constantino Vetriani School of Environmental and Biological Sciences, Rutgers University, New Brunswick, NJ, USA. 2016-2019 Graduate Student. Mentor: Dr. Julie Dunning Hotopp Institute for Genome Sciences, University of Maryland Baltimore, Baltimore, MD, USA.

Honors and Awards 2012-2014 Dean’s List - Rutgers University, New Brunswick, NJ, USA. 2014 George H. Cook Scholar - Rutgers University, New Brunswick, NJ, USA.

Institutional Service 2015-2019 GPILS Microbiology and Immunology Student Recruitment Committee, University of Maryland Baltimore, Baltimore, MD, USA. 2017-2018 Microbiology and Immunology Journal Club, University of Maryland Baltimore, Baltimore, MD, USA.

Student Mentorship Summer 2018 Mentor of undergraduate student from University of Maryland College Park, Rayanna Patkus Summer 2018 Mentor of GPILS rotation student in Molecular Medicine, Britney Hazzard Publications - Peer-Reviewed Journal Articles 1. Giovannelli D.*, M. Chung*, J. Staley, V. Starovoytov, N. Le Bris, C. Vetriani. (2016) “Sulfurovum riftiae sp. nov., a mesophilic, thiosulfate-oxidizing, nitrate- reducing chemolithoautotrophic episilonproteobacterium isolated from the tube of the deep-sea hydrothermal vent polychaete Riftia pachyptila.” Int J Syst Evol Microbiol 66(7). PMID: 27116914. DOI: 10.1099/ijsem.0.001106. *Giovanelli and Chung are co-first authors. 2. Chung, M.*, S. T. Small*, D. Serre, P. A. Zimmerman, and J. C. Dunning Hotopp. (2017) “Draft Genome Sequence of the Wolbachia Endosymbiont of Wuchereria bancrofti wWb.” Pathog Dis 75(9). PMID: 29099918. DOI: 10.1093/femspd/ftx115. *Chung and Small are co-first authors. 3. Watkins, T. N., H. Liu, M. Chung, T. H. Hazen, J. C. Dunning Hotopp, S. G. Filler, and V. M. Bruno. (2018) “Comparative transcriptomics of Aspergillus fumigatus strains upon exposure to human airway epithelial cells.” Microb Genom. PMID: 29345613. DOI: 10.1099/mgen.0.000154. 4. Izzard L, M. Chung, J. C. Dunning Hotopp, G. Vincent, D. Paris, S. Graves, and J. Stenos. (2018) “Isolation of a divergent strain of japonica from Dew's Australian bat Argasid (Argas (Carios) dewae) in Victoria, Australia.” Ticks Borne Dis. PMID: 30025798. DOI: 10.1016/j.ttbdis.2018.07.007. 5. Chung M., L. E. Teigen, H. Liu, S. Libro, A. Shetty, N. Kumar, X. Zhao, R. E. Bromley, L. J. Tallon, L. Sadzewicz, C. M. Fraser, D. A. Rasko, S. G. Filler, J. M. Foster, M. L. Michalski, V. M. Bruno, J. C. Dunning Hotopp (2018) “Targeted enrichment outperforms other enrichment techniques and enables more multi- species RNA-Seq analyses” Sci Rep. PMID: 30190541. DOI: 10.1038/s41598-018- 31420-7. 6. Chung M., L. E. Teigen, S. Libro, R. E. Bromley, N. Kumar, L. Sadzewicz, L. J. Tallon, J. M. Foster, M. L. Michalski, J. C. Dunning Hotopp (2018) “Multispecies Transcriptomics Data Set of Brugia malayi, Its Wolbachia Endosymbiont wBm, and Aedes aegypti across the B. malayi Life Cycle” Microbiol Resour Announc. PMID: 30533772. DOI: 10.1128/MRA.01306-18 7. Chung, M., J. B. Munro, J. C. Dunning Hotopp. (2018) “Using Core Genome Alignments to Assign Bacterial Species” mSystems. PMID: 30534598. DOI: 10.1128/mSystems.00236-18 8. Chung, M., L. E. Teigen, S. Libro, R. E. Bromley, D. Olley, N. Kumar, L. Sadzewicz, L. J. Tallon, A. Mahurkar, J. M. Foster, M. L. Michalski, J. C. Dunning Hotopp. (2019) “Drug repurposing of bromodomain inhibitors as a novel therapeutic for lymphatic filariasis guided by multi-species transcriptomics” Nat Commun, in second revision. 9. Chung, M., R. S. Adkins, J. S. A. Mattick, K. R. Bradwell, A. C. Shetty, L. Sadzewicz, L. J. Tallon, C. M. Fraser, D. A. Rasko, A. Mahurkar, J. C. Dunning Hotopp. (2019) “FADU: A quantification tool for prokaryotic transcriptomic analyses” Genome Biol, submitted. 10. Shetty, A., M. Chung, A. Mahurkar, S. Filler, C. M. Fraser, D. A. Rasko, V. M. Bruno, J. C. Dunning Hotopp. (2019) “The Most Cost Effective, Experimentally Robust Differential Expression Analysis for Human/Mammalian, Pathogen, and Dual-Species Transcriptomics” in preparation. 11. Krauland, M. G., M. P. Pacey, M. Chung, J. C. Dunning Hotopp, L. H. Harrison. (2019) “Bacterial Genomics for Hospital Outbreak Investigation” J Clin Microbiol., in preparation. 12. Foster, J. F., A. Grote, J. Mattick, A. Tracey, Y-C. Tsai, K. Brookes, M. Chung, J. A. Cotton, T. A. Clark, A. Geber, N. Holroyd, J. Korlach, S. Libro, J. Lomax, S. Lustigman, M. L. Michalski, M. Paulini, M. B. Rogers, A. Twaddle, M. Berriman, J. C. Dunning Hotopp, E. Ghedin. (2019) “Genomics of Sex Chromosome Evolution in Parasitic Nematodes of Humans” in preparation. 13. Libro, S., M. Chung, J. Mattick, J. M. Foster, and J. C. Dunning Hotopp. (2019) “Genome sequencing and analysis of the Wolbachia endosymbiont of Brugia pahangi” in preparation. 14. Robinson K.M., L. Morningstar-Wright, M. Chung, N. Kumar, T.G. Blanchard, J.C. Dunning Hotopp. (2019) “Helicobacter-human genomics and co- transcriptomics in vitro and in vivo” in preparation.

Proffered Communications Proffered Oral Presentations 1. Chung, M. and J. C. Dunning Hotopp. (2018) “A taxonomic analysis of the using core genome alignments to recommend species guidelines.” 10th International Wolbachia Conference, Salem, MA, USA. 2. Chung, M., R. S. Patkus, B. Hazzard , L. E. Teigan, S. Libro, A. Grote, N. Kumar, R. E. Bromley, L. Sadzewicz, L. J. Tallon, E. Ghedin, M. L. Michalski, J. M. Foster, and J. C. Dunning Hotopp. (2010) “Transcriptional characterization of Wolbachia-B. malayi lateral gene transfers.” Molecular Helminthology, San Antonio, TX, USA. Proffered Oral Presentations (author, but not presenter) 1. Gasser, M., M. Chung, J. C. Dunning Hotopp. (2018) “wAna Genome from MinION Long-Read Sequencing Data: DNA Preparation to Assembly” 10th International Wolbachia Conference, Salem, MA, USA. 2. Bromley, R. E., J. Mattick, M. Chung, R. S. Adkins, J. C. Dunning Hotopp (2018) “Wolbachia Lateral Gene Transfer Events in Insects: A Whole Genome Analysis of Bugs and Beetles” 10th International Wolbachia Conference, Salem, MA, USA. 3. Adkins, R. S., J. Mattick, R. Bromley, M. Chung, D. Riley, K. Sieber, K. Robinson, A. Mahurkar, and J. C. Dunning Hotopp. (2017) “LGTSeek – A Robust, Distributable Lateral Gene Transfer Pipeline” CSHL Genome Informatics Conference, Cold Spring Harbor, NY, USA. 4. Chung, M., M. L. Michalski, L. Teigan, S. Libro, N. Kumar, R. E. Bromley, L. Sadzewicz, L. J. Tallon, J. M. Foster, and J. C. Dunning Hotopp (2018). “The Stage-specific Transcriptome of the Wolbachia Endosymbiont of Brugia malayi wBm through the Lifecycle” 10th International Wolbachia Conference, Salem, MA, USA.

Proffered Poster Presentations 1. Chung, M., L. Teigan, S. Libro, N. Kumar, R. E. Bromley, L. Sadzewicz, L. J. Tallon, J. M. Foster, M. L. Michalski, and J. C. Dunning Hotopp. (2018) “Discovery of a potential novel therapeutic for lymphatic filariasis guided by multi- species transcriptomics.” 67th American Society of Tropical Medicine & Hygeine General Meeting, New Orleans, LA, USA. 2. Chung, M., L. Teigan, S. Libro, N. Kumar, R. E. Bromley, L. Sadzewicz, L. J. Tallon, J. M. Foster, M. L. Michalski, and J. C. Dunning Hotopp. (2019) “Discovery of a potential novel therapeutic for lymphatic filariasis guided by multi-species transcriptomics.” Graduate Research Conference, University of Maryland Baltimore, Baltimore, MD, USA. Proffered Posters (author, but not presenter) 1. Adkins, R. S., J. Mattick, R. Bromley, M. Chung, D. Riley, K. Sieber, K. Robinson, A. Mahurkar, and J. C. Dunning Hotopp. (2017) “LGTSeek – A Robust, Distributable Lateral Gene Transfer Pipeline” CSHL Genome Informatics Conference, Cold Spring Harbor, NY, USA. 2. Mattick, J., S. Adkins, R. Bromley, M. Chung, and J. C. Dunning Hotopp (2018) “A Novel Tool for Detecting Lateral Gene Transfer – LGTSeek” 10th International Wolbachia Conference, Salem, MA, USA. 3. Libro, S., M. Chung, J. Mattick, J. Foster, and J. C. Dunning Hotopp (2018) “Genome Sequencing and Analysis of the Wolbachia Endosymbiont of Brugia pahangi” 10th International Wolbachia Conference, Salem, MA, USA. 4. Chung, M., L. Teigan, S. Libro, N. Kumar, R. E. Bromley, L. Sadzewicz, L. J. Tallon, J. M. Foster, M. L. Michalski, and J. C. Dunning Hotopp. (2019) “Discovery of a potential novel therapeutic for lymphatic filariasis guided by multi-species transcriptomics.” Molecular Helminthology, San Antonio, TX, USA. Abstract

Title of Dissertation: Drug Repurposing for Lymphatic Filariasis Enabled with Multi- Species Genomics Approaches Matthew Chung, Doctor of Philosophy, 2019 Dissertation Directed by: Julie C. Dunning Hotopp, PhD, Associate Professor, Department of Microbiology & Immunology, Institute for Genome Sciences

Filarial diseases affect >200 million individuals worldwide, with another ~800 million individuals at risk across ~50 countries. The causative agents of filariasis are roundworms of the superfamily Filarioidea, also known as filarial nematodes. The most common human filarial disease is lymphatic filariasis, caused by one of three filarial nematodes: Wuchereria bancrofti, Brugia malayi, and Brugia timori. There are four different antihelminthic therapeutics for lymphatic filariasis—albendazole, diethylcarbamazine, ivermectin, and doxycycline. Three are primarily microfilaricidal and have limited efficacy on adult worms. The fourth therapeutic, doxycycline, adulticidal, but causes complications when administered to pregnant women or children

<9 years of age. Combined with the emergence of drug resistance in filarial nematodes, new therapeutics are needed for the treatment for lymphatic filariasis. To identify potential drug targets, we conducted a multi-species transcriptomics analysis of B. malayi, its obligate mutualistic endosymbiont Wolbachia endosymbiont wBm, and its laboratory vector host A. aegypti across the entire B. malayi life cycle. We validated the use of the Agilent SureSelect RNA-Seq capture platform to enrich for B. malayi and wBm sequencing reads in low coverage samples, namely those taken during the B. malayi vector life stages in A. aegypti. For the analysis of wBm we developed FADU (Feature

Aggregate Depth Utility), a quantification tool designed for prokaryotic RNA-Seq analyses, with an emphasis on accurately quantifying operonic genes. From our

transcriptomics analysis, we identified an over-enrichment of the bromodomain and

extra-terminal (BET) family of transcription factors, upregulated in the adult female,

embryo, and microfilariae life stages. In previous studies, knockdowns of the BET

proteins in Caenorhabiditis elegans lead to adult worm sterility and sometimes lethality.

We treated adult female B. malayi in vitro with the BET inhibitor JQ1(+) and found that

JQ1(+) induced sterility and worm death at lower concentrations than ivermectin, suggesting BET inhibitors may be promising antihelminthics. We also sequenced the genome of the Wolbachia endosymbiont of W. bancrofti, wWb, and used comparative

genomics to identify conserved genes between lymphatic filariae Wolbachia.

Additionally, we developed an approach for determining species using core genome

alignments to reassess the supergroup designations in the genus Wolbachia. Drug Repurposing for Lymphatic Filariasis Enabled with Multi-Species Genomics Approaches

by Matthew Chung

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, Baltimore in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2019

©Copyright 2019 Matthew Chung All rights reserved

Acknowledgements

First and foremost, I want to thank my mentor Dr. Julie Dunning Hotopp for everything

she has done for me over my graduate school career. I don’t think that I can express how

much she’s done for me on paper, much less in a single paragraph. Whether it was

helping me grow as a scientist or improving my confidence in my own work, I wouldn’t

be anywhere near where I am today without her guidance. I also want to thank all the

members in the Dunning Hotopp Lab over the years for all their help, especially John

Mattick, Robin Bromley, Mark Gasser, Eric Tvedte, and Nikhil Kumar. I couldn’t have asked for a better group of people to work with.

I would like to thank our collaborators Drs. Jeremy Foster, Silvia Libro, and Michelle

Michalski for all their work with the nematode samples and the transcriptomics project.

I would like to thank Drs. Sasisekhar Bennuru, Vincent Bruno, Joana Carneiro da Silva,

Claire Fraser, David Serre for serving on my thesis committee and helping me not just with my projects, but with my future career plans. I also have to thank Drs. Bret Hassel and Heather Ezelle for all their help in the Molecular Microbiology and Immunology

Program.

I would like to thank my family for all their help and support throughout graduate school, especially my amazing girlfriend Eugenie, my parents Andy and Tiffany, and my cousins

Joanna, Gary, and Charlene. I’ve only gotten this far because of all of your support.

Lastly, I would like to thank all the friends and colleagues I’ve met in Baltimore. I’ve heard a lot of complaints about graduate school and I frankly can’t relate much because of all of you. iii

Table of Contents

Acknowledgements ...... iii Table of Contents ...... iv List of Figures ...... ix List of Tables ...... xi List of Abbreviations ...... xii Chapter 1: Introduction ...... 1 Human filariasis ...... 1 Lymphatic filariasis: Wuchereria bancrofti, Brugia malayi, and Brugia timori ...... 1 Onchocerciasis: Onchocerca volvulus ...... 2 Loaisis: Loa loa ...... 3 Mansonellosis: ozzardi, , and Mansonella streptocerca ...... 3 Diagnosing filariasis ...... 4 Filarial vaccine development ...... 7 Filariasis treatment options ...... 7 The supergroup classification system ...... 10 Wolbachia host interactions ...... 12 Nematode Wolbachia host interactions ...... 15 Leveraging Wolbachia dependence for infectious disease control ...... 17 ‘Omics-based apparoches to filarial nematodes and their Wolbachia endosymbionts .. 19 Concluding remarks ...... 21 Authors’ Contributions ...... 24 Chapter 2: Targeted enrichment outperforms other enrichment techniques and enables more multi-species RNA-Seq analyses ...... 25 Abstract ...... 25 Introduction ...... 26 Methods ...... 28 Mosquito preparation ...... 28 Nematode preparation ...... 29 Mosquito/Nematode/Wolbachia RNA isolation ...... 30 A. fumigatus infection ...... 31 A. fumigatus RNA isolation ...... 31

iv

Illumina NEBNext Ultra Directional RNA libraries without capture ...... 32 AgSS probe design ...... 32 Wolbachia AgSS RNA capture ...... 33 B. malayi AgSS RNA capture ...... 33 A. fumigatus AgSS capture ...... 34 Sequence alignment, feature counts, and fold enrichment calculations ...... 35 Results ...... 36 Agilent SureSelect probe G + C content and specificity ...... 36 Comparison of poly(A)-enriched AgSS to only poly(A)-enriched libraries for B. malayi ...... 37 Comparison of poly(A)-enriched AgSS to only poly(A)-enrichment libraries for A. fumigatus ...... 44 Comparison of total RNA AgSS to rRNA-, poly(A)-depleted libraries for the bacterial Wolbachia endosymbiont wBm ...... 47 Comparison of AgSS B. malayi transcriptome data from total RNA and poly(A)- enriched samples ...... 55 Recovering the transcriptome of the major organism from libraries prepared with the AgSS of the minor organism ...... 58 Discussion ...... 60 Data Availability ...... 63 Additional Information ...... 63 Acknowledgements ...... 63 Authors’ Contribution ...... 63 Chapter 3: FADU: A quantification tool for prokaryotic transcriptomic analyses ...... 64 Abstract ...... 64 Importance ...... 65 Introduction ...... 65 Results ...... 69 Comparing the performance of FADU against other quantification tools ...... 69 Clustering analyses of the different quantification modes ...... 78 Timing and memory benchmarks ...... 79 Discussion ...... 80 Materials and Methods ...... 82 Implementation ...... 82 Computing non-overlapping annotation feature coordinates ...... 82 v

Defining fragments versus reads ...... 82 Processing overlaps between alignments and features ...... 83 Writing output ...... 84 Availability of data sets ...... 84 Quantification tool comparisons...... 84 Acknowledgements ...... 85 Authors’ Contributions ...... 85 Chapter 4: Drug repurposing of bromodomain inhibitors as a novel therapeutic for lymphatic filariasis guided by multi-species transcriptomics ...... 86 Abstract ...... 86 Keywords ...... 87 Background ...... 87 Results ...... 89 B. malayi transcriptome ...... 89 wBm transcriptome ...... 100 A. aegypti transcriptome ...... 101 B. malayi drug target identification and in vitro validation ...... 106 Discussion ...... 108 Methods ...... 111 Sample preparation--Nematodes ...... 111 Sample preparation--Mosquitoes...... 112 RNA isolation ...... 113 Library preparation, capture system design, and stranded RNA sequencing ...... 114 Read alignments, feature counting, and differential expression analysis ...... 116 WGCNA module detection and functional term enrichments ...... 117 B. malayi response to JQ1(+) ...... 118 Declarations ...... 119 Ethics approval and consent to participate ...... 119 Consent for publication ...... 119 Availability of data and materials ...... 119 Competing interests ...... 119 Funding ...... 119 Authors’ Contributions ...... 120 Acknowledgements ...... 120 vi

Chapter 5: Draft genome sequence of the Wolbachia endosymbiont of Wuchereria bancrofti wWb ...... 121 Abstract ...... 121 Keywords ...... 121 Introduction ...... 121 Materials and Methods ...... 122 Genome sequencing, assembly and annotation ...... 122 Phylogenetic and comparative genomic analyses ...... 123 Identification of lower confidence positions in wWb ...... 124 Results and Discussions ...... 126 Summary ...... 132 Funding ...... 133 Supplementary Data ...... 133 Authors’ Contribution ...... 133 Chapter 6: Using core genome alignments to assign bacterial species ...... 134 Abstract ...... 134 Importance ...... 135 Keywords ...... 135 Introduction ...... 135 Results ...... 139 Advantages of nucleotide alignments over protein alignments for bacterial species analyses ...... 139 Can core genome alignments be constructed from within a genus? ...... 143 Substitution saturation ...... 145 Assessment of genus designations ...... 146 Identifying a CGASI for species delineation ...... 147 phylogenomic analyses ...... 151 phylogenomic analyses ...... 156 Other taxa ...... 165 Discussion ...... 165 Workflow for the taxonomic assignment of new genomes at the genus and species levels ...... 165 The genomics and data era may necessitate increased flexibility or a modification of the rules in bacterial nomenclature ...... 168 Limitations ...... 170 vii

Conclusions ...... 170 Materials and Methods ...... 171 Synteny plots ...... 171 Core genome alignments ...... 172 Core protein alignments ...... 173 Authors’ Contributions ...... 173 Acknowledgements ...... 173 Chapter 7: Considerations for adopting species designations in the genus Wolbachia .. 174 Abstract ...... 174 Introduction ...... 174 Methods for protein and whole-genome based phylogenetic analyses ...... 178 The International Code of Nomenclature ...... 182 The Candidatus epithet ...... 184 Naming Wolbachia species ...... 185 The designation of Wolbachia type material ...... 186 Genome quality standards for use as type material ...... 187 Future Directions ...... 187 Authors’ Contributions ...... 189 Chapter 8: Conclusions ...... 190 Considerations for multi-species transcriptomics experiments ...... 190 Prokaryotic-based quantification tools for RNA-Seq analyses ...... 191 The comprehensive transcriptome of lymphatic filariasis in the jird model ...... 193 BET inhibitors as a potential therapeutic for lymphatic filariasis ...... 194 The state of the Wolbachia ...... 196 Authors’ Contributions ...... 198 References ...... 199

viii

List of Figures

Figure 2.1: G+C content of Brugia malayi AgSS probes ...... 37 Figure 2.2: The impact of G+C content on capturing Brugia malayi genes ...... 38 Figure 2.3: Sample and library preparation...... 39 Figure 2.4: Genes identified only in either the poly(A)-enriched AgSS or poly(A)- enrichment libraries from B. malayi samples...... 40 Figure 2.5: Comparison of the poly(A)-enriched to the poly(A)-enriched AgSS B. malayi transcriptomes from 18 hpi, 4 dpi, and 8 dpi vector life stages...... 42 Figure 2.6: Genes identified only in either the poly(A)-enriched AgSS or poly(A)- enrichment libraries from A. fumigatus samples ...... 45 Figure 2.7: Genes identified only in either the total RNA AgSS or rRNA-, poly(A)- depletion libraries from wBm samples ...... 48 Figure 2.8: Comparison of the rRNA-, poly(A)-depleted to the total RNA AgSS wBm transcriptomes from the female 24 dpi and adult B. malayi life stages...... 53 Figure 2.9: Comparison of the poly(A)-enriched AgSS to the total RNA AgSS B. malayi transcriptomes from 18 hpi and 8 dpi vector life stages ...... 54 Figure 2.10: Genes identified only in either the total RNA AgSS or poly(A)-enriched AgSS libraries from B. malayi samples ...... 56 Figure 2.11: Comparison of the poly(A)-enrichment to the total RNA A. fumigatus AgSS library preparations of the transcriptomes of 2 dpi immunocompromised mouse...... 59 Figure 2.12: Scatter plot of the percentage of poly(A)-enrichment or depletion reads mapped against the fold enrichment value observed using the AgSS follows a power law ...... 61 Figure 3.1: Comparing fragments counts from FADU versus other quantification tools. 70 Figure 3.2: MA plots comparing wBm fragment counts obtained using FADU against the other quantification tools...... 71 Figure 3.3: The performance of quantification tools on deriving counts for operons...... 74 Figure 3.4: The improper quantification of fragments that span multiple features...... 76 Figure 3.5: Clustering patterns of the fragment counts derived from different quantification methods...... 77 Figure 3.6: FADU timing and memory benchmarking...... 78 Figure 4.1: Brugia malayi life cycle and sample selection ...... 90 Figure 4.2: Hierarchical clustering and PCA of differentially expressed B. malayi genes ...... 92 Figure 4.3: B. malayi WGCNA expression modules ...... 93 Figure 4.4: Hierarchical clustering and PCA of the differentially expressed wBm genes 98 Figure 4.5: wBm WGCNA expression modules and 6S RNA expression ...... 99 Figure 4.6: Hierarchical clustering and PCA of the differentially expressed A. aegypti genes ...... 102 Figure 4.7: A. aegypti WGCNA expression modules ...... 103 Figure 4.8: Effects of JQ1(+) treatment on adult female B. malayi ...... 109 Figure 5.1: Sequencing depth distribution of the wWb assembly...... 125 Figure 5.2: Validation of four median absolute deviations (MADs) from the major mode as a LGT cutoff...... 126

ix

Figure 5.3: Identification of high sequence variation regions...... 127 Figure 5.4: wWb phylogeny and synteny...... 128 Figure 5.5: Circos plot of NUCmer linkages between W. bancrofti and wWb, wWb sequencing depth, and wWb SNPs and indels...... 129 Figure 5.6: Low confidence positions in the wWb genome identified using sequencing depth, sequence variation, and shared genomic regions between wWb and W. bancrofti...... 130 Figure 6.1: Synteny between the 10 complete Wolbachia genomes...... 142 Figure 6.2: Comparison of phylogenomic trees generated using the core nucleotide alignments versus protein-based alignments for 10 complete Wolbachia genomes...... 144 Figure 6.3: Assessing substitution saturation for core genome alignments...... 148 Figure 6.4: ANI, dDDH, and CGASI correlation analysis...... 149 Figure 6.5: Analysis of the ANI, dDDH, and CGASI values of 69 Rickettsia genomes.153 Figure 6.6: Analysis of the ANI, dDDH, and CGASI values of three Orientia genomes...... 154 Figure 6.7: Analysis of the ANI, dDDH, and CGASI values of 16 genomes. 156 Figure 6.8: Analysis of the ANI, dDDH, and CGASI values of 30 Anaplasma genomes...... 158 Figure 6.9: Analysis of the ANI, dDDH, and CGASI values of 4 genomes...... 161 Figure 6.10: Analysis of the ANI, dDDH, and CGASI values of 23 Wolbachia genomes...... 163 Figure 6.11: Analysis of the ANI, dDDH, and CGASI values of 66 Neisseria genomes...... 164 Figure 6.12: Workflow for the taxonomic assignment of a novel genome at the genus and species levels...... 167 Figure 7.1: Neighbor-network tree of 23 Wolbachia genomes...... 182

x

List of Tables

Table 2.1: RNA-Seq statistics for poly(A)-enriched AgSS and poly(A)-enrichment libraries from B. malayi samples collected following infection of the mosquito vector... 41 Table 2.2a: RNA-Seq statistics for AgSS and poly(A)-enrichment libraries from A. fumigatus...... Error! Bookmark not defined. Table 2.2b: RNA-Seq statistics for AgSS and poly(A)-enrichment libraries from A. fumigatus...... Error! Bookmark not defined. Table 2.3: RNA-Seq statistics for AgSS and rRNA-, poly(A)-depletion libraries from wBm...... 50 Table 2.3 (cont’d): RNA-Seq statistics for AgSS and rRNA-, poly(A)-depletion libraries from wBm ...... 51 Table 2.4: RNA-Seq statistics for B. malayi transcriptome data using AgSS with total RNA or poly(A)-enriched RNA ...... 57 Table 2.1: RNA-Seq life cycle samples and library preparations ...... 91 Table 5.1: Wolbachia genomes used for phylogenetic analysis ...... 124 Table 6.1: Ten complete Wolbachia genomes used for PhyloPhlAn and nucleotide and protein core genome analyses ...... 140 Table 6.2: Core genome alignment statistics ...... 145

xi

List of Abbreviations

A-WOL: Anti-Wolbachia Consortium

AF293: A. fumigatus strain designation

AgSS: Agilent SureSelect

ALB: albendazole

ANI: average nucleotide identity

ANNOVAR: Annotate Variation

ATP: adenosine triphosphate

BALB/c: an albino, laboratory-bred strain of mouse

BAM: binary alignment map

BLAST: Basic Local Alignment Search Tool

BET: bromo- and extra-terminal domain

BRD: bromodomain containing 4 gene

BTB: broad-complex, tramtrack and bric a brac domain

BWA: Burrow Wheels Aligner

CGASI: core genome alignment sequence identity

CI: cytoplasmic incompatibility

CPM: counts per million

DDBJ: DNA Data Bank of Japan dDDH: digital DDH

DDH: DNA-DNA hybridization

DEC: diethylcarbamazine

DMSO: dimethylsulfoxide

xii

DNA: deoxyribonucleic acid dpi: days post-infection

EGF: epidermal growth factor

EMBL: European Molecular Biology Laboratory

ENA: European Nucleotide Archive

FAD: flavin adenine dinucleotide

FADU: Feature Aggregate Depth Utility

FADU-O: Feature Aggregate Depth Utility - Operons

FDR: false discovery ratio

FEMSPD: Federation of European Microbiological Societies; Pathogens and Disease

FR3: Filariasis Research Reagent Resource Center

GC/G + C: guanine and cytosine gEAR: gene expression analysis resource

GFF: general feature format

GGDC: Genome-to-Genome Distance Calculator

GLIMMER: gene locator and interpolated Markov modeler

GluCl channels: glutamate-gated chloride channels

GPS: global positioning system

GO: gene ontology

GOLD: Golgi dynamics

GTDB-Tk: Genome Taxonomy Database Toolkit

GTR + GAMMA: general time-reversible with a gamma distribution; DNA sequence

evolution model

xiii

HISAT: hierarchical indexing for spliced alignment of transcripts

hpi: hours post-infection

IACUC: Institutional Care and Use Committee

ICNP: International Code of Nomenclature of Prokaryotes

IGS: Institute for Genome Sciences

IJSB: International Journal of Systematic Bacteriology

IJSEM: International Journal of Systematic and Evolutionary Microbiology

IVM: ivermectin

JQ1(+): BET inhibitor

JQ1(-): negative control/enantiomer of JQ1(+)

LAMP: loop-mediated isothermal amplification

LCB: locally collinear blocks

LGT: lateral gene transfer

LIPS: luciferase immunoprecipitation System

LISXP-1: L. loa antigen used as biomarker for loiasis

LPSN: List of Prokaryotic Names with Standing in Nomenclature

LSM: like Sm domain

MA plot: Bland-Altman plot used to visualize genomic data

MAAFT: Multiple Alignment using Fast Fourier Transform

MDA: mass drug administration

ML: maximum-likelihood

MLSA: multi-locus sequence analysis

MLST: multi-locus sequence typing

xiv mRNA: messenger RNA

MTT: 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide; tetrazolium dye

NCBI: National Center for Biotechnology Information ncRNA: noncoding RNA

NEB: New England Biolabs

NF-κB: nuclear factor kappa-light-chain-enhancer of activated B cells

NIAID: National Institute of Allergy and Infectious Diseases

NIH: National Institute of Health

ORF: open reading frame

PCA: principal component analysis

PCR: polymerase chain reaction

PHD: plant homeodomain qCML: quantile-adjusted conditional maximum likelihood method poly(A): polyadenylated

RACE: rapid amplification of cDNA ends

RED: relative evolutionary divergence

RNA: ribonucleic acid

RNAi: RNA interference rRNA: ribosomal RNA

RNA-Seq: RNA-sequencing

RPMI + P/S: Roswell Park Memorial Institute medium + 0.4 U penicillin and 4 µg streptomycin per mL

RSEM: RNA_Seq by expectation-maximization

xv

SANT/Myb: Swi3, Ada2, N-Cor, and TFIIIB domain

SFG: group

SILVA: 16S rRNA gene database

SRA: Sequencing Read Archive

TPM: transcripts per million

tRNA: transfer RNA

UMSOM: University of Maryland, School of Medicine

UTR: untranslated region

UWO: University of Wisconsin Oshkosh

WGCNA: weighted correlation network analysis

WO virus: prophage found in Wolbachia

xvi

Chapter 1: Introduction

Human filariasis

Nematodes are one of the most common causes of parasitic infections in humans (1).

Within this, nematodes of the superfamily Filarioidea, also known as filarial nematodes, infect ~200 million individuals worldwide with >800 million individuals across >50 countries at risk (2-4). Filarial worms infect two hosts across their life cycle, with , such as flies or mosquitoes, serving as their vector host and a mammal serving as their definitive host. Filarial nematodes are spread as infected vector hosts deposit infective L3 larvae into the definitive host upon taking a blood meal. The larvae mature into sexually-active adults over the course of months and produce microfilariae

(5). The microfilariae then migrate into the blood of the definitive host, where they can be taken up by a vector taking a blood meal (6). Once in the vector, the microfilariae mature through three molts to the infective L3 larval stage, at which point the life cycle repeats. While filarial nematodes are known to infect a variety of mammalian hosts, eight nematodes are known to infect humans and cause filariasis (4).

Lymphatic filariasis: Wuchereria bancrofti, Brugia malayi, and Brugia timori

Three of the filarial nematodes that infect humans—Wuchereria bancrofti, Brugia malayi, and Brugia timori—cause lymphatic filariasis, an affliction in which the nematodes infect and damage the lymphatic system of infected humans (7). Of the different types of filariasis, lymphatic filariasis is the most prevalent with Wuchereria bancrofti and Brugia malayi infections being the first and second most commonly reported cases of filarial nematode infections, respectively. Lymphatic filariasis affects

1 over 120 million people worldwide across 72 countries in Asia, Africa, and South

America (8).

Although most infected individuals are asymptomatic, a small percentage of infected individuals develop lymphedema years after infection (7). As the filarial nematodes migrate through the lymphatic vessels of the infected individual, the nematodes have the potential of clogging or damaging the lymphatic system. This blockage of the lymphatic system leads to an inability to drain lymph fluid. This build-up of fluid causes the swelling of a limb corresponding to the area of the damaged lymphatic, known as lymphedema (7). While the legs are most often affected by the lymphedema, it can also affect the arms, breasts, or genitals. In males, lymphedema can also manifest in scrotal swelling due to a collection of liquid from the damaged intrascrotal lymphatics, termed hydrocele (9). Additionally, the damage to the lymphatic system leaves the affected host more susceptible to bacterial infections (7, 10).

Onchocerciasis: Onchocerca volvulus

Onchocerca volvulus is the third most commonly reported filarial nematode infection, being responsible for onchocerciasis, also known as river blindness. Cases of onchocerciasis are spread across 34 countries, with areas in sub-Saharan Africa and

Yemen having the highest incidence of disease. As of 2015, over 15.5 million people are affected by onchocerciasis, with ~1 million having some degree of vision loss (2, 11).

Onchocerciasis is transmitted specifically by infected black flies. Instead of the nematode migrating to the lymphatic system, O. volvulus larvae migrate to the skin

2 and/or eyes. The death of O. volvulus in the skin or eyes of the afflicted individual triggers host inflammation that can lead to skin disease and/or blindness.

Loaisis: Loa loa

Loa loa is a filarial nematode responsible for a rare subset of filariasis known as loiasis, based primarily in Western and Central Africa (12). Loiasis is transmitted by Chrysops horseflies and resides in the subcutaneous tissues of infected individuals while its microfilariae reside in the blood. While loiasis is also typically asymptomatic, infected individuals have reported intermittent, itchy and nonpainful swellings as symptoms.

Additionally, cases of loiasis have reported instances of Loa loa crawling under the surface of the affected individual’s eye (12).

Mansonellosis: Mansonella ozzardi, Mansonella perstans, and Mansonella streptocerca

The last three of the human-infecting filarial nematodes—Mansonella perstans,

Mansonella ozzardi, and Mansonella streptocerca—are responsible for a subset of filarial diseases known as mansonellosis (13). While the three worms have been classified as members of the same genus, the cases of mansonellosis that arise from each of the three species have marked differences. Cases of M. ozzardi infections are primarily located in the Caribbean in Central and South America. M. perstans is endemic to North, and sub-

Saharan Africa along with South America, while M. streptocerca is endemic to Western and Central Africa (13). The reported vectors for M. ozzardi include the family of biting-midges and Simulium black flies, while only biting-midges serve as the vector host for M. perstans and M. streptocerca (14). M. ozzardi, M.

3

perstans, and M. streptocerca also reside in different areas of their human host. The

microfilariae of M. ozzardi resides in both the blood and skin of its human host while M. perstans resides in only the blood and M. streptocerca resides in only the skin. The

filariae of M. ozzardi reside in the lymphatic vessels, pericardium, and peritoneum while

M. perstans filariae only reside in the pericardium and the peritoneum. In stark contrast,

M. streptocerca resides in the subcutaneous tissues of the dermis (13, 14). Despite cases

of mansonellosis often being asymptomatic, reported infections with each of the three

species have had differing clinical symptoms. Fever, joint pains, keratitis, and face edema

have been reported for M. ozzardi; itching, joint pains, and enlarged lymph glands have

been reported for M. perstans; and dermatitis and rash have been reported for M.

streptocerca (14).

Diagnosing filariasis

Traditionally for the diagnosis of loiasis, lymphatic filariasis, and certain types of

mansonellosis, blood is taken from the patient and checked for microfilariae using

Giemsa stains (4, 13, 15). However, the times in which the blood samples are taken vary

depending on the circadian periodicity of each filarial nematode. By only appearing in the

blood of their definitive host at times coinciding with the feeding times of their vector

host, filarial nematodes are better able to avoid immune clearance while optimizing their

propagation by vector hosts. For individuals infected with W. bancrofti and B. malayi,

blood samples taken at night are often found to contain an increased concentration of

microfilariae, coinciding with the feeding times of their vector host (6). However, the

periodicity of filarial nematodes has been shown to vary within the same species, with W.

bancrofti in the Southern Pacific having diurnal periodicity (16). For L. loa, diurnal 4

periodicity is observed, with the maximum microfilariae density in the blood of its

infected host occurring during daytime, coinciding with the feeding times of the vector

Chrysops host (17).

While M. perstans and M. ozzardi microfilariae have been detected in blood, the

microfilariae of M. streptocerca are only found in the skin and are instead detected using

skin snips, small biopsies of the upper dermal layer of the skin (13). In comparison,

Mansonella microfilariae show very weak, if any, periodicity (13, 18, 19).

Onchocerciasis is diagnosed through the detection of microfilariae in skin snips during

the day, due to the diurnal periodicity observed with O. volvulus (4). However, the detection of filarial nematodes through microscopy lacks sensitivity, with microfilariae detection often depending on the type of infection, volume of blood sampled, and the time of sampling due to the different periodicities observed in filarial nematodes (20).

Additionally, a high degree of technical expertise is needed to differentiate different filarial species through microscopy. Furthermore, for infected individuals, microfilariae are not always detected with blood smears or skin snips. As an example, L. loa microfilariae are only detected in 30% of all infected individuals (15).

Because of the limitations of microscopy-based diagnostic techniques, antigen detection- based methods are increasingly being used for the diagnosis of filarial diseases (4).

Immunochromatographic tests (ICTs) are currently the most sensitive method for detecting and diagnosing bancroftian lymphatic filariasis. By adding non-pretreated human sera to a sample pad containing dried polyclonal antifilarial antibodies coupled to colloidal gold, sera positive for W. bancrofti antigens can be detected in <5 min (20).

5

Compared to microscopy-based techniques, diagnoses using ICTs are independent of

sampling time, being able to detect residual antigens in the blood of affected individuals.

Serology-based tests for the diagnosis of filarial diseases continue to be improved by

increasing the sensitivity for specific filarial antigens while minimizing the cross-

reactivity between antigens of different filarial species. Serology-based lateral flow tests

are commercially available for the detection of W. bancrofti and O. volvulus antigens. SD

Bioline currently stocks ICT kits for diagnosing lymphatic filariasis and onchocerciasis through the detection of IgG4 antibodies to the Wb123 antigen for W. bancrofti and the

Ov16 antigen for O. volvulus from blood samples (21, 22). While no serology-based diagnostic tools are currently available for mansonellosis (14), a luciferase immunoprecipitation system (LIPS) is currently in development for the diagnosis of loiasis through the detection of the LISXP-1 antigen of L. loa (23, 24).

In addition to antigen-based techniques, there are numerous molecular-based diagnostic techniques for filariasis. Polymerase chain reaction (PCR) techniques have been used to diagnose and/or monitor filariasis using primers specific to the different filarial nematode species (25-29). More recently, loop-mediated isothermal amplification (LAMP) assays have improved upon PCR-based diagnostic methods. Similar to PCR methods, LAMP assays measure the amplification of a DNA product from primers specific to different filarial nematode species. However, LAMP assays do not require thermocyclers and product amplification is measured by fluorescence or turbidity directly from the reaction tube, saving both time and resources (30, 31). LAMP assays are currently available for

W. bancrofti, B. malayi, O. volvulus, and L. loa (30). PCR and LAMP-based assays have

6

also been adapted for xenomonitoring, in which the progress of filariasis elimination

programs is evaluated based on the detection of filarial DNA in environmental vectors

(32-34).

Filarial vaccine development

As an approach to preventing and blocking the transmission of filariasis, vaccines are

currently in development against filarial nematodes. Despite being in development since

the 1940s, there are currently no available vaccines against any form of filariasis (35).

This is partly due to the limited ability of the human filarial nematodes to infect non-

human mammals, with only B. malayi, O. volvulus, and L. loa being known to infect

different mammalian . While numerous vaccine studies have been conducted in

different model systems for B. malayi, the most relevant was conducted in in rhesus

monkeys. Using irradiated L3 B. malayi larvae, a protective immune response was able to be generated against B. malayi for 12 months post-vaccination (36). However, protection using L3 larvae in different filarial models has been shown to vary, with a study in chimpanzees showing that vaccinations with irradiated L3 O. volvulus larvae offer no additional protection (37). Similarly, in the mandrill model of L. loa, vaccination using irradiated L3 larvae showed a delay in the time of peak microfilaremia but no significant decrease in total L. loa microfilariae (38).

Filariasis treatment options

Current mass drug administration (MDA) programs administer albendazole (ALB) in combination with ivermectin (IVM) or diethylcarbamazine (DEC) for lymphatic filariasis

(4). ALB is a carbamate benzimidazole that inhibits the formation of microtubules and

7

the polymerization of worm β-tubulin (39). When used in combination with DEC or

IVM, ALB has been observed to be more effective in reducing microfilarial loads for a longer period compared to DEC or IVM alone (40). IVM is a macrocyclic lactone known to be microfilaricidal and sterilizing to adult females. IVM functions by interacting and inhibiting postsynaptic glutamate-gated chloride (GluCl) channels specific to the

Nematoda and Arthropoda phyla. Inhibition of the GluCl channels has been found to both prevent the release of microfilariae by female worms and immobilize microfilariae, which can then be easily cleared by the host immune system (41). DEC is a piperazine derivative thought to be the most effective therapeutic for human filariasis. Although its mechanism is currently unknown, DEC has been hypothesized to affect the cyclooxygenase, lipoxygenase, and NF-kB signaling pathways in microfilariae (42). DEC is also thought to alter pathways in the human host including the arachidonic acid and nitric oxide metabolic pathways, which are thought to play a key role in facilitating a host immune response to microfilariae (43, 44).

While three different antihelminthics are currently prescribed for the treatment of lymphatic filariasis, IVM is the sole drug prescribed for the treatment of onchocerciasis

(4) with ALB not having any effects on microfilariae loads (45, 46). The use of DEC in patients with heavy parasitemia is not advised due to the elicitation of a large host inflammatory response from the large release of Wolbachia from dead nematodes (47,

48). In cases of ocular onchocerciasis, the strong inflammatory response elicited from

DEC can lead to death (4). For loiasis, the treatment of choice is DEC combined with

ALB, with IVM having been found to cause encephalopathy when used as a treatment

(49). Additionally, DEC in loiasis patients with heavy parasitemia are under risk of brain 8

inflammation (49). Compared to the other three forms of filariasis, there are currently no

universal guidelines for the best treatment of mansonellosis. Treatment options vary

between Mansonella species, with IVM being used to treat M. ozzardi, DEC and

mebendazole having been shown to be the best treatment for M. perstans, and DEC being

the most common treatment for M. streptocerca (14).

An issue with the current filariasis treatments for lymphatic filariasis and onchocerciasis

are that they are largely microfilaricidal, with limited adulticidal activity (4, 50). While

these microfilaricidal treatments have been successful in preventing the transmission of

lymphatic filariasis, expensive and prolonged treatments are often required for full

nematode clearance, since adult nematodes continuously generate microfilariae and can

live in affected individuals for 6-8 years (8). However, coendemic regions for filariasis

remain an issue for current filariasis treatments due to therapeutics for one form of

filariasis potentially eliciting severe adverse effects for other forms of filariasis (47-49).

Because of this, coendemic regions have adapted a test and not treat approach for

filariasis. As an example, in certain regions of Central Africa coendemic for

onchocerciasis and loiasis, a test and not treat approach is currently being used for loiasis

due to the potential of post-treatment adverse effects from treating L. loa infected individuals with IVM (51).

As an alternative to targeting filarial nematodes using antihelminthics, studies have

shown the efficacy of using antibiotics targeting the Wolbachia endosymbionts of the

lymphatic filariae and O. volvulus (52). The Wolbachia are a genus of obligate, intracellular bacteria needed by their filarial nematodes host for proper development and

9

reproduction. Doxycycline functions by binding to the 30S ribosomal subunit in bacteria,

inhibiting protein synthesis. Treatment of bancroftian and brugian filariasis with a 4-8 week treatment regime of doxycycline causes long term sterility and has both macrofilaricidal and microfilaricidal effects (53, 54). For onchocerciasis, a 6 week course of doxycycline has been found to have microfilaricidal effects and sterilize female worms

(55, 56). When used in combination with ivermectin, the 6 week doxycycline regimen has been shown to have a greater macrofilaricidal effect compared to IVM alone (57).

The success of doxycycline in the treatment of filariasis has led to an increasing number of anti-Wolbachia drug discovery strategies. The A-WOL Consortium uses mass drug screens to identify potential anti-Wolbachia compounds that show adulticidal activity against the lymphatic filariae and O. volvulus (58). As a result, the A-WOL Consortium currently has multiple candidate anti-Wolbachia therapeutics for filariasis awaiting clinical testing (59-62).

The supergroup classification system

Although other bacterial genera use species designations, Wolbachia endosymbionts are instead currently classified by supergroup designations. The development of the supergroup classification system occurred at a time when 16S rRNA sequence similarity was used as an objective method to assign bacterial species designations (63). However, in the case of bacteria with reductive genomes, the slow evolutionary rate of the 16S rRNA provides inadequate resolution in dividing species based on their unique phenotypes. This was the case in the Wolbachia, in which the application of 16S rRNA species guidelines was unable to divide the Wolbachia based on a key phenotype that arthropod Wolbachia inflict upon their host: cytoplasmic incompatibility (63). While the 10 use of 16S rRNA sequences was able to divide the Wolbachia into broad groupings of isolates, it could not resolve the different CI phenotypes observed in the different strains.

The Wolbachia community elected to instead use the sequence similarity of the faster- evolving, single-copy wsp gene to assign supergroup designations. The first use of the supergroup designations was able to split the arthropod Wolbachia into two groupings, supergroup A and B. At that time, based on CI phenotypes, a host infected with supergroup A Wolbachia would be unable to mate with a host infected with supergroup B

Wolbachia and produce viable offspring. However, today we understand CI to not necessarily be a phylogenetically-informative trait, especially with the ability of arthropod Wolbachia to cross-species boundaries (64). While the supergroup designations are not intended to be analogous to species designations used in other bacterial genera, it provided an alternative way to classify Wolbachia past the genus level.

The continued use of wsp as a phylogenetic marker has revealed the wsp gene to be confounded by significant levels of recombination. Additionally, wsp has been found to have differing rates of selection between arthropod and nematode Wolbachia, indicating that it is unsuitable for Wolbachia phylogenetic analyses. This eventually led to the development of a Wolbachia multilocus sequence typing system (MLST) for the classification of Wolbachia species. By using sequences from five single-copy conserved

Wolbachia genes (ftsZ, gatB, coxA, hcpA, and fbpA), along with wsp as an optional strain marker, the MLST system offered a high-resolution method for typing Wolbachia supergroups while overcoming the single-locus recombination biases from using single gene phylogenetic methods.

11

The current Wolbachia supergroup classification scheme separates the Wolbachia into 17

distinct supergroups. Of these, twelve supergroups exclusively infect arthropods (A, B, E,

G, H, I, K, M, N, O, P, Q), three exclusively infect filarial nematodes (C, D, J), one

exclusively infects plant parasitic nematodes (L), and one that infects both arthropods and

filarial nematodes (F) (65). However, the initial design of the MLST system used

Wolbachia sequences from only 4 of the 16 currently established Wolbachia supergroups

(A, B, D, and F) (66-68) and occurred during a time when the only complete Wolbachia

genomes available were of wMel (69) and wBm (70), indicating a need to update the

current Wolbachia classification system. Despite the plethora of Wolbachia supergroups,

Wolbachia are commonly classified as being arthropod Wolbachia or nematode

Wolbachia where they interact differently with their host in key ways.

Arthropod Wolbachia host interactions

The most studied arthropod Wolbachia phenotypes are all forms of reproductive parasitism, in which the endosymbiont manipulates its host’s reproduction to increase the prevalence of Wolbachia-colonized female hosts to provide themselves a selective advantage (71, 72). The Wolbachia are unique in that they are the only bacteria known to cause all four commonly recognized forms of reproductive parasitism: (a) cytoplasmic incompatibility, (b) male-killing, (c) feminization, and (d) parthenogenesis induction

(72). Cytoplasmic incompatibility (CI) is a phenomenon used to describe an inability to generate viable offspring when a Wolbachia-infected male mate with a non-infected female or a female infected with another Wolbachia strain (71). CI is thought to occur in two steps: modification and rescue. Modification occurs in spermatogenesis, where the presence of intracellular Wolbachia modifies the sperm in the infected male host. Rescue 12

occurs in fertilized eggs, where the presence of the same strain of Wolbachia in infected

female hosts can “rescue” the deleterious modifications in the host sperm, allowing for

the formation of viable offspring. Incompatible crosses due to CI have shown to lead to

the paternal chromosomes being unable to condense in the first mitotic divisions of the

pronucleus, leading to the loss of the paternal chromosomes. This causes stunted

development and the eventual death of the embryo (73, 74). Recently, two genes

transcribed by the Wolbachia endosymbiont of Drosophila melanogaster, wMel, have

been shown to play a major role in CI. The two genes, cifA and cifB, are found in the

eukaryotic association module of the Wolbachia prophage WO. Both the expression of

cifA and cifB has been found to contribute to embryonic lethality in the males of

incompatible crosses while only the expression of cifA has been found to nullify CI- induced embryonic death (75, 76).

Through male-killing and feminization, Wolbachia alter the ratio of their arthropod hosts’ broods, increasing the numbers of the female sex. Compared to CI, which is only caused by bacteria in the Wolbachia and Cardinium genera, male-killing endosymbionts are more diverse and can also be found in the Arsenophonus, Flavobacteria, Rickettsia, and

Spiroplasma (72). Male-killing occurs during larval development, in which the presence

of male-killing endosymbionts in the male larvae causes developmental defects and

eventual death (77). Because of this, insects harboring male-killing endosymbionts lose

approximately half their brood, with the surviving larvae having highly biased female sex

ratios. While male-killing is not necessarily advantageous for the hosts, male-killing is

thought to be advantageous for maternally-transmitted endosymbionts. The decreased

13

number of males increases the fitness of the surviving female hosts due to decreased

levels of antagonistic sibling interactions, inbreeding, and sibling egg consumption (78).

With feminization, Wolbachia facilitates the conversion of genetic males to phenotypic females, leading to the production of all female broods. While in the majority of insects, sex is determined primarily by genotype, arthropods containing feminizing endosymbiont bacteria, such as members of the Wolbachia, Arsenophonus, Cardinium, and Spiroplasma can have their sex altered during larval development (79). Wolbachia-mediated male feminization has been found to continuously occur on genetic males during larval development. While still speculative, Wolbachia are thought to interact with a sex specific molecular mechanism in their arthropod hosts that controls the expression of female-specific phenotypes (80).

Parthenogenesis induction is a phenomenon in which females infected with a parthenogenesis-inducing bacteria, such as Wolbachia, Cardinium, or Rickettsia (72), lay, in the absence of fertilization, diploid eggs that develop into females, with males being very rare or close to absent. Thelytoky, a term describing parthenogenesis in which females are produced from unfertilized eggs, is thought to occur as a post-meiotic modification induced by the Wolbachia of the arthropod host (81). In the insect order

Hymenoptera, thelytoky occurs through a diploidization process known as gamete duplication, in which the first mitotic division of two haploid sets of chromosomes in the embryo fails to separate, leading to the formation of a single diploid nucleus containing two identical sets of chromosomes (81). By causing thelytoky, Wolbachia are able to generate female hosts to populate in the absence of fertilization and males. While

14 thelytoky is beneficial for the Wolbachia, it is thought to be evolutionarily detrimental to the arthropod hosts due to the accumulation of deleterious mutations associated with asexual reproduction.

A common theme among the four reproductive parasitological phenotypes caused by

Wolbachia is a focus on generating female-biased broods. Because Wolbachia are transmitted like mitochondria, from the maternal arthropod to her offspring, the

Wolbachia have little use for the paternal arthropod. Coupled with the ability to induce thelytoky parthenogenesis, female host production and Wolbachia transmission is not necessarily limited by fertilization or the presence of male arthropods. All four forms of sexual manipulation are all thought to be deleterious and therefore parasitic to the

Wolbachia host. However, studies have described beneficial effects of Wolbachia infections such as improving arthropod fecundity, protecting from viral infections, and provisioning for metabolites during nutritional stress periods (81-83).

Nematode Wolbachia host interactions

Compared to arthropod Wolbachia, Wolbachia endosymbionts in nematodes are more restrictive, primarily being observed in nematodes within the family (83).

While in arthropods the status of Wolbachia infections varies within a species, the presence of Wolbachia has been observed to be fixed within most filarial nematodes, indicative of Wolbachia mutualism as opposed to parasitism (84). Similarly, the

Wolbachia within the plant-parasitic nematode Pratylenchus penetrans has been hypothesized to be mutualistic from genomic and phylogenetic analyses (85).

15

The Wolbachia population within a single nematode varies at different life stages. In B.

malayi, the ratio of Wolbachia/nematode DNA has been observed to be the lowest from

the microfilariae through the vector-borne larval stages. Within the first week of infection in the mammalian hosts, Wolbachia numbers increase dramatically as the worm matures to its adult life stage. Additionally, mature adult female worms appear to harbor the greatest number of Wolbachia, as the embryonic larvae in the adult female worms become infected with Wolbachia (86). The periods of rapid growth in Wolbachia as its nematode host matures has led to the hypothesis that Wolbachia has a role in provisioning nutrients or metabolites needed for the metabolically-intensive processes during the later stages of nematode development into the adult life stages (87). Upon antibiotic depletion of Wolbachia with tetracycline, B. malayi growth and embryogenesis

in B. malayi is impaired (4). Additionally, using jird infection models, Wolbachia

depletion of the filarial nematodes, Brugia pahangi, Dirofilaria immitis, and

Litomosoides sigmodontis resulted in filarial infertility and growth retardation, further

indicating that Wolbachia are required for proper development and reproduction in

filarial nematodes (88, 89).

In B. malayi, histological studies have revealed female germ cells in mature adult females

to be heavily infected with Wolbachia while no Wolbachia were observed at any stage in

male germ cells during all of spermatogenesis (90). Upon depletion of Wolbachia in adult male and female worms using tetracycline, no defects were observed in spermatogenesis while over a third of proliferative germline nuclei were lost in females (90). From this, it is hypothesized that Wolbachia may have a role in controlling the proliferation of germline cells in adult female B. malayi by either stabilizing or acting in parallel to the B. 16

malayi effectors that govern female germline replication, function, and proper reproduction, similar to what is observed with CI in insects (91).

Despite Wolbachia being integral to the development of most filarial nematodes, a subset of nematodes, including Acanthocheilonema viteae, Loa loa, and Onchocerca flexuosa do not harbor Wolbachia endosymbionts (92, 93). As such, tetracycline does not adversely affect the development or reproduction of Wolbachia-free nematodes. However, molecular studies of both A. viteae and O. flexuosa have identified Wolbachia-like sequences integrated into the genomes of both species through lateral gene transfer events

(94). Of these lateral gene transfer events, approximately half are transcribed from annotated pseudogenes (94), This indicates that despite the absence of Wolbachia endosymbionts, A. viteae and O. flexuosa could still be dependent on Wolbachia-derived sequences for proper development. However, such events in L. loa have not been described (93).

Leveraging Wolbachia dependence for infectious disease control

Wolbachia endosymbionts have been exploited in several ways to combat infectious diseases. The large-scale release of male Aedes aegypti mosquitoes infected with the wMel strain of Wolbachia has been proposed as a method to suppress native mosquito populations, specifically in Chikungunya, Dengue, and Zika endemic regions (95, 96).

The release of specifically male, wMel-infected mosquitoes leverages the CI phenotype induced by the Wolbachia on its host. As no viable offspring will be produced in crosses between the male, wMel-infected mosquitoes and virus-infected, wMel-uninfected, female mosquitoes, this functions as a method of limiting the transmission of viral

17

infection (97). Additionally, the release of both male and female wMel-infected mosquitoes can also limit viral transmision due to the observation that Wolbachia- infected mosquitoes have a reduced ability to transmit viruses (98-103). Additionally, the release of both male and female Wolbachia-infected mosquitoes suppresses viral

transmission without drastically altering mosquito populations. Wolbachia-infected A. aegypti mosquitoes have been released in over 12 countries (104-106) by the World

Mosquito Program in an attempt to control A. aegypti-transmitted viruses.

Because of the dependence of filarial nematodes on their Wolbachia endosymbionts, a common drug development strategy against filarial diseases such as lymphatic filariasis or onchocerciasis is to target the Wolbachia endosymbiont of the nematode host (83).

Although antihelminthic therapies are available, current drugs have primarily microfilaricidal activity against filarial nematodes. For lymphatic filariasis, multiple studies have established that doxycycline and rifampicin cause inhibition of embryogenesis, stunted development, and infertility, and macrofilaricidal activity though

Wolbachia depletion (83). Additionally, a three to eight-week course of doxycycline and rifampicin has been shown to have macrofilaricidal activity on B. malayi, O. volvulus, and W. bancrofti in humans (54, 55, 57, 107, 108). Despite this, doxycycline is contra- indicated for pregnant or breast-feeding woman and children under eight years of age and combined with the possible emergence of drug resistance, additional therapies are needed for the treatment of lymphatic filariasis (4). Currently, the A-WOL Consortium has identified multiple prospective anti-Wolbachia therapeutics for lymphatic filariasis through mass drug screens for anti-Wolbachia compounds (58-62).

18

‘Omics-based apparoches to filarial nematodes and their Wolbachia endosymbionts

In 2007, the draft genome assembly of B. malayi genome enabled the prediction of drug

targets by assessing the gene repertoire of a filarial nematode (109). Combined with the complete genome assembly of the Wolbachia endosymbiont of B. malayi, wBm, released two years prior, an increasing number of studies used genome-based predictions to facilitate filarial drug and vaccine development for lymphatic filariasis (70, 110). By

assessing the gene content of B. malayi and wBm, the mechanisms outlining the codependency of the two organisms could be elucidated. Based on its gene content, it has been proposed that B. malayi is dependent on its Wolbachia endosymbiont for

glutathione, heme, and riboflavin due to the lack of any such biosynthesis genes in the B.

malayi genome. Similarly, the wBm genome lacks genes coding for the complete

pathways for the de novo biosynthesis of coenzyme A, biotin, lipoic acid, ubiquinone,

folate, and pyridoxal phosphate, indicating that these vitamins and cofactors must be

obtained from the nematode host (70). Similarly, genome sequencing of O. volvulus and its Wolbachia endosymbiont, wOv, has revealed O. volvulus to be dependent on its

Wolbachia endosymbiont for fatty acid metabolism, nucleotide metabolism, and heme

biosynthesis (111). However, a comparison of the genomes of the Wolbachia-free L. loa

and other filarial nematodes reveals that vitamin B6 synthesis and salvage is the only

different metabolic pathway in L. loa, possibly indicating that the role of Wolbachia in

filarial nematodes could be more subtle than previously realized (93).

Genomic-based tools provided a new strategy for the development of drug targets for filarial nematodes (112, 113). Through comparative genomics with the free-living nematode Caenorhabditis elegans, orthologs were identified between C. elegans and B. 19

malayi. Because vital genes in C. elegans are well documented from genome-wide RNAi

screens, orthology mapping can be used to identify vital genes in B. malayi as potential drug targets (114). Additionally, genomic screens of Wolbachia genomes have provided potential targets for anti-Wolbachia drug development studies. By screening for expressed genes in Wolbachia that have little homology with mammalian genes, active genes encoding for phosphoglycerate mutase and pyruvate phosphate dikinase were identified as potential target candidates for wBm (115, 116).

Transcriptomics-based approaches have also been frequently used to identify potential drug target candidates based on expression across the nematode life stage. Highly and differentially expressed genes coinciding with time points of interest, such as during specific larval molts, serve as promising novel drug targets. Multiple transcriptomics studies assessing the transcriptome of B. malayi and its Wolbachia endosymbionts have been published and deposited in repositories for the identification of novel drug targets

(117, 118). Both studies have independently uncovered a set of male-specific protein kinases that could be targeted by repurposing existing kinase inhibitors (119). A transcriptomics study of O. volvulus has similarly uncovered kinases specifically expressed in adult males alongside sets of stage-specific G-protein coupled receptors

(GPCRs) that could potentially be targeted (120). Transcriptomics studies have been paired with proteomics datasets to identify potential infection biomarkers for improved diagnostic tools. For O. volvulus, mass spectrometry data sets of the O. volvulus secretome was paired with transcriptomics and immunomics approaches to identify novel biomarkers for O. volvulus infection (120).

20

‘Omics studies on filarial nematodes have also attempted to identify potential vaccine targets. Instead of using irradiated larvae, transcriptomics and proteomics data have been used to identify targets that are expressed heavily in the (a) infective vector L3 life stage of the nematode to prevent incoming infections or (b) microfilariae, in order to block transmission and reduce pathology (121). Proteomics studies aimed at assessing the abundance of proteins secreted by B. malayi have identified 11 small transthyretin-like proteins, which have also been identified as potential vaccine candidates against other helminth diseases (122). Numerous metabolites have been proposed as potential vaccine candidates for filariasis, including the fatty acid and retinol binding proteins in B. malayi and O. volvulus (123). Additionally, another proteomics study has identified 27 potential vaccine candidates on the surface of cells in the B. malayi digestive tract that can potentially interact with host antibodies. The 27 candidate proteins have been identified to have high homology to W. bancrofti and O. volvulus, indicating that these candidate proteins could potentially be protective to multiple types of filariasis (124). Another O. volvulus proteomics study has identified key metabolites and potential biomarkers using mass spectrometry data from O. volvulus-infected individuals (125). In a combined transcriptomics and proteomics O. volvulus study, six O. surface volvulus proteins identified and expressed in the infective L3 larvae stage have been identified as potential novel vaccine candidates. For L. loa, transcriptomics and proteomics approaches have been used to identify additional L. loa biomarkers (126, 127).

Concluding remarks

While the current treatment options available for filariasis have been shown to be effective in reducing disease transmission rates, they are lacking in adulticidal activity. 21

The persistence of adult worms in infected individuals requires prolonged and expensive

treatment periods that can be improved. Additionally, antibiotic resistance has started to

emerge in filarial nematodes, with there being reports of DEC and IVM resistance in W.

bancrofti and O. volvulus (128, 129). The deficits in the current treatment options for filariasis calls for studies to identify additional filarial therapeutics.

Although several transcriptomic studies (117, 118) have analyzed the transcriptome of B. malayi, wBm, and its laboratory vector host A. aegypti, no comprehensive study has simultaneously assessed the transcriptome of all three organisms across the entirety of the

B. malayi life cycle. One of the primary reasons is the inability to recover a sufficient number of reads from B. malayi in its early vector life stages and wBm in all vector B. malayi life stages. Additionally, for the Wolbachia life stages analyzed in previous studies, the number of reads recovered is lacking, with samples having between ~14,000-

67,000 reads mapping to protein-coding genes (118).

However, in order to generate a comprehensive transcriptomics data sets that captures the biological interplay between the different organisms in lymphatic filariasis, novel technologies need to be developed, validated and adapted. For the mammalian life stages of B. malayi, the transcriptome of two organisms—B. malayi and wBm—must be adequately captured from a single sample while for the vector stages, the transcriptome of three organisms—B. malayi, wBm, and A. aegypti—must be captured. This proves

difficult when the wBm reads amount to <5% of sequenced reads in some mammalian

samples and <0.1% of reads in the vector samples. The Agilent SureSelect (AgSS)

platform serves as hybridization-based enrichment method that uses specific baits to

22

target transcripts of interest in RNA-Seq experiments. By designing sequence-specific baits for B. malayi or wBm, it becomes possible to extract transcripts of interest, such as mRNAs, while discarding unwanted RNA types. However, the AgSS capture needs to be validated in its efficacy and to see whether it imparts a bias on the recovered transcriptome relative to the standard RNA-Seq enrichments.

Additionally, the currently available quantification tools for RNA-Seq analyses are lacking in that they have been designed and tested using human data sets (130-134), a system in which full-length transcript sequences are known. In most prokaryotic systems, only coding sequences of transcripts are known due to the laborious nature of the laboratory techniques required to identify the untranslated regions in transcripts. In prokaryotes, genes are often transcribed together in transcriptional units known as operons, with the >5000 genes in the genome being transcribed in 630-

700 estimated operons (135). Because prokaryotic genomes lack full length transcript annotations, operons are often not annotated leading to each operonic gene being considered an individual transcriptional unit. This becomes problematic when reads are present that overlap multiple genes and cannot be unambiguously assigned. These problems become exacerbated in smaller, more reductive genomes such as those observed in Wolbachia in which it becomes difficult to differentiate between genes that are transcribed together versus genes in close proximity.

The application of the AgSS platform and development of a prokaryotic-specific quantification tool will aid in carrying out and improving upon existing transcriptomic studies for Brugian filariasis. The data set will serve as a comprehensive resource for the

23 community to search for genes upregulation at specific life stages for the development of potential biomarkers, therapeutics, or vaccine candidates. Additionally, we sequenced the

Wolbachia endosymbiont of Wuchereria bancrofti, wWb, and with comparative genomics analyses we can identify shared, core genes between wWb and wBm. We also address the current issues with the Wolbachia supergroup taxonomic scheme. We plan to adapt species designations to the Wolbachia to bring Wolbachia in line with the rest of bacterial taxonomy. My objectives for my thesis are as follows:

1) Develop and validate tools for prokaryotic and multi-species RNA-Seq analyses

(Chapters 2 and 3).

2) Characterize the transcriptome of Brugia malayi, its Wolbachia endosymbiont

wBm, and its laboratory vector host A. aegypti to identify potential drug targets

for lymphatic filariasis (Chapter 4).

3) Use comparative genomics to analyze the compare the Wolbachia endosymbionts

of the Wuchereria bancrofti and Brugia malayi. (Chapter 5)

4) Address issues with the current Wolbachia taxonomic scheme (Chapters 6 and 7)

Authors’ Contributions

M.C. wrote the manuscript. M.C. and J.C.D.H. edited the manuscript.

24

Chapter 2: Targeted enrichment outperforms other enrichment techniques and enables more multi-species RNA-Seq analyses 1

Abstract

Enrichment methodologies enable the analysis of minor members that are present in low

abundance in multi-species transcriptomic data. We qualitatively compared the standard

enrichment of bacterial and eukaryotic mRNA to a targeted enrichment using an Agilent

SureSelect (AgSS) capture for Brugia malayi, Aspergillus fumigatus, and the Wolbachia

endosymbiont of B. malayi (wBm). Without introducing significant systematic bias, the

AgSS quantitatively enriched samples, resulting in more reads mapping to the target

organism. The AgSS-enriched libraries consistently had a positive linear correlation with

their unenriched counterparts (r2 = 0.559–0.867). Up to a 2,242-fold enrichment of RNA

from the target organism was obtained following a power law (r2 = 0.90), with the

greatest fold enrichment achieved in samples with the largest ratio difference between the

major and minor members. While using a single total library for prokaryote and

eukaryote enrichment from a single RNA sample could be beneficial for samples where

RNA is limiting, we observed a decrease in reads mapping to protein coding genes and an

increase in multi-mapping reads to rRNAs in AgSS enrichments from eukaryotic total

RNA libraries compared to eukaryotic poly(A)-enriched libraries. Our results support a

recommendation of using AgSS targeted enrichment on poly(A)-enriched libraries for

eukaryotic captures, and total RNA libraries for prokaryotic captures, to increase the

robustness of multi-species transcriptomic studies.

1 Chung M, Teigen L, Liu H, et al. Targeted enrichment outperforms other enrichment techniques and enables more multi-species RNA-Seq analyses. Sci Rep. 2018;8(1):13377. 25

Introduction

Dual species transcriptomic experiments have been increasingly employed as a method to

analyze the transcriptomes of multiple species within a system (118, 136-143). However, obtaining a sufficient quantity of reads from each organism is a major issue when simultaneously analyzing the transcriptomes of multiple organisms within a single sample. When extracting total RNA from a host sample, the signal from host transcripts typically overwhelms the signal from secondary organism transcripts under most biologically meaningful conditions (144, 145). Differential enrichments have been designed to physically extract RNA from secondary organisms in samples by depleting highly abundant rRNAs and selecting transcripts based on differing properties between the two organisms, such as the differential poly-adenylation status of transcripts in the case of samples containing a mixture of eukaryotic and prokaryotic RNA (146-149).

However, in eukaryote-eukaryote dual species transcriptomics experiments, these differences are usually not present to be exploited. Additionally, methods used to enrich for prokaryotic RNA from samples dominated by eukaryotic RNA can fail when the poly(A)-depletion method is unable to remove eukaryotic long non-coding RNAs that lack 3′-polyadenylatation, and when the efficacy of rRNA depletion is low, as observed with some organisms (146).

The Agilent SureSelect (AgSS) platform serves as a hybridization-based enrichment method that uses specific baits to target transcripts of interest

(www.genomics.agilent.com). By designing sequence-specific baits for an organism, it becomes possible to extract transcripts of interest, such as mRNA, while avoiding unwanted RNA types, such as rRNA and tRNA. In dual species transcriptomics, one of 26

the main advantages of the AgSS system is the ability to extract reads in systems where

an organism has low relative abundance (150). To test the efficacy of the AgSS platform on dual- and tripartite-species systems where insufficient numbers of reads are obtained from the minor, or lowly abundant, member(s), we designed AgSS baits for: (1) Brugia malayi, a filarial nematode and the causative agent of lymphatic filariasis, (2) wBm, the obligate mutualistic Wolbachia endosymbiont of B. malayi, and (3) Aspergillus fumigatus

AF293, a fungal pathogen known to cause aspergillosis in immunocompromised

individuals.

In filarial nematode transcriptomics, it is relatively straightforward to isolate nematode

samples from the mammalian/definitive host while minimizing contaminating host

material. Therefore, the Brugia transcriptome can be easily obtained from poly(A)- enriched RNA from Brugia samples originating from the mammalian host, which is often an experimentally infected gerbil (Meriones unguiculatus). However, sampling the

Wolbachia endosymbiont transcriptome in these worms is problematic because the endosymbionts contribute less RNA to the total RNA pool. Therefore, wBm reads are overwhelmed by B. malayi reads and must be enriched using a rRNA- and poly(A)- depletion (146), which can still result in an insufficient number of reads being obtained.

In the mosquito vector host, the parasitic worms develop in the thoracic muscles where they are not easily isolated. Therefore, whole infected mosquito thoraces containing the larval B. malayi are used for RNA isolation (117). As such the B. malayi reads are overwhelmed by reads from those of the mosquito vector, typically experimentally infected Aedes aegypti, and the reads of the endogenous wBm are even further dwarfed compared to samples obtained from the definitive host (151). In a separate dual-species 27

transcriptomics experiment examining A fumigatus infections, the eukaryotic A.

fumigatus reads are overwhelmed by reads from the eukaryotic human or mouse host

making it difficult to analyze the fungal transcriptome in this host-pathogen interaction.

Given that both are eukaryotes there are not differences in transcript characteristics like

polyadenylation to enable the enrichment of the mRNA from the minor member,

specifically A. fumigatus.

For each eukaryotic system, the performance of the poly(A)-enrichment was compared to

that of poly(A)-enrichment supplemented with the AgSS platform in extracting B. malayi and A. fumigatus reads. Additionally, the efficacy of the AgSS platform in extracting prokaryotic reads was determined by comparing the total RNA AgSS capture to the

RiboZero-treated, poly(A)-depletion method in extracting wBm reads.

Methods

Mosquito preparation

Aedes aegypti black-eyed Liverpool strain mosquitoes were obtained from the

NIH/NIAID Filariasis Research Reagent Resource Center (FR3) and maintained in the biosafety level 2 insectary at the University of Wisconsin Oshkosh (UWO). Desiccated mosquito eggs were hatched in deoxygenated water and the larvae maintained on a slurry of ground TetraMin fish food (Blacksburg, VA, USA) at 27 °C and 80% relative humidity. Female pupae were separated from males using a commercial larval pupal separator (The John Hock Company, Gainesville, FL, USA) and maintained on cotton pads soaked in sucrose solution. Adult female mosquitos were deprived of sucrose ~8 h prior to blood feeding. Mosquitoes were infected with the B. malayi FR3 strain by

28

feeding on microfilaremic cat blood (FR3) through parafilm via a glass jacketed artificial

membrane feeder. Microfilaremic cat blood was diluted using uninfected rabbit blood to

achieve a suitable parasite density for infection (100–250 mf (microfilariae)/20 µL) with mosquitoes being allowed to feed to repletion. Mosquitoes were maintained in insect incubators until time of worm harvest.

Nematode preparation

To examine larval development in the vector, groups of mosquitoes were sampled at 18 h post infection (hpi), 4 days post infection (dpi), and 8 dpi. Because larval development occurs in the thoracic muscle of the mosquito, thoraces of infected mosquitoes containing larval B. malayi were (1) separated from the head, abdomen, legs, and wings, (2) flash frozen in liquid nitrogen, (3) and stored at −80 °C prior to RNA isolation. To generate third stage larvae (L3) of B. malayi, infected mosquitoes at 9–16 dpi were processed in bulk using the NIAID/NIH Filariasis Research Reagent Resource Center (FR3) Research

Protocol 8.4 (www.filariasiscenter.org). Larvae were isolated in RPMI media containing

0.4 U penicillin and 4 µg streptomycin per mL (RPMI + P/S), flash frozen in liquid nitrogen, and stored at −80 °C. To generate fourth larval stages (L4) and adult worms, freshly isolated L3s were injected into the peritoneal cavities of Mongolian gerbils

(Meriones unguiculatus). Briefly, male Mongolian gerbils three months of age or older

(Charles River, Wilmington, MA, USA) were anesthetized with 5% isoflurane, immobilized on a thermal support, and administered ocular Paralube. The inguinal region for each gerbil was shaved and disinfected with iodine. Infections were performed by delivering L3 larvae into the peritoneal cavity using a butterfly catheter, which was left in place and flushed afterwards with 1 mL warmed RPMI + P/S to ensure delivery of all 29

larvae. Afterwards, gerbils were removed from the plane of anesthesia and allowed to

fully recover in a hospital cage prior to returning to group housing. Worms were later

harvested by euthanizing the gerbils and soaking the peritoneal cavities in warm

RPMI + P/S. Worms were washed in warm RPMI + P/S to remove traces of gerbil tissue,

flash-frozen with liquid nitrogen, and stored at −80 °C. All animal care and use protocols were carried out in accordance with the relevant guidelines and regulations and were approved by the UWO IACUC.

Mosquito/Nematode/Wolbachia RNA isolation

Mosquito thoraces were combined with TRIzol (Zymo Research, Irvine, CA, USA) at a ratio of 1 mL TRIzol per 50–100 mg mosquito tissue while nematode samples were processed using a 3:1 volume ratio of TRIzol to sample. β-mercaptoethanol was added to a final concentration of 0.1%. The tissues were homogenized in a TissueLyser (Qiagen,

Germantown, MD) at 50 Hz for 5 min. The homogenate was transferred to a new tube and centrifuged at 12,000 × g for 10 min at 4 °C. After incubating at room temperature for 5 min, 0.2 volumes of chloroform were added. The samples were shaken by hand for

15 s, incubated at room temperature for 3 min, then loaded into a pre-spun, phase lock gel heavy tube (5Prime, Gaithersburg, MD, USA) and centrifuged for 5 min at 12,000 × g at

4 °C. The upper phase was removed to a new tube and one volume of 100% ethanol was added prior to loading onto a PureLink RNA Mini column (Ambion, Austin, TX). The samples were then processed following manufacturer instructions, quantified using a

Qubit fluorometer (Qiagen, Germantown, MD, USA) and/or a NanoDrop spectrometer

(NanoDrop, Wilmington, DE, USA). The RNA was subsequently treated with the

30

TURBO DNA-free kit (Ambion, ThermoFisher Scientific, Waltham, MA, USA)

according to the manufacturer’s protocol.

A. fumigatus infection

Two different models of invasive aspergillosis, both using male BALB/c mice, were

employed (152, 153). In the non-neutropenic model, the mice were immunosuppressed

with 5 doses of cortisone acetate, with 500 mg/kg administered subcutaneously every

other day, starting four days pre-infection. In the neutropenic model, the mice were

administered cyclophosphamide, 250 mg/kg intraperitoneally, and cortisone acetate,

250 mg/kg subcutaneously, two days pre-infection. Four days post infection, the mice

were given a second dose of cyclophosphamide, 200 mg/kg intraperitoneally, and

cortisone acetate, 250 mg/kg subcutaneously. All mice were infected by placing them in a

chamber containing an aerosol of A. fumigatus conidia for 1 h. On days 2 and 4 for the

non-neutropenic model and on days 4 and 7 for the neutropenic model, 3 mice per time

point were sacrificed, after which their lungs were harvested and stored in RNAlater

(Ambion, ThermoFisher Scientific, Waltham, MA, USA) for subsequent RNA isolation.

A. fumigatus RNA isolation

Each lung sample was placed in a lysing matrix C tube (MP Biomedicals, Santa Ana, CA,

USA) with a single 0.25 inch diameter ceramic sphere (MP Biomedicals, Santa Ana, CA,

USA) and homogenized with a bead beater (FastPrep FP120, Qbiogene, Montreal,

Quebec, Canada).The RNA was isolated using the RiboPure kit (Ambion, ThermoFisher

Scientific, Waltham, MA, USA) following the manufacturer’s instructions, treated with

31

DNase I (ThermoFisher Scientific, Waltham, MA, USA), and purified using the RNA

clean & concentrator kit (Zymo Research, Irvine, CA, USA).

Illumina NEBNext Ultra Directional RNA libraries without capture

Whole transcriptome libraries were constructed for sequencing on the Illumina platform using the NEBNext Ultra Directional RNA Library Prep Kit (New England Biolabs,

Ipswich, MA, USA). When targeting eukaryotic mRNA, polyadenylated RNA was isolated using the NEBNext Poly(A) mRNA magnetic isolation module. When targeting bacterial mRNA, samples underwent rRNA- and poly(A)-reductions, as previously described (146, 149). SPRIselect reagent (Beckman Coulter Genomics, Danvers, MA,

USA) was used for cDNA purification between enzymatic reactions and size selection.

For indexing, the PCR amplification step was performed with primers containing a 7-nt

index sequence. Libraries were evaluated using the GX touch capillary electrophoresis

system (Perkin Elmer, Waltham, MA) and sequenced on a HiSeq2500, generating 100-bp

paired end reads.

AgSS probe design

Capture probes were designed using the SureSelect DNA Advanced Design Wizard for

B. malayi, A. fumigatus, and wBm. Probes were designed to capture every 120 bp for

each coding sequence in each of the three organisms with no overlap to ensure the entire

coding sequence was covered. DustMasker was used to identify and mask low

complexity regions of the genome from the probe design (154). A total of 167,997 probes were designed for B. malayi to capture 11,085 genes, ranging from 1–540 probes/gene with an average of 15.3 probes/gene. For A. fumigatus, 126,394 probes were designed to

32 capture 9,835 genes, ranging from 1–213 probes/gene with an average of 12.9 probes/gene. For wBm, 6,393 probes were designed with 1–72 probes/gene and an average of 8.0 probes/gene. The baits were not tested for specificity against the host genome.

Wolbachia AgSS RNA capture

Pre-capture libraries were constructed from 500–1000 ng of total RNA samples using the

NEBNext Ultra Directional RNA Library Prep kit (NEB, Ipswich, MA, USA). First strand cDNA was synthesized without mRNA selection to retain non-polyadenylated transcripts and was fragmented at 94 °C for 8 min. After adaptor ligation, cDNA fragments were amplified with 10 cycles of PCR before capture. Wolbachia transcripts were captured from 200 ng of the amplified libraries using an Agilent SureSelectXT

RNA (0.5–2 Mbp) bait library designed specifically for wBm. Library-bait hybridization reactions were incubated at 65 °C for 24 h then bound to MyOne Streptavidin T1 dynabeads (Invitrogen, Carlsbad, CA, USA). After multiple washes, bead-bound captured library fragments were amplified with 18 cycles of PCR. The libraries were loaded on a

HiSeq4000 generating 151-bp paired end reads.

B. malayi AgSS RNA capture

Pre-capture libraries were constructed from 1000 ng of total RNA samples using

NEBNext Ultra Directional RNA Library Prep kit (NEB #E7420, Ipswich, MA, USA).

Except where noted in the text, poly(A)-enrichment according to the manufacturer’s protocol was used. After adaptor ligation, cDNA fragments were amplified for 10 cycles of PCR before capture. B. malayi transcripts were captured from 200 ng of the amplified

33

libraries using an Agilent SureSelectXT Custom (12–24 Mbp) bait library designed

specifically for B. malayi. Library-bait hybridization reactions were incubated at 65 °C for 24 h then bound to MyOne Streptavidin T1 Dynabeads (Invitrogen, Carlsbad, CA).

After three rounds of washes, bead-bound captured library fragments were amplified with

16 cycles of PCR. The libraries were loaded on a HiSeq4000 generating 151-bp paired

end reads. In comparisons describing captures on poly(A)-enriched libraries with total

RNA libraries, total RNA libraries were constructed after first strand cDNA was

synthesized without mRNA extraction to retain non-polyadenylated transcripts and

fragmented at 94 °C for 8 min.

A. fumigatus AgSS capture

Pre-capture libraries were constructed from 700 ng of total RNA samples using NEBNext

Ultra Directional RNA Library Prep kit (NEB #E7420, Ipswich, MA, USA) according to

the manufacturer’s protocol. Briefly, mRNA was extracted with oligo-d(T) beads and

reverse-transcribed into first strand cDNA with random primers. First strand cDNA was

fragmented at 94 °C for 8 min. After adaptor ligation, cDNA fragments were amplified in

a thermocycler for 10 cycles before capture. Aspergillus transcripts were captured from

100 ng of the amplified libraries using an Agilent SureSelectXT Custom (12–24 Mbp)

bait library for A. fumigatus. Libraries were incubated at 65 °C for 24 h then bound to

MyOne Streptavidin T1 Dynabeads (Invitrogen, Carlsbad, CA, USA). After three rounds

of washes, bead-bound captured library fragments were amplified with 16 cycles of PCR.

The libraries were loaded on a HiSeq4000 generating 151-bp paired end reads.

34

Sequence alignment, feature counts, and fold enrichment calculations

The sequence reads originating from all samples used in all B. malayi and A. fumigatus

AF293 comparisons were mapped to the B. malayi genome WS259 (www.wormbase.org) or A. fumigatus AF293 genome CADRE34 (www.aspergillusgenome.org), respectively, using the TopHat v1.4 aligner (155). Read counts for all genes were obtained using

HTSeq (v0.5.3p9) (130), with the mode set to “union.” For the B. malayi genome,

RNAmmer v1.2 (156) identified 5 protein-coding genes (WBGene00228061,

WBGene00268654, WBGene00268655, WBGene00268656, WBGene00268657) overlapping with predicted rRNAs. Reads overlapping with these genes were excluded from all gene-based analyses.

The sequencing reads for the wBm comparisons were mapped to the wBm assembly (70) using Bowtie v0.12.9 (157). Counts for each wBm gene were calculated by summing the sequencing depth per base pair, obtained using the DEPTH function of SAMtools v1.1

(158), for each of the unique genomic positions in each gene. This value was then divided by the average read length for the sample, yielding a read count value for each gene.

Across all comparisons, read counts were normalized between each of the different samples through a conversion to transcripts per million (TPM).

The fold enrichment, Fe, of reads conferred by the AgSS to its target organism compared to the other enrichment techniques was calculated as

= 𝑀𝑀𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 � 𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴( �) 𝑒𝑒 𝑆𝑆 𝐹𝐹 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝( 𝐴𝐴) 𝑀𝑀 � 𝑆𝑆𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝐴𝐴 � 35 where the ratio of the number of reads mapped with AgSS capture (MAgSS) to the number of reads sequenced with AgSS capture (SAgSS) is divided by the ratio of the number of reads mapped with poly(A)-enrichment or the rRNA-, poly(A)-depletion (Mpoly (A)) to the number of reads sequenced with poly(A)-enrichment or the rRNA-, poly(A)-depletion

(Spoly (A)). By taking a ratio of the percentage of mapped reads for each enrichment technique, the fold enrichment value represents an amount of enrichment conferred by the AgSS when compared to alternative enrichment technique for a given sample. The fold enrichment values when comparing the efficacy of AgSS capture on poly(A)- enriched libraries to total RNA libraries were calculated similarly as

( ) = / ( ) 𝑀𝑀𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 𝑀𝑀𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝐴𝐴 𝐹𝐹𝑒𝑒 � � � � 𝑆𝑆𝐴𝐴𝐴𝐴𝐴𝐴𝐴𝐴 𝑆𝑆𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝐴𝐴 where MAgSS−poly(A) and SAgSS−poly(A) represent the mapped and sequenced reads from poly(A)-enriched AgSS capture libraries, while MAgSS−totalRNA and SAgSS−totalRNA represent the mapped and sequencing reads from AgSS capture from the total RNA libraries.

Results

Agilent SureSelect probe G + C content and specificity

The G + C content of the B. malayi AgSS probes, our largest set of designed probes, was assessed to determine whether the G + C content of a probe impacted its ability to capture its target gene. For each individual gene in the B. malayi genome, the G + C content of its set of probes ranged from 19.2–66.2%, with an average G + C content of 38.6% (Figure

2.1). For each G + C content, in increments of 1%, the average ratio of the AgSS TPM to the poly(A)-selected TPM average ratio of all genes with probe sets of that specified

36

Figure 2.1: G+C content of Brugia malayi AgSS probes For each gene in the B. malayi genome, the average G+C content for its set of probes was determined and plotted on a histogram. The average G+C content of the B. malayi probe sets is 38.6%, with a range of 19.2-66.2%.

G + C content were calculated and compared. Between 26–55%, there were >10 genes

with probe sets at each of the G + C increments. At these percentages and across six

different samples, the relationship between the AgSS TPM to poly(A)-selected TPM remained constant, indicating that the ability of a probe set to capture its gene set is not largely affected by G + C content (Figure 2.2). For B. malayi, we assessed the number of invalid probes, defined as probes whose intended capture positions had a larger read count in the poly(A)-selected sample relative to its AgSS counterpart. Across the six B. malayi samples, <2,004 probes were invalid, which accounts for ~1.2% of the 167,997 designed probes for B. malayi.

Comparison of poly(A)-enriched AgSS to only poly(A)-enriched libraries for B. malayi

Poly(A)-enriched, AgSS libraries were compared to only poly(A)-enriched libraries from

six RNA samples originating from three different time points of the B. malayi vector life

37

Figure 2.2: The impact of G+C content on capturing Brugia malayi genes At increments of 1% G+C, probe sets for each gene were binned to a specific G+C content. Points colored in red, between 26-55% G+C content, indicate ≥10 genes have probe sets at that specific G+C content. Error bars indicate standard error of the mean for each G+C content. For any G+C content, the average ratio of the log2 AgSS TPM to the poly(A)-selected TPM for each protein-coding is calculated. Because 26-55% G+C content, the capture ability of a probe set is independent of its G+C content, as seen by the linear relationship between these points. stages (18 hpi, 4 dpi, and 8 dpi) (Figure 2.3A). Across these six samples, the poly(A)- enrichment alone yielded 0.38–23.35% of reads mapping to the B. malayi genome, with 38

Figure 2.3: Sample and library preparation. This schematic illustrates the sample (auburn rectangle) and library (blue rectangle) preparation workflow to generate the libraries that were loaded on the Illumina sequencer. (A) For B. malayi and A. fumigatus, a poly(A)-selected sample was created from an aliquot of total RNA that was used to create a poly(A)-selected library. (B) The B. malayi or A. fumigatus AgSS baits were subsequently used to capture the targeted RNA from poly(A)-selected libraries. (C) For AgSS-enriched wBm libraries, an RNA library was constructed from an aliquot of total RNA that underwent targeted enrichment with the Wolbachia AgSS baits. Unlike the eukaryotic enrichments, the bacterial AgSS capture is performed on total RNA. For a limited number of libraries described in the text, an RNA library was constructed from an aliquot of total RNA (i.e. without poly(A)-enrichment) that underwent targeted enrichment with the Brugia AgSS baits. (D) For poly(A)/rRNA-depleted libraries enriched for wBm, an aliquot of total RNA from either mosquito thoraces or adult nematodes was enriched for bacterial mRNA by removing Gram-negative and human rRNAs with two RiboZero removal kits and polyadenylated RNAs with DynaBeads.

61.25–82.51% of reads mapping to the A. aegypti genome. Of the poly(A)-enriched libraries captured using the AgSS baits, 56.14–81.52% of reads mapped to the B. malayi genome, with 2.28–24.02% of reads mapping to the A. aegypti genome (Figure 2.4,

Table 2.1). The fold enrichment conferred to libraries captured with AgSS was found to be inversely proportional to the number of reads mapped to its poly(A)-enrichment only

39

Figure 2.4: Genes identified only in either the poly(A)-enriched AgSS or poly(A)- enrichment libraries from B. malayi samples. For six B. malayi transcriptome comparisons, Venn diagrams were generated to display the number of genes able to be extracted using the poly(A)-enriched AgSS (orange), the poly(A)-enrichment (green), or both (blue). The numbers in parentheses indicate the proportion of total protein-coding genes identified in each gene subset. counterpart, such that experiments with the fewest reads mapped in the poly(A)- enrichment had the highest fold enrichment with the AgSS capture. The 18 hpi samples, which have the lowest percentage of mapped reads to the B. malayi genome has the greatest fold enrichment (110–146x) while the 8 dpi samples, with the highest percentage of B. malayi mapped reads, has the lowest fold enrichment (3–4x).

A principal component analysis (P CA) using the log2-TPM values for the 11,085 predicted B. malayi protein-coding genes reveals that the B. malayi samples cluster based

40

4 (81.52%) 10,821,489 98 (0.88%) 17 (0.15%) 166 (1.5%) 255,299,760 116,515,442 620 (5.59%) 597 (5.39%) 313,191,594 146,982,974 8 dpi 2 replicate 8,642,083 (2.76%) 28,214,113 (19.20%) 95,185,530 (64.76%) B. malayi

3 (0.95%) 12,178,223 93,452,941 34 (0.31%) 105 562 (5.07%) 533 (4.81%) 168 (1.52%) 251,625,132 131,311,710 8 1 replicate dpi 5,744,265 (2.28%) 80,432,339 (61.25%) 30,655,896 (23.35%) 201,634,043 (80.13%)

11 3,239,521 52,035,411 96 (0.87%) 32 (0.29%) 951 (8.58%) 943 (8.51%) 142 (1.28%) 138,602,654 113,590,496 4 dpi 2 replicate 5,881,837 (4.24%) 7,842,585 (6.90%) 86,735,409 (76.36%) 109,847,909 (79.25%)

enrichment libraries from

-

14 2,722,556 61,264,753 81 (0.73%) 30 (0.27%) 960 (8.66%) 147 (1.33%) 163,852,962 117,570,910 1023 (9.23%) 4 dpi 1 replicate 7,619,199 (4.65%) 6,594,734 (5.61%) 91,427,308 (77.76%) 128,370,529 (78.34%)

110 212,377 6 (0.05%) 27,429,885 65 (0.59%) 97,896,288 120 (1.08%) 100,358,314 1916 (17.28%) 2115 (19.08%) 542,732 (0.54%) 18 hpi 2 replicate 58,274,601 (59.53%) 18,052,270 (18.44%) 82,803,823 (82.51%)

enriched AgSS and poly(A) and AgSS enriched

-

146 143,097 1 (0.01%) 23,788,380 84 (0.76%) 90,868,184 103 (0.93%) 103,949,620 2804 (25.3%) 2235 (20.16%) 399,638 (0.38%) 18 hpi 1 replicate 51,010,120 (56.14%) 21,827,300 (24.02%) 85,698,901 (82.44%)

enriched, - - enriched, enrichment - -

enrichment) enriched, Agilent - - enriched, Agilent enrichment) enriched, Agilent enrichment) - -

- - (poly(A) Seq statistics for poly(A) for statistics Seq enriched, Agilent SureSelect) enrichment) - - - reads

(poly(A) genes (poly(A) genes (poly(A) (poly(A)

genes not identified in either library library either in identified genes not genes identified in only poly(A) in only genes identified poly(A) in only genes identified genes biased towards poly(A) genes biased towards poly(A)

B. malayi B. aegypti A. malayi B.

malayi B. malayi B. malayi B. malayi B.

B. malayi B. to Reads mapped malayi Number B. of A. aegypti (poly(A) aegypti A. to Reads mapped Table 2.1: RNA Table collectedsamples following infection the of mosquito vector Sample to Reads mapped SureSelect) malayi B. to Reads mapped Fold enrichment B. malayi for to Reads mapped SureSelect) to Reads mapped SureSelect) Number of Agilent SureSelect Number of enrichment Number of Agilent SureSelect (>3X SD) Number of (>3X SD) Total readssequenced (poly(A) Total readssequenced (poly(A) Total

on life stage rather than library preparation (Figure 2.5A), suggesting that the capture

41

Figure 2.5: Comparison of the poly(A)-enriched to the poly(A)-enriched AgSS B. malayi transcriptomes from 18 hpi, 4 dpi, and 8 dpi vector life stages. Poly(A)-enriched transcriptomes were compared to the poly(A)-enriched AgSS B. malayi transcriptomes for the 18 hpi, 4 dpi, and 8 dpi vector life stages. (A) A principal component analysis (PCA) plot was generated using the log2 TPM values for the poly(A)-enriched only (triangles) and poly(A)-enriched AgSS (circles) B. malayi from 18 hpi (red), 4 dpi (green), and 8 dpi (blue) vector samples. The size of each symbol denotes the number of reads mapped to protein-coding genes for each sample. The samples cluster based on B. malayi sample rather than by the enrichment method, indicating the poly(A)-enriched AgSS does not substantially differ from its poly(A)-enrichment only counterpart in representing the B. malayi transcriptome for the shown samples. Samples with the lowest number of reads (e.g. the 18 hpi samples) cluster further from their replicates than expected, likely attributed to an insufficient number of reads and indicating that enrichment is most desirable in these situations. (B) For each sample, the log2 TPM values for genes in the poly(A)-enriched AgSS samples were plotted against the log2 TPM values for genes in the poly(A)-enrichment only samples when expression was observed with both enrichment methods for the sample. Genes with similar expression in the AgSS and the poly(A)-enrichment samples are expected to lie close to the identity line (y = x; red). Genes whose expression values are more elevated in the poly(A)-enriched AgSS sample compared to the poly(A)-enrichment sample lie below the identity line while genes more elevated in the poly(A)- enrichment compared to the poly(A)-enriched AgSS lie above the identity line. (C) For all genes expressed using both enrichment methods for each sample, the frequency distribution of the log2- transformed ratio of the poly(A)-enriched AgSS TPM to the poly(A) enrichment TPM was plotted. Genes with log2 ratio values > 0 have higher expression values in the poly(A)-enriched AgSS sample while log2 ratio values < 0 are representative of genes with higher expression values with the poly(A) enrichment. Significantly biased genes were defined as those differing by >3 standard deviations (red) from the mean of the log2-transformed ratio values (red). Across each comparison, the number of genes with significantly elevated expression in the AgSS comprised at most 0.31% of the total number of B. malayi protein-coding genes while the number of genes with significantly elevated expression in the poly(A)-enrichment samples comprised at most 1.52% of B. malayi protein-coding genes. does not introduce a systematic bias to the transcriptome of a sample. The exceptions are

42

the 18 hpi vector samples, which we attribute to the poly(A)-enrichment only samples

having an insufficient number of reads mapped to features (143,078 and 212,342 reads,

accounting for <20 reads/gene assuming an equal distribution) to accurately represent the

transcriptome. Supporting this, in the 18 hpi samples, an average of 2,460 protein coding

genes (22.18% of the B. malayi protein coding genes) had reads mapping to them in the

poly(A)-enriched AgSS samples but not its poly(A)-enriched counterpart, while in the 4

dpi and 8 dpi samples, an average of 952 (8.58%) and 566 (5.09%) genes, respectively,

had reads mapped in the poly(A)-enriched AgSS libraries but not the poly(A)-enriched libraries. Collectively, across all 6 samples, 392 unique genes were detected in only the poly(A)-enrichment, but not its poly(A)-enriched AgSS counterpart in at least one comparison. The AgSS probe design was based on an older version of the annotation compared to the version used for feature calling, and as such, 91 of these 392 genes were not covered by a probe.

The log2 TPM values from the poly(A)-enriched AgSS and the poly(A)-enriched libraries had a positive linear correlation with r2 values ranging from 0.559–0.867, with the 4 dpi

and 8 dpi correlations being the strongest (Figure 2.5B), for the reasons discussed above,

namely that fewer reads were recovered from the 18 hpi vector samples without AgSS

enrichment. To identify individual genes in which the library preparations significantly

affected the calculated TPM, the log2 ratio of the poly(A)-enriched AgSS TPM relative to

the poly(A) enrichment TPM for each gene was calculated (Figure 2.5C). Genes with

relatively similar expression levels in the two library preparations would have a log2 ratio

of 0 while genes more highly expressed in the poly(A)-enriched AgSS preparation would

be >0 and genes more highly expressed in the poly(A) enrichment would be <0. Genes 43

with a log2-ratio value of >3 standard deviations from the mean were defined as having

significantly altered levels of expression between the two libraries. Across all six

comparisons, the number of genes with significantly higher TPM values in the AgSS

libraries never accounted for >0.31% (<34 genes) of all protein-coding genes in the B.

malayi genome. The number of genes with significantly higher TPM values in the

poly(A)-enrichment ranged from 0.93–1.52% (103–168 genes) of all protein-coding

genes in B. malayi across the six comparisons (Figure 2.4, Table 2.1).

Comparison of poly(A)-enriched AgSS to only poly(A)-enrichment libraries for A.

fumigatus

The A. fumigatus AF293 transcriptome from poly(A)-enriched AgSS and poly(A)-

enriched only libraries were compared for 11 RNA samples from immunocompromised

mice or neutropenic mice infected with A. fumigatus AF293 (Figure 2.3B). Of the 11 samples, 6 samples originated from 3 replicates of immunocompromised mice taken 2 dpi and 4 dpi with A. fumigatus. The remaining 5 samples consist of two replicates of neutropenic mice taken 4 dpi with A. fumigatus and three replicates of neutropenic mice

taken 7 dpi with A. fumigatus.

Across all 11 samples, the total percentage of reads mapping to A. fumigatus from the

poly(A)-enriched samples was consistently ≤0.02%, while 0.30%–11.74% of reads

mapped in the poly(A)-enriched AgSS samples. In the poly(A)-enrichment only samples,

a maximum of 27,531 reads mapped to A. fumigatus AF293, accounting for <3

reads/gene assuming an even distribution. As such, comparisons between AgSS and

poly(A)-enriched libraries are not possible.

44

Figure 2.6: Genes identified only in either the poly(A)-enriched AgSS or poly(A)- enrichment libraries from A. fumigatus samples For 11 A. fumigatus transcriptome comparisons, Venn diagrams were generated to display the number of genes able to be extracted using the poly(A)-enriched AgSS (orange), the poly(A)-enrichment (green), or both (blue). The numbers in parentheses indicate the proportion of total protein-coding genes identified in each gene subset.

There are 4,168–7,387 genes (41.68–76.70%) that are identified only in the poly(A)-

enriched AgSS libraries but not the poly(A)-enriched libraries, compared to the maximum of 52 genes identified only in the poly(A)-enriched libraries across all A. fumigatus comparisons (Figure 2.6, Table 2.2). For all 11 comparisons, using the AgSS methodology conferred a 614x to 1,212x fold enrichment of A. fumigatus AF293 reads relative to the poly(A)-enrichment, going from data that was unamenable to a traditional transcriptomics analysis to obtaining data where robust statistical tests can be performed.

The proportion of reads mapping to features relative to the total number of reads mapped is consistent between samples of the same library preparation, with A. fumigatus poly(A)-

45

Table 2.2a: RNA-Seq statistics for AgSS and poly(A)-enrichment libraries from A. fumigatus 2 dpi 2 dpi 2 dpi 4 dpi 4 dpi 4 dpi 4 dpi Sample replicate 1 replicate 2 replicate 3 replicate 1 replicate 2 replicate 3 replicate 1 neutropenic Mouse IC mouse IC mouse IC mouse IC mouse IC mouse IC mouse mouse Total reads sequenced (poly(A)-enriched, 1,188,035,380 444,956,050 291,247,114 66,910,882 69,572,000 140,996,494 626,096,386 Agilent SureSelect) Total reads sequenced 142,761,100 123,935,456 126,465,996 111,073,836 165,059,836 100,226,124 131,441,050 (poly(A)- enrichment)

Reads mapped to A. fumigatus 3,720,790 4,976,567 6,638,513 7,531,474 8,166,066 6,262,973 3,639,285 (poly(A)-enriched, (0.31%) (1.12%) (2.28%) (11.26%) (11.74%) (4.44%) (0.58%) Agilent SureSelect)

Reads mapped to A. 15,177 27,531 6,036 369 (0%) 1,639 (0%) 3,026 (0%) 768 (0%) fumigatus(poly(A)- (0.01%) (0.02%) (0.01%) enrichment)

Fold enrichment for A. fumigatus 1,212 846 953 824 704 738 995 reads

Reads mapped to A. fumigatus genes 1,949,698 2,652,304 3,488,234 4,063,189 4,480,786 3,414,516 1,900,680 (poly(A)-enriched, (52.40%) (53.30%) (52.55%) (53.95%) (54.87%) (54.52%) (52.23%) Agilent SureSelect)

Reads mapped to A. fumigatus genes 787 1,476 7,326 13,049 2,885 366 189 (51.22%) (poly(A)- (48.02%) (48.78%) (48.27%) (47.40%) (47.80%) (47.66%) enrichment)

Number of A. fumigatus genes 2,095 2,392 1,995 2,243 1,834 1,956 2,344 not identified in (21.75%) (24.84%) (20.71%) (23.29%) (19.04%) (20.31%) (24.34%) either library

Number of A. fumigatus genes 6,722 6,808 4,820 4,168 6,253 7,016 identified in only 7,387 (76.7%) (69.8%) (70.69%) (50.05%) (43.28%) (64.93%) (72.85%) poly(A)-enriched, Agilent SureSelect

Number of A. fumigatus genes identified in only 0 (0%) 9 (0.09%) 6 (0.06%) 52 (0.54%) 28 (0.29%) 4 (0.04%) 4 (0.04%) poly(A)- enrichment enriched AgSS samples ranging from 52.23–54.87% and the poly(A)-enrichment samples ranging from 45.89–51.22%.

46

Table 2.2b: RNA-Seq statistics for AgSS and poly(A)-enrichment libraries from A. fumigatus 4 dpi 7 dpi 7 dpi 4 dpi 4 dpi 7 dpi 7 dpi Sample replicate 3 replicte 1 replicate 2 replicate 2 replicate 3 replicte 1 replicate 2 neutropenic neutropenic neutropenic neutropenic neutropenic neutropenic neutropenic Mouse mouse mouse mouse mouse mouse mouse mouse Total reads sequenced (poly(A)-enriched, 397,712,920 110,566,202 95,475,172 873,133,438 397,712,920 110,566,202 95,475,172 Agilent SureSelect) Total reads sequenced 84,865,448 127,403,048 123,297,624 100,226,124 84,865,448 127,403,048 123,297,624 (poly(A)- enrichment)

Reads mapped to A. fumigatus 4,925,561 9,380,405 8,137,172 2,602,550 4,925,561 9,380,405 8,137,172 (poly(A)-enriched, (1.24%) (8.48%) (8.52%) (0.30%) (1.24%) (8.48%) (8.52%) Agilent SureSelect)

Reads mapped to A. 13,973 17,122 13,973 17,122 1,205 (0%) 377 (0%) 1,205 (0%) fumigatus(poly(A)- (0.01%) (0.01%) (0.01%) (0.01%) enrichment)

Fold enrichment for A. fumigatus 872 774 614 792 872 774 614 reads

Reads mapped to A. fumigatus genes 2,583,967 4,975,426 4,283,629 1,324,599 2,583,967 4,975,426 4,283,629 (poly(A)-enriched, (52.46%) (53.04%) (52.64%) (50.90%) (52.46%) (53.04%) (52.64%) Agilent SureSelect)

Reads mapped to A. fumigatus genes 6,413 7,978 6,413 7,978 569 (47.22%) 173 (45.89%) 569 (47.22%) (poly(A)- (45.90%) (46.60%) (45.90%) (46.60%) enrichment)

Number of A. fumigatus genes 2,226 1,578 1,434 2,226 1,578 1,434 2,215 (23%) not identified in (23.11%) (16.38%) (14.89%) (23.11%) (16.38%) (14.89%) either library

Number of A. fumigatus genes 7,028 5,544 5,311 7,278 7,028 5,544 5,311 identified in only (72.97%) (57.56%) (55.14%) (75.57%) (72.97%) (57.56%) (55.14%) poly(A)-enriched, Agilent SureSelect

Number of A. fumigatus genes identified in only 1 (0.01%) 11 (0.11%) 15 (0.16%) 1 (0.01%) 1 (0.01%) 11 (0.11%) 15 (0.16%) poly(A)- enrichment

Comparison of total RNA AgSS to rRNA-, poly(A)-depleted libraries for the bacterial Wolbachia endosymbiont wBm

With both of the previously discussed biological systems, we tested the effectiveness of the AgSS platform to study eukaryote-eukaryote host-pathogen interactions. By enriching for wBm reads in B. malayi samples, we sought to test the effectiveness of AgSS libraries 47

Figure 2.7: Genes identified only in either the total RNA AgSS or rRNA-, poly(A)-depletion libraries from wBm samples For nine wBm transcriptome comparisons, Venn diagrams were generated to display the number of genes able to be extracted using the total RNA, AgSS (purple), the rRNA, poly(A)-depletion (yellow), or both (blue). The numbers in parentheses indicate the proportion of total protein-coding genes identified in each gene subset. in extracting prokaryotic reads. The standard protocol for AgSS capture in eukaryotes

relies on capturing targets from poly(A)-enriched libraries, but for prokaryotic samples the capture is performed on total RNA libraries. These total RNA AgSS libraries were compared to rRNA-, poly(A)-depleted library preparations from nine B. malayi samples

taken at various stages of the life cycle to analyze the transcriptome of its Wolbachia

endosymbiont, wBm (Figure 2.3C). Three RNA samples were obtained from the

mammalian portion of the Brugia lifecycle where worms are large enough to be

physically separated from the mammalian tissue. These RNA samples generate

predominantly Brugia reads with a minority of Wolbachia reads. Of these three samples,

48 one sample was recovered at 24 dpi, representative of immature female worms while the other two samples are from two reproductively mature adult females recovered 3 months post-infection. The other six samples are from the vector portion of the B. malayi lifecycle where tripartite RNA samples are obtained, consisting of predominantly mosquito reads, with a smaller percentage of Brugia reads, and an even smaller percentage of Wolbachia reads. These six samples consist of two A. aegypti replicates each taken at 18 hpi, 4 dpi, and 8 dpi with B. malayi, representative of the L1 through early L3 B. malayi life stages. Of the six samples taken from the vector life stages, at most 1,552 reads (<0.01% of sequenced reads) mapped to the wBm genome from the rRNA-, poly(A)-depleted samples, of which <730 reads were identified as mapping to protein-coding genes for an average of <1 read mapped/gene (Figure 2.7, Table 2.3). In contrast, the rRNA-, poly(A) depleted adult female B. malayi samples taken from the mammalian host had a minimum of 17,107 reads mapping to wBm protein-coding genes, equating to ~20 reads/gene while the 24 dpi sample had 246,549 reads mapped to protein-coding genes, equating to ~294 reads/gene.

The AgSS-enriched samples had an increased number of wBm-mapping reads for all 9 samples, with 20.4–37.4 million reads (18.68–32.53%) mapping to wBm genes for the three gerbil samples and 0.2–4.0 million reads (0.46–6.43% of sequenced reads) mapping to wBm genes for the mosquito vector samples. The AgSS libraries from gerbil samples were 20–41x fold enriched for wBm reads relative to the rRNA-, poly(A)-depleted libraries while the libraries from vector samples were 353–2,242x fold enriched. The proportion of wBm reads mapped to protein-coding genes were consistently higher in the

AgSS-capture sample for both the gerbil and vector samples, ranging from 88.31–92.44% 49

Bm

2,165 0 (0%) 188 (0%) 1 (0.12%) 2 (0.24%) 78,959,932 30,676,768 85 (45.21%) 54 (28.72%) 186 (0.02%) 117 (13.95%) 691 (82.36%) 972,409 (92.81%) 1,047,694 (1.33%) 18 hpi w vector

Bm

Bm 2,242 0 (0%) 126 (0%) w 1 (0.12%) 1 (0.12%) 76,297,602 30,623,652 74 (58.73%) 28 (22.22%) 394 (0.06%) 147 (17.52%) 659 (78.55%) 703,856 (0.92%) 652,585 (92.72%) 18 hpi w vector

Bm

38 5 (0.6%) 29 (3.46%) 70 (8.34%) 28 (3.34%) 15 (1.79%) 52,924,416 124,173,684 488,165 (2.1%) 27,635 (10.62%) 75,263 (28.92%) 260,260 (90.49%) 23,197,210 (18.68%) 20,486,191 (88.31%) adult femaleadult w jird

depletion libraries from from libraries depletion -

Bm

41 , poly(A) 21 (2.5%) 7 (0.83%) - 36 (4.29%) 15 (1.79%) 31,173,032 147,440,774 134 (15.97%) 157,260 (0.5%) 377,950 (1.23%) 17,107 (10.88%) 35,523 (22.59%) 30,688,306 (20.81%) 27,689,593 (90.23%) adult femaleadult w jird

Bm

20 3 (0.36%) 21 (2.5%) 14 (1.67%) 11 (1.31%) 40 (4.77%) 73,230,576 124,210,676 225,659 (0.56%) 24 dpi jird w 24 jird dpi 246,549 (20.11%) 394,884 (33.82%) 1,167,706 (1.59%) 40,408,338 (32.53%) 37,356,885 (92.44%)

, - - - , -

depletion) - , poly(A) , poly(A) - - Seq statistics for AgSS and rRNA and AgSS for statistics Seq

depletion) - , poly(A) -

Bm reads Bm- (rRNA Bm genes (rRNA Bm (rRNA rRNA Bm (total Agilent RNA, Bm genes (total RNA, Agilent Bm (total rRNA RNA, Agilent

w w w w w

Bm genes not identified in either either in Bm genes identified not totalBm RNA, genes in only identified rRNA Bm only in genes identified w w w

depletion SD) (>2X depletion

- -

Reads mapped to w to Reads mapped Fold enrichment w for Total readssequenced (poly(A) Total Table 2.3: RNA Table Sample readsTotal sequenced RNA, (total Agilent SureSelect) to Reads mapped SureSelect) to Reads mapped SureSelect) to Reads mapped SureSelect) to Reads mapped depletion) to Reads mapped depletion) Number of library Number of Agilent SureSelect Number of poly(A) Number of wBm genes biased towards total RNA, Agilent SureSelect (>2X SD) Number wBm of genes biased towards rRNA poly(A)

and 92.04–92.81%, respectively, compared to the 10.62–20.11% and 45.21–58.73%

50

Bm

w

1,652 5 (0.6%) 2 (0.24%) 1 (0.12%) 1552 (0%) 70,535,594 41,159,150 93 (11.08%) 1,392 (0.3%) 729 (46.97%) 438 (28.19%) 541 (64.48%) 8 dpi vector 4,392,522 (6.23%) 4,056,936 (92.36%)

Bm

w

1,929 depletion libraries from from libraries depletion 5 (0.6%) 714 (0%) 2 (0.24%) 2 (0.24%) - 74,052,092 28,267,988 93 (11.08%) 1,590 (0.4%) 376 (52.66%) 158 (22.09%) 613 (73.06%) 8 dpi vector 3,607,912 (4.87%) 3,320,806 (92.04%)

, poly(A) -

Bm

w

1,081 508 (0%) 3 (0.36%) 2 (0.24%) 1 (0.12%) 96 (18.9%) 51,233,802 30,669,394 534 (0.6%) 276 (54.33%) 136 (16.21%) 608 (72.47%) 917,488 (1.79%) 846,225 (92.23%) 4 dpi vector

Bm

w

353 500 (0%) 2 (0.24%) 4 (0.48%) 2 (0.24%) 64,398,742 38,453,926 301 (0.1%) 94(18.73%) 261 (52.20%) 135 (16.09%) 610 (72.71%) 295,964 (0.46%) 273,036 (92.25%) 4 dpi vector

Seq statistics for AgSS and rRNA and AgSS for statistics Seq - , - - - , -

depletion) - , poly(A) , poly(A) - - : RNA : depletion)

, poly(A) -

identified in only rRNA in only identified Bm reads w (cont’d) Bm (total Agilent RNA, Bm- (rRNA Bm genes (total RNA, Agilent Bm (total rRNA RNA, Agilent Bm genes (rRNA Bm (rRNA rRNA

w w w w w w

Bm genes not identified in either either in Bm genes identified not totalBm RNA, genes in only identified Bm genes w w w

depletion SD) (>2X depletion

- -

Bm Table 2.3 Table w Sample readsTotal sequenced RNA, (total Agilent SureSelect) readssequenced (poly(A) Total to Reads mapped SureSelect) to Reads mapped Fold enrichment for to Reads mapped SureSelect) to Reads mapped SureSelect) to Reads mapped depletion) to Reads mapped depletion) Number of library Number of Agilent SureSelect Number of poly(A) Number of wBm genes biased towards total RNA, Agilent SureSelect (>2X SD) Number wBm of genes biased towards rRNA poly(A)

observed in the rRNA-, poly(A)-depleted samples. This difference in the number of wBm reads mapping to protein-coding genes can be partially attributed to a greater number of reads mapping to rRNAs in the rRNA-, poly(A)-depleted samples, with 18.73–33.82% of wBm mapped reads mapping to rRNAs, compared to 0.1–2.1% observed in the total

RNA, AgSS samples (Figure 2.7, Table 2.3). 51

A principal component analysis using the TPM values of the three gerbil life stage

comparisons demonstrates that individual samples segregate based on library preparation

(Figure 2.8A). All three of the total RNA AgSS samples are clustered close together,

while the rRNA-, poly(A)-depleted samples are further distributed, likely due to the

insufficient coverage of the wBm transcriptome in the absence of AgSS enrichment. The

mosquito vector life stage samples were not used for the principal component analysis

due to an insufficient number of reads mapping in the poly(A)-depletion portion of the

comparison to compare the two enrichment methods. Like the B. malayi comparisons, the

log2 TPM values from the wBm AgSS libraries and the rRNA- and poly(A)-depletion

libraries made from mammalian life stages had a positive linear correlation with r2 values

ranging from 0.575–0.744 (Figure 2.8B).

A total of 45 unique genes (~5.4% of protein-coding genes in wBm) were detected in at

least one of the three gerbil comparisons in the sample treated using the rRNA-, poly(A)- depletion libraries but not its total RNA AgSS counterpart. Of these 45 genes, 41 were not covered by any of the AgSS probes while the remaining 4 genes had <11% of their length covered by a probe. To identify genes with significantly greater expression in one enrichment method over the other, a cutoff of three standard deviations from the log2

ratio of the AgSS TPM to the rRNA-, poly(A)-depleted TPM was used (Figure 2.8C).

Genes with a log2 ratio of >3 standard deviations below the average log2 ratio were

determined to have significantly higher expression in the rRNA-, poly(A)-depletion

sample while genes with a log2 ratio >3 standard deviations above the average, were

identified as having significantly higher expression in the total RNA AgSS sample.

Across the three comparisons, a total of 20 unique genes had TPM values significantly 52

Figure 2.8: Comparison of the rRNA-, poly(A)-depleted to the total RNA AgSS wBm transcriptomes from the female 24 dpi and adult B. malayi life stages. The rRNA-, poly(A)-depleted transcriptomes were compared to the total RNA AgSS wBm transcriptomes for the female 24 dpi and adult B. malayi life stages. (A) A principal component analysis of the rRNA-, poly(A)-depleted (circle) and total RNA AgSS (triangle) wBm transcriptome from three samples of the B. malayi gerbil life stage was conducted, with one sample from female 24 dpi B. malayi (red) and two samples from adult female B. malayi (blue). The size of each point is relative to the number of reads mapped to protein-coding genes for each sample. In this instance, the samples no longer cluster based on the sample type. Instead, the samples distinctly cluster apart based on enrichment method, most likely due to the smaller number of reads present in samples lacking the AgSS capture. Consistent with this, the sample with the fewest reads, one of the rRNA-, poly(A)- depleted adult females, is most distant. The three samples enriched using the AgSS cluster together, possibly indicating the similarity of wBm transcriptomes originating from female 24 dpi and adult B. malayi. (B) For each sample, the log2 TPM values for genes detected with the AgSS capture on total RNA were plotted against the log2 TPM values for genes detected with the rRNA-, poly(A)-depletion when a gene was detected in both enrichments. Genes with similar expression in both the total RNA AgSS and the rRNA-, poly(A)-depleted samples are expected to lie close to the identity line (y = x; red). Genes whose expression values are more elevated in the total RNA AgSS compared to the rRNA-, poly(A)-depletion lie below the identity line while genes more elevated in the rRNA-, poly(A)- depletion compared to the total RNA AgSS lie above the identity line. (C) The frequency distribution log2-transformed ratio of the total RNA AgSS TPM to the rRNA- and poly(A)-depleted TPM for each gene detected in both enrichment methods was plotted. The blue, dashed line indicates the mean log2- transformed ratio while the red dashed line marks three standard deviations from the mean log2- transformed ratio. Genes with log2 ratio values >3 standard deviations above or below the mean were marked as biased towards the total RNA AgSS or the rRNA-, poly(A)-depletion, respectively. Across all 9 comparisons, an average of ~0.4 genes were detected with significantly higher expression with the total RNA AgSS, while an average of ~4.0 genes had significantly higher expression in the rRNA-, poly(A)-depleted samples. higher in the rRNA-, poly(A)-depleted sample compared to the AgSS sample in at least

53

Figure 2.9: Comparison of the poly(A)-enriched AgSS to the total RNA AgSS B. malayi transcriptomes from 18 hpi and 8 dpi vector life stages The poly(A)-enriched AgSS transcriptome was compared to the total RNA AgSS B. malayi transcriptomes for the 18 hpi and 8 dpi vector life stages. (A) For both the 18 hpi and 8 dpi vector samples, the log2 TPM values for each B. malayi gene detected in both the poly(A)-enriched AgSS and the total RNA AgSS were plotted against one another. Most genes in both plots fall close to the identity line (y = x; red), indicating their similar expression values in both the poly(A)-enriched AgSS and the total RNA AgSS libraries. (B) A histogram of the log2 ratios of the poly(A)-enriched AgSS TPM to the total RNA AgSS TPM was generated for each sample using all genes detected with both enrichment methods. The blue dashed line marks the average log2 ratio value while the red lines mark three standard deviations from the average log2 ratio. Genes with log2 ratio values >3 standard deviations above the average are marked as significantly elevated in the poly(A)-enriched AgSS while genes with log2 ratio values >3 standard deviations below the average are marked as significantly elevated in the total RNA AgSS samples. For both the library preparations, ≤1.4% (155 genes) of protein-coding genes are biased towards a single enrichment method, indicating no significant bias towards an enrichment. one comparison. Of these genes, 14 genes were not covered by any probe, four genes had

<17% of their length covered by a probe, and the remaining two genes are covered in the entirety of their length by the probe design. Additionally, there were 152 unique genes

(~18.1% of protein-coding genes in wBm) detected in the total RNA, AgSS but not the poly(A)-depletion in at least one of the comparisons. Of the 152 genes, 114 had average

54

TPM values in the bottom quartile of the expressed genes across the three AgSS samples,

highlighting the ability of the AgSS preparation to detect low abundance transcripts.

Comparison of AgSS B. malayi transcriptome data from total RNA and poly(A)- enriched samples

For eukaryote transcriptome experiments, the AgSS platform is typically preceded by construction of a poly(A)-selected library that is then hybridized to the bait probes. The poly(A)-enrichment step serves to remove high abundance, non-mRNA transcripts, such as rRNAs, tRNAs, and other ncRNAs. For bacterial samples, libraries must be constructed on total RNA since bacterial mRNAs lack the poly(A) tails present in eukaryotic transcripts. For eukaryote-prokaryote transcriptomic experiments, it may be advantageous to use the AgSS platform for both bacteria and eukaryotes on the same single library constructed from total RNA, especially for clinical samples where RNA may be limiting. Therefore, we sought to compare the AgSS enrichment on a library constructed from total B. malayi RNA with one constructed following poly(A)-

enrichment using two samples taken from the mosquito vector portion of the B. malayi

life stages at 18 hpi and 8 dpi (Figure 2.3D).

For both the 18 hpi and 8 dpi time points, most genes yield comparable log2 TPM values

between the two enrichment methods (Figure 2.9A) with <1.5% of genes having a bias

towards a particular library construction protocol (Figure 2.9B). However, despite the 8

dpi comparison yielding a similar number of genes detected unique to each enrichment

method, in the 18 hpi sample comparison, 1,033 genes (9.3% of all protein-coding genes)

were detected only in the poly(A)-enriched AgSS library while 323 (2.91%) were only

55

Figure 2.10: Genes identified only in either the total RNA AgSS or poly(A)- enriched AgSS libraries from B. malayi samples For two B. malayi transcriptome comparisons, Venn diagrams were generated to display the number of genes able to be extracted using the poly(A)-enriched, AgSS (orange), the total RNA AgSS (yellow), or both (blue). The numbers in parentheses indicate the proportion of total protein-coding genes identified in each gene subset.

detected with the total RNA AgSS library (Figure 2.10, Table 2.4). In both cases, a

majority of the genes detected in only one library construction method were within the

bottom 20% of expressed genes in the sequencing data collected, with 815 genes (78.9%)

in the selection conducted with the total RNA library and 233 genes (72.14%) in the

selection conducted with the poly(A)-enriched library. Similarly, in the 8 dpi sample

comparison, 2.35% (261 genes) of B. malayi protein-coding genes were identified only in the AgSS poly(A)-enriched library while 1.71% (189 genes) were identified only in the target AgSS total RNA library. Again, in both cases, a majority of the genes detected in only one library construction method were within the bottom 20% of expressed genes in the sequencing data collected, with 236 genes (90.42%) in the selection conducted with the poly(A)-enriched RNA library and 179 genes (94.71%) in the selection conducted with the total RNA library. Based off these results, both AgSS methods can detect low abundance transcripts, but the poly(A)-enriched AgSS was able to select for a greater number of these transcripts in both comparisons.

56

Table 2.4: RNA-Seq statistics for B. malayi transcriptome data using AgSS with total RNA or poly(A)-enriched RNA Sample 18 hpi vector 8 dpi vector Total reads sequenced (poly(A)-enriched, Agilent SureSelect) 90,868,184 251,625,132 Total reads sequenced (total RNA, Agilent SureSelect) 149,907,628 255,771,318 51,010,120 201,634,043 Reads mapped (poly(A)-enriched, Agilent SureSelect) (56.14%) (80.13%) 34,424,709 178,638,997 Reads mapped (total RNA, Agilent SureSelect) (22.96%) (69.84%) Fold enrichment for B. malayi reads 2.4 1.1 23,788,380 93,452,941 Reads mapped to genes (poly(A)-enriched, Agilent SureSelect) (46.63%) (46.35%) 6,960,950 57,535,140 Reads mapped to genes (total RNA, Agilent SureSelect) (20.22%) (32.21%) 3,026,700 10,921,606 Multimapping reads (poly(A)-enriched, Agilent SureSelect) (5.93%) (5.42%) 24,195,610 70,178,500 Multimapping reads (total RNA, Agilent SureSelect) (70.29%) (39.29%) Multimapping reads to rRNA (poly(A)-enriched, Agilent 416,329 170,569 SureSelect) (13.75%) (1.56%) 19,554,080 49,024,697 Multimapping reads to rRNA (total RNA, Agilent SureSelect) (80.82%) (69.86%) Number of genes not identified in either library prep 1,996 (18.01%) 478 (4.31%) Number of genes identified in only poly(A)-enriched, Agilent 1,033 (9.32%) 261(2.35%) SureSelect Number of genes identified in only total RNA, Agilent SureSelect 323 (2.91%) 189 (1.71%) Number of genes biased towards poly(A)-enriched, Agilent 43 (0.39%) 70 (0.63%) SureSelect (>3X SD) Number of genes biased towards total RNA, Agilent SureSelect 157 (1.42%) 144 (1.30%) (>3X SD) The percentage of reads mapping to the B. malayi genome were greater in the poly(A)- enriched AgSS samples compared to the total RNA AgSS samples, with 56.14% and

80.13% of reads mapping to the B. malayi genome, compared to 22.96% and 69.84% of reads mapped for the 18 hpi and 8 dpi samples respectively. The poly(A)-enriched AgSS samples were enriched for reads mapping to B. malayi by 2.4x for the 18 hpi sample and

1.1x for the 8 dpi sample relative to the total RNA AgSS. Additionally, in the 18 hpi and

8 dpi poly(A)-enriched AgSS samples, 46.63% and 46.35% of their mapped reads mapped to protein-coding genes while in the total RNA AgSS samples only 20.22% and

32.21% mapped to protein-coding genes (Figure 2.10, Table 2.4). This is explained by the increased number of multi-mapping reads observed in the total RNA AgSS samples,

57 with 70.29% and 39.29% of the total number of mapped reads in the 18 hpi and 8 dpi vector samples multi-mapping to the B. malayi genome. In comparison, only 5.93% and

5.42% of mapped reads were multi-mapping in the 18 hpi and 8 dpi poly(A)-enriched

AgSS samples. A total of 80.82% and 69.86% of these multi-mapping reads in the total

RNA AgSS 18 hpi and 8 dpi vector samples were found to map to the B. malayi rRNAs compared to only 13.76% and 1.56% in the poly(A)-enriched AgSS samples (Figure

2.10, Table 2.4). The AgSS platform on poly(A)-enriched RNA was able to obtain both a higher percentage of B. malayi mapped reads and a higher percentage of B. malayi reads mapping to protein-coding genes. Therefore, we recommend the use of the AgSS enrichment on poly(A)-enriched libraries for eukaryotic systems whenever possible.

Recovering the transcriptome of the major organism from libraries prepared with the AgSS of the minor organism

We explored whether sequencing data from both the major and minor organism could be accurately obtained using only data collected from an AgSS capture of the minor organism for eukaryote-eukaryote dual-transcriptomics experiments. Using the sample with the highest fold enrichment conferred by the AgSS platform, an immunocompromised mouse at 2 dpi, the host mouse transcriptome was compared between two libraries enriched using only the poly(A)-enrichment versus a poly(A)- enriched library that was AgSS-captured for A. fumigatus transcripts. When plotting the

TPM values of two library preparations against one another (Figure 2.11), the TPM values share a positive linear correlation with a r2 of 0.85, similar to the correlation values observed in the B. malayi comparisons (Figure 2.5B), indicating that the use of the A. fumigatus AgSS did not appear to largely impact the mouse transcriptome. 58

Figure 2.11: Comparison of the poly(A)-enrichment to the total RNA A. fumigatus AgSS library preparations of the transcriptomes of 2 dpi immunocompromised mouse.

Using the sample taken at 2 dpi of an immunocompromised mouse, the log2 TPM values for mouse protein-coding genes in the poly(A)-enriched, A. fumigatus AgSS were plotted against the log2 TPM values for mouse protein-coding genes in the poly(A)-enrichment only. Genes with similar expression in both enrichments lie close to the identity line (y = x; red). Mouse genes whose expression values are more elevated in the poly(A)-enriched A. fumigatus AgSS sample compared to the poly(A)-enrichment sample lie below the identity line while genes more elevated in the poly(A)-enrichment compared to the poly(A)-enriched A. fumigatus AgSS lie above the identity line.

However, of the 39,179 mouse genes, 5,424 (13.8%) were detected in only the poly(A)- library but not the corresponding AgSS-captured library, despite having designed probes.

In comparison, only 128 mouse genes were detected only in the A. fumigatus AgSS

library but not the poly(A)-library. While the transcripts captured seem to have the same expression profile, >10% of the genes were missed, and thus, we would recommend only assessing the organism for which the AgSS capture system was designed if the goal is to obtain the complete transcriptional profile of the major member.

59

Discussion

With multi-species RNA-Seq experiments emerging at the forefront of experimental designs for transcriptome analyses, novel methods are being developed to more efficiently extract the transcriptomes for multiple organisms from a single sample. By comparing the transcriptomes generated with standard enrichment methods (e.g. poly(A)- enrichment or rRNA-/poly(A)-depletion) to the AgSS platform, we sought to independently validate the use of the AgSS platform in enriching for transcripts originating from a specific organism of interest while minimizing any type of library bias on the transcriptome. When analyzing the B. malayi transcriptomes of samples prepared with or without the AgSS enrichment, most genes could be detected using both library preparations. However, the samples prepared using the AgSS system enriched the number of reads mapped by 3x–146x compared to samples enriched only with a poly(A)- enrichment enabling the consistent detection of a greater number of genes, most of which were low abundance transcripts. In the A. fumigatus comparisons, the poly(A)- enrichment by itself was unable to extract a sufficient number of A. fumigatus reads to accurately represent the transcriptome of the sample. When the poly(A)-enrichment was supplemented with AgSS capture, the enrichment of A. fumigatus reads was increased

614–1,212x. When using the AgSS instead of the rRNA-, poly(A)-depletion preparation for the Wolbachia endosymbiont wBm transcriptome, the AgSS samples were enriched for wBm mapped reads by 20–41x in the gerbil life cycle samples and 353–2,242x in the vector life cycle samples.

For each of these samples, if the fold enrichment for the number of mapped reads in the poly(A)-enriched or total RNA AgSS is plotted against the percentage of reads mapped in 60

Figure 2.12: Scatter plot of the percentage of poly(A)-enrichment or depletion reads mapped against the fold enrichment value observed using the AgSS follows a power law The fold enrichment value for each sample used in the poly(A)-enrichment/depletion versus AgSS comparisons for B. malayi (red), A. fumigatus (blue), and wBm (green) was calculated by taking the ratio of the percentage of AgSS reads mapped to the percentage of poly(A)-enrichment/depletion reads mapped. The relationship between these fold enrichment values and the percentage of the poly(A)- enrichment/depletion reads mapped to the reference genome can be fitted to a power law. The fold enrichment conferred using the AgSS platform exponentially increases with the exponential decrease of the percentage of reads mapped using the standard enrichment/depletion methods up to an ~1300-fold enrichment.

its poly(A)-enrichment or poly(A), rRNA depletion counterpart, the data follows a power

law (Figure 2.12). As expected, the fold enrichment obtained with the AgSS capture is inversely proportional to the percentage of mapped reads obtained with the currently used enrichment methods of poly(A)-enrichment to capture eukaryotic RNA, or rRNA- and poly(A)-depletion to capture bacterial RNA from multi-species samples, indicating samples with a lower relative abundance of RNA from the target organism yield the greatest fold enrichment.

61

While the AgSS can enrich a sample for transcripts of interest, the platform requires

probes designed for each gene to avoid any enrichment bias. Most instances of the AgSS

failing to detect a gene that was detected in samples enriched using only the poly(A)-

enrichment or rRNA-, poly(A)-depletion was due to a lack of designed probes for the

gene. Because the B. malayi annotation is currently being updated, the baits were

designed with an older annotation than the one used for feature counting. Similarly, the

bait design for wBm was based on the annotation published by Foster et al. (70) while the

annotation used for feature counting originated from GenBank (NC_006833.1). In both

cases, most genes present only in the annotation used for the bait design were not

detected with the AgSS platform.

These results illustrate the power and advantages of using probe-based enrichment to

facilitate the analysis of samples that are the most insightful, like those in animal tissue at

a biologically relevant multiplicity of infection. The AgSS platform serves as a method to

extract reads from a secondary or tertiary organism in a sample as well as to detect low

abundance transcripts. Using this platform, we successfully extracted a sufficient number

of reads to represent the B. malayi and wBm transcriptomes in the vector portion of the B.

malayi life cycle, the A. fumigatus transcriptome in a mouse infection, and the wBm

transcriptome in the mammalian portion of the B. malayi life cycle. Provided a

transcriptome experiment has an adequate bait design, we believe the AgSS platform to

be ideal and necessary in enriching samples for multi-species transcriptomic experiments, especially those in which secondary organisms are of low abundance.

62

Data Availability

The data set(s) supporting the results of this article are available in the Sequence Read

Archive (SRA) repository. The B. malayi datasets are available in SRP068692, the wBm datasets are available in SRP068711, and the A. fumigatus datasets are available in

PRJNA421149.

Additional Information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Acknowledgements

This project was funded by federal funds from the National Institute of Allergy and

Infectious Diseases, National Institutes of Health, Department of Health and Human

Services under grant number U19 AI110820.

Authors’ Contribution

MC and JCDH wrote the manuscript. LT, HL, SL, NK, XZ and REB performed sample preparation, library construction, capture experiments, and/or sequencing. MC and AS conducted bioinformatic data analysis. LJT, LS, CMF, DAR, JMF, VMB and JCDH were involved in study design and data interpretation. SGF. and MLM provided samples. All authors read, edited, and approved the manuscript.

63

Chapter 3: FADU: A quantification tool for prokaryotic transcriptomic analyses 1

Abstract

Quantification tools are often designed and tested using human transcriptomics data sets, in which full-length transcript sequences are well annotated. For prokaryotic transcriptomics experiments, full-length transcript sequences are seldomly known and coding sequences must instead be used for alignments and/or quantification. The problems of using coding sequences instead of transcript sequences is exacerbated in prokaryotes due to the presence of operons, in which a single transcript does not necessarily equate to a single gene. We demonstrate here that quantification tools such as featureCounts or HTSeq will not derive any counts for ambiguous reads by default, instead discarding any fragment that maps to multiple genes, such as those in operons.

Additionally, we demonstrate that tools that attempt to derive counts from ambiguous reads such as eXpress, kallisto, or Salmon have difficulty assigning counts to smaller operonic genes (<350 bp). With FADU (Feature Aggregate Depth Utility), we designed a quantification tool designed specifically for prokaryotic RNA-Seq analyses. FADU assigns partial count values proportional to the length of the fragment overlapping the target feature. To assess the ability of FADU in quantifying genes for prokaryotic transcriptomics analyses, we compared its performance to eXpress, featureCounts,

HTSeq, kallisto, and Salmon across three paired-end read data sets of: (1) , (2) Escherichia coli, and (3) the Wolbachia endosymbiont wBm. Across each

1 Chung M, Adkins RS, Mattick JSA, Bradwell KR, Shetty AC, Sadzewicz L, Tallon LJ, Fraser CM, Rasko DA, Makurkar A, Dunning Hotopp JC. FADU: A quantification tool for prokaryotic transcriptomic analyses. Genome Biol., submitted 2019. 64

of the three data sets, we find that FADU can more accurately quantitate operonic genes

by deriving proportional counts for multi-gene fragments within operons. FADU is available at https://github.com/IGS/FADU.

Importance

Most currently available quantification tools for transcriptomics analyses have been

designed for human data sets, in which full-length transcript sequences, including the untranslated regions, are annotated. In most prokaryotic systems, full-length transcript

sequences have yet to be characterized, leading to prokaryotic transcriptomics analyses

being performed based on only coding sequences. In contrast to eukaryotes, prokaryotes

contain polycistronic transcripts and when genes are quantified based on coding

sequences instead of transcript sequences, this leads to an increased abundance of

improperly assigned ambiguous multi-gene fragments, specifically those mapping to

multiple genes in operons. Here we describe FADU, a quantification tool designed

specifically for prokaryotic RNA-Seq analyses with the purpose of quantifying operonic

genes while minimizing the pitfalls associated with improperly assigning fragment counts

from ambiguous transcripts.

Introduction

The quantification step in a typical differential expression analysis of transcriptomic data

involves the calculation of fragment counts overlapping each gene. Quantification tools,

such as featureCounts (131) or HTSeq (130) requires first aligning paired end reads to a

reference genome. This typically involves the mapping of sequencing reads to whole

genome assemblies using aligners such as Bowtie (157, 159), BWA (160), or HISAT

65

(161). The quantification step uses the output alignment file in combination with an

annotation file to assess the number of sequenced fragments that intersects the

coordinates of each target gene. A different subset of quantification tools were developed

to be used in conjunction with de novo transcriptome assemblies, including the cufflinks

suite of tools (162-165), RSEM (134), and eXpress (166). These tools require an

alignment file mapped to reference transcripts instead of whole genomes. Most recently,

k-mer-based alignment-free tools, such as kallisto (132) and Salmon (133), have been developed that use raw sequencing reads and reference transcript sequences to directly assign counts to target features using pseudoalignments and k-mer based counting approaches.

However, an issue regarding these approaches is the requirement of a well-annotated reference genome in which the sequences of full-length transcripts have been identified.

Each of the aforementioned quantification tools were developed using human transcriptomics data sets as a template and because of this, these tools are deficient when used to quantify genes for prokaryotic transcriptomics studies. In ideal conditions, transcriptomic studies would quantify genes at the transcript level, but the lack of complete transcript annotations for many non-model eukaryotes and most prokaryotes forces transcriptomic analyses to be conducted at the coding sequence level. This can be problematic in genomes with dense coding capacity and overlapping transcripts. It is especially problematic when analyzing prokaryotes due to the abundance of polycistronic transcripts from operons, with there being an estimated 630-700 operons in the >5000 genes found in the E. coli genome (135).

66

At the core of these problems is the method with which ambiguous reads are quantified.

Ambiguous reads can be divided in two categories: (1) multi-gene fragments, in which a paired-end fragment maps uniquely but overlaps multiple features and (2) multi-mapping fragments, in which a paired-end fragment maps to multiple genomic regions equally, such as in the case of reads originating from duplicated genes or genomic regions. Each current tool has a different method to quantify these types of ambiguous reads. As an example, the default mode of HTSeq, HTSeq union, does not quantify multi-gene fragments and instead marks them as ambiguous (130). By default, featureCounts will assign a multi-gene fragment to the feature that maps to the majority of the individual reads in the paired-end fragment. As an example, if one paired-end read of a fragment maps to geneA and geneB while the second paired-end read maps to geneB, the count will be assigned to geneB. In cases where the left paired-end read maps to geneA and the right paired-end read maps to geneB, the fragment is marked as ambiguous and not quantified (131). However, each of these approaches leads to an underestimation of the actual mapped reads for these genes and by extension, an under estimation of gene expression.

Tools such as cufflinks, eXpress, kallisto, and Salmon apply abundance estimation algorithms to assign partial counts for both multi-gene fragments and multi-mapping reads. As an example, RSEM, eXpress, kallisto, and Salmon all apply an estimation- maximization algorithm to ambiguous fragments. Briefly, the expectation-maximization algorithm uses the number of fragments that unambiguously align to a target transcript to estimate the counts from ambiguous fragments originating from that target transcript.

67

This assigned probability from each of the ambiguous fragments will be based on the

number of secondary mappings along with each of their mapping qualities (167).

As we demonstrate here, each of these tools have issues when used to quantify prokaryotic genes. For featureCounts and HTSeq, the occurrence of multi-gene fragments occurs frequently in operons. Discarding these reads under-estimates the abundance of operonic genes, especially smaller genes in the middle of operons. The tools cufflinks, eXpress, kallisto, RSEM, and Salmon all require a reference containing transcript sequences. When coding sequences are used as the reference instead, fragments that map primarily to the untranslated regions of genes are likely to be discarded. Additionally, the absence of transcript annotations for operons further complicates the analysis. Using coding sequences as the units for quantification implies that each coding sequence is a unique observation, which is not the case for operons. For fragments that overlap multiple genes in an operon this causes the fragment count to be improperly split into multiple genes when it should be counted equally for both genes. Ideally, this issue would be solved by identifying the full-length transcript sequences for prokaryotes using laboratory techniques such as 5’- and 3’- RACE (rapid amplification of cDNA ends), or

direct RNA sequencing. However, this is currently not practical given the number of

different prokaryotic systems being studied.

In this study we developed FADU (Feature Aggregate Depth Utility), a quantification

tool specifically designed for prokaryotic transcriptomics analyses. FADU uses an

alignment file generated by aligning reads to a whole genome assembly and handles

ambiguous fragments by proportionally assigning fragment counts. Given a multi-gene

68

fragment, FADU assesses the proportion each fragment overlaps with the nonunique

positions of each of its overlapped features and assigns a proportional fragment count. By

assigning proportional read counts, FADU avoids the pitfalls other tools have in

quantifying operonic genes while minimizing the errors derived from assigning counts

from multi-gene fragments. Here, we describe the implementation of FADU and compare

its performance and utility to the alignment-dependent quantification tools featureCounts,

HTSeq, and eXpress and the alignment-free quantification tools kallisto and Salmon.

Results

Comparing the performance of FADU against other quantification tools

We compared the performance of FADU to the default modes of alignment-based quantification tools eXpress, featureCounts, and HTSeq and the alignment-free quantification tools kallisto and Salmon for three different transcriptomics data sets: (a) paired end reads from a standard (i.e. not strand-specific) library constructed from

Escherichia coli RNA, (b) paired end reads from a standard library constructed from

Ehrlichia chaffeensis RNA, and (c) paired-end reads from a strand-specific library constructed from Wolbachia endosymbiont of Brugia malayi wBm RNA. For the alignment based quantification tools, reads were mapped to a genomic reference for featureCounts and HTSeq or a coding sequence reference for eXpress. For the alignment- free quantification tools kallisto and Salmon, reads were directly quantified using a coding sequence reference. Across all three datasets, the counts obtained using FADU are highly correlated to those obtained with the five other quantification tools (r2 > 0.8)

(Figure 3.1).

69

Figure 3.1: Comparing fragments counts from FADU versus other quantification tools. Across the three prokaryotic transcriptomic data sets of unstranded (A) E. coli, (B) E. chaffeensis, and (C) stranded wBm, fragment counts obtained using FADU were compared to counts obtained using eXpress, featureCounts, HTSeq union, kallisto and Salmon. For each plot, each point indicates a gene with its x-axis value dictated by its log2 count obtained using FADU and its y-axis value dictated by the log2 counts obtained using the noted tool. Genes in red are ≤200 bp in size.

To identify instances where FADU was obtaining significantly higher or lower read

70

Figure 3.2: MA plots comparing wBm fragment counts obtained using FADU against the other quantification tools. The fragment counts for the stranded wBm data set obtained using FADU were compared to the counts obtained using eXpress, FADU, featureCounts, HTSeq union, kallisto, and Salmon with an MA plot. Each point denotes a gene, with red points being indicative of genes ≤200 bp. The x-axis denotes the mean average of the two compared counts (A) while the y-axis denotes the log2 ratio of the two compared counts (M). The horizontal orange dotted lines on each plot are drawn at log2 ratio values of 2 and -2. Points above 2 were defined as genes significantly under-counted by FADU relative to its compared tool while points below -2 are indicative of genes significantly over-counted by FADU relative to its counterpart. counts relative to the other quantification tools, we used the wBm data set to generate

MA plots between FADU and the other five quantification tools (Figure 3.2). Each plot represents a pairwise comparison between FADU and one of the five other quantification tools. The mean average of the log2 counts (A) is plotted on the x-axis while the log2 ratio

counts (M) is plotted on the y-axis. Genes with a log2 count ratio ≥2 or ≤2 were defined

as being significantly under- or over-counted by FADU relative to its counterpart. We

identified 63 genes in the wBm data set across the five pairwise comparisons that were

significantly un der-counted by FADU. Of these genes, 39 (61.9%) were under-counted

by FADU relative to only kallisto and Salmon. The fragment count contributions from

71

these 39 genes are largely from ambiguous multimapping reads. While FADU,

featureCounts, and HTSeq all discard flagged multimapping reads for quantification, the

use of the expectation-maximization algorithm allows eXpress, kallisto and Salmon,

combined with bootstrapping support in the case of kallisto and Salmon, to assign

probabilistic read counts to these genes. While eXpress also uses an expectation-

maximization model to assign multi-mapping reads, the use of an alignment file rather than reads as an input likely affects the tool’s ability to quantify these genes.

Additionally, 11 (15.9%) of these genes have significantly higher count values with

featureCounts, HTSeq, kallisto, and Salmon. For all 11 genes, the contributing reads are

largely in the dovetail orientation, in which one mate pair extends past the beginning of

the other. While these reads were paired successfully using HISAT2, they were not

marked as concordant, leading to these reads being discarded during quantification for

both eXpress and FADU. Similarly, 7 genes with significantly higher counts in only

featureCounts, kallisto, and Salmon are largely dovetail reads with the difference that

they are in close proximity to nearby genes.

While tools such as eXpress, kallisto, and Salmon use an expectation-maximization algorithm to assign ambiguous fragment counts, by default featureCounts and HTSeq discard ambiguous fragments and require additional options for their quantification. The -

O options of featureCounts allows for counts from a multi-gene fragment to be assigned to each overlapped gene. Additionally, when used in conjunction with the –fraction option, featureCounts will assign a fractional count based on the number of genes a multi-gene fragment overlaps. As an example, if a multi-gene fragment overlaps two

72

genes, a count value of 1 would be assigned to both genes using the -O option and a

count value of 0.5 would be assigned to both genes when using both the -O and the –

fraction options. We also tested FADU against three additional options for HTSeq: (a) -

mode intersection-nonunique, (b) -mode intersection-strict, and (c) -mode union used in conjunction with -nonunique all. Compared to HTSeq union, HTSeq intersection- nonempty is more liberal in assigning multi-gene fragments. Given a multi-gene fragment, HTSeq intersection-nonempty takes the intersect of the features found at each non-empty position and if only one feature is returned, a count is assigned to that feature.

HTSeq intersection-strict is more conservative and takes the intersect of the features found at all positions of the fragment and again, if only one feature is returned, a full count is assigned to that feature. Additionally, HTSeq contains the option –nonunique all, which assigns a count value of 1 to all genes overlapped by a multi-gene fragment, similar to the -O option of featureCounts. In the case of a multi-gene fragment that spans two genes within an operon, a count value of 1 will be assigned to both genes. Because each individual fragment corresponds to a single transcript, and both these genes are within the same transcript, the fragment should be counted equally for both features.

In the wBm data set, 78 genes were identified that were significantly under-counted in at least one quantification tool relative to FADU (Figure 3.2). Of these 78 genes, 19

(24.4%) were under-counted in all five analyzed quantification tools relative to FADU. In all 19 cases, these discrepancies were due to fragments that span two different coding sequences. Of the 19 genes, 6 are within a putative 15-gene ribosomal-protein operon

(Figure 3.3A). Within this 15 gene region, empirically 13-14 genes should have roughly

73

Figure 3.3: The performance of quantification tools on deriving counts for operons.

(A) Using the wBm stranded RNA-Seq data set, the log10 read depth (orange) was plotted for part of an operon containing 15 genes that all code for different ribosomal proteins. Genes labeled and marked in turquoise are all significantly under-counted (log2 counts ratio <-1) by one of the other quantification tools as assessed in Figure 3. (B) For each quantification tool, the read depth per bp was calculated for all 15 genes displayed and divided by the median read depth per bp for each quantification mode to obtain a normalized relative count value. The log2-transformed normalized relative count values are displayed in the individual cells of the table. Because the 15 ribosomal genes are likely transcribed together, we would expect the normalized values obtained for each of the 15 genes to be similar to one another, and empirically we expect this to be the case for at least 13 of the 15 genes. Normalized values in red cells indicate that the gene has a greater count value relative to the other operonic genes while blue cells indicate that the gene has a lesser count value relative to the other operonic genes. Tools that discard ambiguous reads span two features in close proximity have a tendency to under-count the smaller genes in operons, such as Wbm0332 and Wbm0333. the same counts (all but Wbm0335 and possibly Wbm0337) (Figure 3.3A) allowing us to directly evaluate these methods. Therefore, we divided the fragment counts for each gene by gene length to obtain a fragment count per bp value for each gene. For each tool individually, these fragment count per bp values were normalized by dividing by the

74

median fragment count per bp for all genes in this 15 gene region (Figure 3.3B). Given

that empirically 13-14 genes should have roughly the same count, ideally a tool would

yield normalized value close to 0 for all genes except Wbm0335 and possibly Wbm0337.

When normalized values are >0, it indicates that the gene is relatively over-counted, and when the values are <0, it indicates the gene is relatively under-counted.

For Wbm0328 and Wbm0339, despite both genes having a fragment depth that appears to be similar to the rest of the putative operon (Figure 3.3A), HTSeq union, intersection-

nonempty, and intersection-strict obtain virtually no counts. The close proximity of the

adjacent genes combined with the small size of Wbm0328 (309 bp) makes it difficult for

an unambiguous multi-gene fragment to be assigned to Wbm0328. Similarly, this also

occurs for Wbm0332 and Wbm0333, which are two of the smallest genes contained in

the putative operon. Because of the small size of Wbm0332 and Wbm0333, eXpress has

difficulty identifying reads that mostly map to each of the two genes (Figure 3.3B).

Similarly, the smaller size makes it difficult for kallisto and Salmon to identify unique k-

mers for these genes, leading to both genes being under-counted by the k-mer based tools

(Figure 3.3B). In comparison, FADU, featureCounts overlap, featureCounts fractional-

overlap, and HTSeq nonunique are able to obtain the expected counts for all four genes

(Figure 3.3B).

By assigning fragment counts based on proportion, FADU runs the risk of incorrectly

assigning counts in instances where fragments from the untranslated region of one gene

overlap the coding sequence of another gene, particularly in the case of gene-dense

genomes. Because FADU assigns a fragment count to all overlapped features, the count

75

Figure 3.4: The improper quantification of fragments that span multiple features.

(A) Using the wBm stranded RNA-Seq data set, the log10 read depth (orange) was plotted across the wBm pseudogene (gene490; red) flanked by the protein-coding genes Wbm0395 and Wbm0396. (B) For each of the listed quantification modes, the number of counts for gene490, Wbm0395, and Wbm0396 were obtained. The counts assigned to gene490 are all derived from fragments that also map to the 5’-end of Wbm0395. Despite the wBm annotation lacking UTRs, these reads likely originate from the 5’-UTR of Wbm0395, indicating that any reads assigned to gene490 are erroneous.

value from a fragment that originates from the untranslated region (UTR) of one gene can

be mistakenly assigned to another gene overlapped by the fragment. While this is

erroneous, scenarios such as this are difficult to avoid without well-annotated transcripts

and without undercounting smaller operonic genes. However, the proportional fragment

counts assigned by FADU minimizes the errors from these fragments. As an example, the

wBm pseudogene gene490 is in close proximity to Wbm0395 (Figure 3.4A), such that reads from the unannotated UTR of Wbm0395 are erroneously being counted for gene490. Ideally, no fragment counts would be derived for gene490. FADU mitigates this issue by only assigning proportional fragment counts based on the percentage of the fragment’s length that overlaps a feature. In comparison, featureCounts overlap, featureCounts fractional-overlap, and HTSeq nonunique assign many more reads (>9X)

76

Figure 3.5: Clustering patterns of the fragment counts derived from different quantification methods.

(A) An unrooted dendrogram with 1000 bootstraps was generated using the log2 count values from wBm calculated using eXpress, FADU, the three different modes of featureCounts, the 4 different modes of HTSeq, kallisto, and Salmon. The dendrogram reveals three distinct clusters of (1) featureCounts default, HTseq union, and HTSeq intersection-nonempty; (2) HTSeq intersection-strict; (3) eXpress, kallisto, and Salmon; and (4) FADU, featureCounts overlap, and featureCounts fractional- overlap. (B) A principal component analysis for all wBm count values derived from each of the tools is visualized. Each color corresponds to quantification tool, while each shape represents the specific mode of the tool used. (C) The log2 count values for all wBm genes with count values derived from at least one of the tools was used to generate a heatmap. The wBm genes are displayed on the horizontal axis while each of the tools are displayed on the vertical axis. All cells in grey describe genes with no count value in its corresponding tool. Bootstrap values for both the unrooted and squared dendrograms are located next to their corresponding nodes. to gene490 (Figure 3.4B). Collectively, by assigning proportional counts, FADU is able to better quantify operonic genes while mitigating the issues stemming from quantifying multi-gene fragments.

77

Figure 3.6: FADU timing and memory benchmarking. For the unstranded E. coli, E. chaffeensis, and the stranded wBm data sets, the (A) speed for the indexing and/or alignment steps (red), and quantification steps (blue) along with the (B) maximum memory used (green) for eXpress, FADU, the three different modes of featureCounts, the 4 different modes of HTSeq, kallisto, and Salmon was recorded. Clustering analyses of the different quantification modes

The results from the different tools cluster into four distinct groups based on how each of the different tools handle fragments mapping to multiple features (Figure 3.5A). FADU, featureCounts overlap, and featureCounts fractional-overlap, which are more liberal in assigning counts, form a cluster while HTSeq union, HTSeq intersection-nonunique, and featureCounts default, which are all more conservative, form another cluster.

Additionally, eXpress, kallisto, and Salmon, which all use coding sequences as a reference instead of genomic sequences, form a third cluster. HTSeq intersection-strict does not cluster with any of the groups, due to it being the most stringent in assigning fragment counts to features. Similarly, in the principal component analysis, the variation observed in the first principal component is primarily from the counts derived from

78

HTSeq intersection-strict while the second principal component variation is derived from

the quantification tools that use coding sequences as an input reference (Figure 3.5B).

These differences translate into the number of genes detected with each tool. HTSeq

intersection-strict obtains counts for the fewest number of genes (Figure 3.5C). FADU, featureCounts overlap, featureCounts fractional-overlap, and HTSeq nonunique cluster together and yield counts for the most genes relative to the other tools. However, counts obtained using featureCounts overlap, featureCounts fractional-overlap, and HTSeq nonunique are on average higher than the counts obtained using FADU or other quantification tools. Because FADU assigns proportional read counts based on how much of a fragment overlaps a feature, FADU minimizes the errors associated with overcounting genes in close proximity to one another while better quantifying small operonic genes from ambiguous multi-gene fragments.

Timing and memory benchmarks

We compared the speed (Figure 3.6A) and memory usage (Figure 3.6B) of FADU relative to the ten other quantification modes. For the alignment-based quantification tools eXpress, FADU, featureCounts, and HTSeq, the times for alignment and quantification were individually recorded, while for the alignment-free quantification tools kallisto and Salmon, the times for indexing and quantification were individually recorded. Across all three data sets analyzed, kallisto performs the fastest and uses the least memory. The other alignment-free quantification tool, Salmon, performs quicker than the alignment-based tools when used to quantify the E. chaffeensis and E. coli data sets, but performs remarkably slower when used to quantify the wBm data set. The speed

79

of FADU is comparable to the other alignment-based quantification tools for the E. chaffeensis and wBm data sets. However, when used to analyze the E. coli data set,

FADU and the different modes of HTSeq have longer run times compared to eXpress and featureCounts. When comparing the maximum memory usage between the different tools, HTSeq consistently requires the most memory. In comparison, FADU often requires more memory than eXpress and featureCounts, but consistently requires less memory than HTSeq.

Discussion

In an ideal transcriptomics differential expression analysis with current analysis tools, full-length transcripts of the reference organism should be used for alignment and/or quantification. In the case that full-length transcripts are available, we believe that ambiguous reads should be able to be assigned with confidence with methods such as the expectation-maximization algorithm and bootstrap support implemented in kallisto or

Salmon. However, in cases where full-length transcripts are not available, coding

sequences must be used instead. When coding sequences are used, fragments that

typically map primarily to the UTRs of a transcript may only map partially to the coding

sequence, leading to these fragments being unable to be quantified. This also occurs in

the case of multi-gene fragments spanning operonic genes.

Assigning these multi-gene fragments becomes difficult, especially in the context of

prokaryotes. Without the knowledge of complete transcript sequences, it becomes

impossible to determine whether a multi-gene fragment stems from an operonic transcript

or overlaps two genes in close proximity that are independent transcriptional units. While

80

operon databases and prediction software are available, they are not comprehensive and

the gold standard for determining the sequence of full-length transcripts is through 5’- and 3’-RACE, which becomes laborious to perform preceding transcriptomics experiments. While advances in long read sequencing will eventually lead to native strand RNA sequencing (168), in which the sequences of full-length transcripts can be obtained, current technologies are lacking in a method for easily annotating full-length transcripts.

In situations where full-length transcript sequences are unavailable, mapping directly to transcript references will fail to quantify a subset of reads that originate primarily from untranslated regions instead of the coding sequence. For cases such as this, genome references perform better for deriving counts from non-coding sequence reads. By using the precise mapping information provided by a genome-based alignment, quantification tools can better infer whether a fragment overlaps a target gene. However, as we’ve shown, currently available quantification tools that use a genome-based alignment, ambiguous reads are often either too conservatively or too liberally assigned, resulting in an under or over estimation of transcript expression. FADU was designed for the purpose of optimizing the available alignment information for systems in which full-length transcript annotations are unavailable. By assigning proportional read counts, FADU attempts to maximize the quantification of operonic genes while minimizing the false over-quantification of non-operonic genes in close proximity to one another.

81

Materials and Methods

Implementation

FADU was written entirely using the Julia programming language v1.0 (169) and relies

heavily on the BioAlignments.jl and GenomicFeatures.jl packages. GenomicFeatures.jl is

used to parse out record information from GFF files and determine overlaps between alignments and features. BioAlignments.jl is used to quickly parse out record information from the BAM file. FADU was tested and benchmarked in the UNIX environment.

Computing non-overlapping annotation feature coordinates

FADU first creates an interval tree-based data structure of the feature annotation (GFF3)

input file consisting of the sequence ID, the leftmost coordinate, the rightmost coordinate,

the strand, and the feature metadata. This data structure is used to construct a set of all

non-overlapping coordinates per strand. A non-overlapping coordinate is defined as a

genomic position that is not overlapped by two or more recorded features. If the BAM

alignment input file is unstranded, the set of overlapping coordinates per attribute ID will

be strand-agnostic.

Defining fragments versus reads

The BAM alignment input file is read in one alignment record at a time. For each record,

validation steps are performed to ensure the record is mapped, is a primary record, and

exceeds the specified minimum mapping quality (default score of 10). If the option to

remove multimapped alignments is enabled, records whose ‘NH’ attribute exceeds a

value of 1 are also removed. Next, the record is assessed whether it is part of a fragment

by taking into account read pair information. In order to classify a read as part of a

82 fragment, both the template length of the read pair (default: <1000 bp) and the 0x2 bitflag of each record are assessed.

Records that do not meet the qualifications to be processed as “fragments” are instead processed as “reads”. For records classified as “fragments”, only one of the reads of the pair is kept because the coordinate information of the mate pair can be inferred. If the option to keep only properly paired reads is enabled, only records that can be classified as

“fragments” will be kept.

Processing overlaps between alignments and features

Once a BAM record passes validation, and is classified either as a fragment or a read for downstream processing, the record is used to create an interval tree-based data structure consisting of the record reference sequence name, the leftmost and rightmost coordinate of the fragment or read, the strand of the fragment or read, and a designation of

“fragment” or “read”. Once a sufficient number of these data structures are read into memory (default chunk size is 10,000,000), the overlaps for each alignment record to the specified annotation feature type are processed.

FADU calculates the proportional count for each fragment as follows:

= 𝐹𝐹𝑂𝑂 𝐹𝐹𝑐𝑐 𝐹𝐹𝐿𝐿 Fc represents the fractionalized fragment count contribution for a given fragment or read for each feature ID while FO represents the number of bases the fragment overlaps the feature and FL represents the total length of the fragment. In the case of reads that were

83

unable to be processed as fragments, Fc is halved to prevent the overestimation the

contribution of discordant reads or singletons to the fractionalized count of a feature.

Writing output

Once all records have been processed, the total counts for every feature ID for the specified attribute type is calculated and used to calculate normalized TPM values for each feature ID. For each feature ID, five tab-delimited fields are written to file: (a) feature ID, (b) length of non-overlapping coordinates, (c) number of alignments to overlap with the feature, (d) total fractionalized alignment counts for the feature, and (e) the TPM count for the feature.

Availability of data sets

Three data sets were used in all analyses consisting of RNA-Seq paired-end data from standard, non-stranded libraries originating from E. chaffeensis and E. coli, and stranded libraries from wBm. The sequencing reads for the three datasets can be found in the

NCBI Sequencing Read Archive at the following accession numbers, respectively:

SRX485438, SRX1322474, and SRX2505171.

Quantification tool comparisons

Reads from each of the data sets were aligned using HISAT2 with the -X option used to set minimum fragment length and the --no-spliced-alignment option. The minimum fragment length set for each data set was identified as the two standard deviations from the mean fragment length using the CollectInsertSizeMetrics tool of PicardTools v2.9.4.

As references, the combined genomes of Canis familiaris (GenBank:

GCA_000002285.2) and E. chaffeensis Arkansas (GenBank: NC_007799.1), E. coli

84

0127:H6 strain E2348/69 (GenBank: NC_011601.1), and the combined genomes of

Brugia malayi (WormBase: WBPS9) and its Wolbachia endosymbiont wBm (GenBank:

NC_006833.1) were used for the FADU, featureCounts, and HTSeq runs for E. chaffeensis, E. coli, and wBm data sets, respectively. Additionally, the coding sequences of the corresponding references were used as the references for eXpress, kallisto, and

Salmon. MA plots were generated between FADU and the default modes of eXpress, featureCounts, HTSeq, kallisto, and Salmon. For each plot, the mean average of the log2 counts (A) was calculated and plotted against the log2 ratio of counts (M).

The parameters used for FADU, eXpress, featureCounts, HTSeq, kallisto, and Salmon used for analyses and benchmarking are available at https://github.com/IGS/FADU. Plots were generated using the R packages ape, FactoMineR, gggenes, ggplot2, gplots, ggpubr, pvclust, and venn.

Acknowledgements

This project was funded by federal funds from the National Institute of Allergy and

Infectious Diseases, National Institutes of Health, Department of Health and Human

Services under grant number U19 AI110820.

Authors’ Contributions

MC, RSA, and JCDH wrote portions of the manuscript. MC, JSAM, and ACS conducted bioinformatics analyses. MC, RSA, KRB conceived and optimized the algorithm. RSA performed benchmark analyses. MC created figures. LS and LJT oversaw sequencing.

CMF, DAR, AM, and JCDH designed and oversaw the analyses. All authors read, edited, and approved the manuscript.

85

Chapter 4: Drug repurposing of bromodomain inhibitors as a novel therapeutic for lymphatic filariasis guided by multi-species transcriptomics 1

Abstract

Most of the current anti-helminthic treatments for filarial disease have limited macrofilaricidal efficacy. To identify and test potential repurposed drugs to target adult nematodes, we conducted a multi-species RNA-Seq on the filarial nematode Brugia malayi, its Wolbachia endosymbiont wBm, and its laboratory vector A. aegypti across the entire B. malayi life cycle. We find little evidence for global gene regulation in the bacterial endosymbiont wBm, but the noncoding 6S RNA levels correlate with bacterial replication rates, rising as the nematode matures from L3 to adult life stages before plummeting in microfilariae. For A. aegypti, the transcriptional response reflects the stress B. malayi infection exerts on the mosquito with indicators of increased energy demand and decreased fitness observed at 18 hpi. Mosquito stress proteins are upregulated as the nematodes mature to their L3 life stage. In the mammalian stages, B. malayi has male-specific kinase signaling cascades, while the transcription profile of adult females, embryos, and microfilariae includes a significant number of chromatin remodeling proteins. The B. malayi genes upregulated in adult females, embryos, and microfilariae include those for 15/18 bromodomain-containing proteins, which are

involved in chromatin remodeling, including 2 members of the bromodomain and extra-

terminal (BET) protein family. The BET inhibitor JQ1(+), which was originally

1 Chung M, Teigan LE, Libro S, Bromley RE, Olley D, Kumar N, Sadzewicz L, Tallon LJ, Mahurkar A, Foster JM, Michalski ML, Dunning Hotopp JC. Drug repurposing of bromodomain inhibitors as a novel therapeutic for lymphatic filariasis guided by multi- species transcriptomics. In revision at Nat Commun. 2019. 86

developed as a cancer therapeutic, caused lethality and sterility of adult worms in vitro

suggesting it may be a potent adulticidal drug and novel treatment for lymphatic

filariasis.

Keywords

Lymphatic filariasis, filarial nematodes, nematodes, Brugia malayi, Wolbachia, Aedes

aegypti, mosquito, transcriptomics, RNASeq, multi-species RNASeq, bromodomain,

bromodomain inhibitor, 6S RNA

Background

Lymphatic filariasis is spread by mosquitoes infected with one of the three known

causative agents of lymphatic filariasis: Wuchereria bancrofti, Brugia malayi, or Brugia

timori. As of 2016, an estimated 856 million individuals are living in areas with ongoing

transmission of the disease (8). Infected individuals are often asymptomatic with

subclinical parasite-associated immunosuppression (170). Approximately one third of

affected individuals have chronic lymphoedema and/or hydrocele, caused by the damage

the nematodes inflict onto the human lymphatic system. Although rarely fatal, >25

million affected individuals are afflicted with hydrocele and >15 million individuals are

afflicted with lymphoedema.

Mass drug administration (MDA) programs aimed at eradicating the causative agents of

lymphatic filariasis employ a combination of diethylcarbamazine, doxycycline,

ivermectin, and/or albendazole (171-174). Although each of the drugs have different

mechanisms of action (4), all three drugs are primarily microfilaricidal and have a limited efficacy on adult worms (174, 175). The persistence of adult worms after treatment

87 requires expensive, prolonged 4-6-week drug treatments, with the possibility of repeated drug treatments over several years (176). Although the current MDA programs have been successful in lowering transmission of lymphatic filariasis (8), the limited adulticidal efficacy of the available drugs indicates a need for novel therapeutics.

The obligate relationship between B. malayi and its Wolbachia bacterial endosymbiont has provided one of the recent novel targets in filarial nematode elimination efforts (177-

179), since inhibiting wBm halts development and reproduction in B. malayi and, more importantly, kills adult worms. Other strategies based on genomic and transcriptomic analyses use predicted functional annotation (114, 116, 180, 181) and differential expression analyses (117, 118) to identify potential drug targets in both B. malayi and wBm.

Intraperitoneal infection of Mongolian gerbils by Brugia malayi is a frequently used system for studying lymphatic filariasis, since all mammalian life stages of the nematode can be isolated in the laboratory from a small rodent (182). However, B. malayi does not cause pathology in the gerbil. Although, several transcriptomics studies have been conducted on the filarial worm, its bacterial endosymbiont, and/or its vector using this laboratory system (117, 118, 151), no comprehensive study has been able to simultaneously analyze the transcriptome of all three organisms across the entirety of the

B. malayi life cycle. One of the main reasons is the inability to recover a sufficient number of reads to represent the transcriptome of wBm, which has a very low relative abundance in several key life stages. To address this, we used rRNA- and poly(A)- depletions (146) and Agilent SureSelect (AgSS) transcriptome capture to enrich for

88

Wolbachia reads for selected samples (Chapter 2) (183). Here, we report the transcriptomes of B. malayi, wBm, and its laboratory vector, Aedes aegypti, across the

entire B. malayi life cycle to better understand the biological interplay between the

organisms. We expect this to be a community resource and as such have provided access

to the data at the gene level for B. malayi, http://gcid.igs.umaryland.edu/pub/bm-RNA-

life; wBm, http://gcid.igs.umaryland.edu/pub/wb-RNA-life; and A. aegypti

http://gcid.igs.umaryland.edu/pub/aa-RNA-life. Chromatin remodeling seems to be a key

activity in adult females and their embryos such that bromodomain inhibitors were

examined as a potential therapeutic for lymphatic filariasis.

Results

B. malayi transcriptome

A total of 16 distinct B. malayi life cycle stages were sampled (Figure 4.1, Table 4.1).

Of the 11,085 predicted B. malayi protein-coding genes, 10,398 (93.8%) had statistically significant differential gene expression across the life cycle. A hierarchical clustering and principal component analysis (PCA) of these differentially expressed genes separates these samples into four distinct clusters consisting of (A) the adult male samples, (B) the adult female, embryo, immature microfilaria, and mature microfilaria samples, (C) the L1 and L2 samples, and (D) the L3, L4, and immature adult samples (Figure 4.2). The

10,398 differentially expressed genes were further clustered into 15 expression modules using WGCNA (184) (Figure 4.3), with each cluster being split into a primary expression profile and a secondary, inverse expression profile (185).

89

Figure 4.1: Brugia malayi life cycle and sample selection Initial larval development (mf, L2, L3) takes place in the thorax of the mosquito vector after ingestion of microfilaremic blood from the mammalian host. Subsequent development of L3 to L4 and finally to the mature adult occurs upon introduction of L3 to the mammalian host, in this case in the peritoneal cavity of the Mongolian gerbil. Life cycle stages sampled from mosquitoes included microfilariae that had penetrated the mosquito midgut and entered thorax muscle (18 hpi), molt to L2 larvae (4 dpi) and to L3 larvae (8 dpi) in mosquito thorax, and vector-derived infective L3 from the mosquito head. Life cycle stages sampled from gerbils included: early L3 infection (1-4 dpi), molt to L4 larvae (8 dpi), molt of males (20 dpi) and females (24 dpi) to immature adults, sexually mature adults (>70 dpi), eggs and embryos from adult females, immature microfilariae not yet infective to mosquitoes, and mature infective microfilariae.

Adult males. The second largest expression module identified in B. malayi consists of

1,481 genes upregulated and 72 genes downregulated in adult male B. malayi (Figure

4.3). While there were no significantly enriched functional terms in the downregulated

genes, PapD-like proteins with major sperm protein domains along with kinases and

phosphatases were significantly over-represented in genes upregulated in adult male B. malayi, as previously described (117). Between 30-45% of proteins with kinase domains and phosphatase domains were upregulated in males, likely indicative of male-specific signaling cascades in B. malayi. Adult male nematodes also upregulate an over- abundance of proteins containing BTB/Kelch-associated domains. The Kelch repeats of

90

Table 2.1: RNA-Seq life cycle samples and library preparations Nematode A. Sample B. malayi wBm Host aegypti rRNA-, poly(A)- 1 dpi gerbil poly(A)-selection NA depletion rRNA-, poly(A)- 2 dpi gerbil poly(A)-selection NA depletion rRNA-, poly(A)- 3 dpi gerbil poly(A)-selection NA depletion rRNA-, poly(A)- 4 dpi gerbil poly(A)-selection NA depletion rRNA-, poly(A)- 8 dpi gerbil poly(A)-selection NA depletion rRNA-, poly(A)- 20 dpi male gerbil poly(A)-selection NA depletion rRNA-, poly(A)- 24 dpi female gerbil poly(A)-selection NA depletion 24 dpi female gerbil poly(A)-selection wBm AgSS NA rRNA-, poly(A)- adult male gerbil poly(A)-selection NA depletion rRNA-, poly(A)- adult female gerbil poly(A)-selection NA depletion adult female gerbil poly(A)-selection wBm AgSS NA rRNA-, poly(A)- embryo gerbil poly(A)-selection NA depletion immature rRNA-, poly(A)- gerbil poly(A)-selection NA microfilariae depletion mature rRNA-, poly(A)- gerbil poly(A)-selection NA microfilariae depletion poly(A)- 18 hpi mosquito B. malayi AgSS wBm AgSS selection poly(A)- 4 dpi mosquito B. malayi AgSS wBm AgSS selection poly(A)- 8 dpi mosquito B. malayi AgSS wBm AgSS selection poly(A)- infective L3 mosquito poly(A)-selection NA selection

these proteins have been found to form a β-propeller that interacts with actin and

intermediate filaments for cytoskeletal rearrangement (186) while the BTB domains have been found to frequently interact with the Cul3 family of ubiquitin ligases (187-189).

91

Figure 4.2: Hierarchical clustering and PCA of differentially expressed B. malayi genes

(A) For each sample, the z-scores of the log2 TPM values of the 10,398 differentially expressed B. malayi genes, were used to hierarchically cluster the B. malayi samples using 100 bootstraps with support values indicated adjacent to the nodes. (B) A PCA was done using the z-score of the log2 TPM values of the 10,398 differentially expressed B. malayi genes. The variation represented by each of the first two principal components is indicated in parentheses next to the axes titles. In both the hierarchical clustering and PCA, four clusters are observed. Samples represented by triangles were prepared using a poly(A)-selection, while samples represented by circles were prepared using a B. malayi-specific AgSS capture. The B. malayi samples cluster based on sample rather than library enrichment, with all samples enriched using the AgSS clustering with their respective poly(A)-selected counterparts rather than each other. Adult female, embryos and microfilariae. The largest co-expression module identified

92

Figure 4.3: B. malayi WGCNA expression modules

The heatmap shows the z-score of the log2(TPM) values of the 10,398 differentially expressed B. malayi genes across the B. malayi lifecycle. The horizontal color bar and the key above the heatmap indicates the samples represented in each of the heatmap columns. The leftmost vertical color bar on the left of the heatmap represents the 15 different WGCNA expression modules generated using the 10,398 differentially expressed B. malayi genes. Within each individual expression module, there are two expression patterns that are the opposite of one another. The inner vertical colored bar indicates whether the gene clusters with the main (grey) or inverse expression pattern (black) of the module. in our data set includes 1,615 upregulated and 150 downregulated genes in the gravid

93

adult female, embryo, and microfilariae life stages of B. malayi. Because the bodies of

gravid adult females are filled with embryos, the transcriptional profile of adult female

worms is likely dominated by the transcriptional profile of eggs and embryos. No

modules were identified describing genes differentially expressed solely in adult females

or solely in the immature or mature microfilariae samples (Figure 4.3). The hierarchical cluster analysis of the B. malayi samples show the immature and mature microfilariae samples to be intermingled with one another (Figure 4.2A), indicating that despite differences in mosquito infectivity phenotype, there is no large scale significant difference in the transcriptomes of the different stages of microfilariae in this clustering- based analysis.

The 1,615 genes upregulated in the adult female, embryo, and microfilaria stages were found to be significantly over-represented in proteins with domains associated with transcription and transcriptional regulations as well as DNA-, protein- and ATP-binding, as has been previously described (117). Of the identified over-represented functional terms in this set of genes, chromatin remodeling is a recurring theme (e.g. 41/74 superfamily 1 and 2 (SF1/2) helicases (190, 191)). This module is enriched in bromodomain-containing proteins, which interact with chromatin remodeling complexes to regulate transcription or to facilitate DNA repair (192), and PHD-type zinc fingers, which have been found to have roles in epigenetic modifications and chromatin remodeling (193). The module is also enriched for SANT/Myb domain-containing proteins. SANT domains bind to histones for remodeling chromatin and regulating transcription (194), while Myb domains function as transcription factors, often playing an essential role in vertebrate haematopoiesis (195). Most of these SANT/Myb domain- 94

containing proteins (10/11) also contain homeobox-like domains suggesting a role in

cellular differentiation and embryonic development (196). The 150 downregulated genes in this module have an overrepresentation of ATP synthesis-coupled proton transport and hydrogen ion transmembrane transporter activity.

Two clusters were also obtained from subsets of samples from these same stages. In the

1,335 genes upregulated in only the embryo and adult female samples (Figure 4.3), we identified a significant overrepresentation of proteins involved with DNA replication, also previously identified (117, 118), likely reflecting the intense DNA replication in developing embryos. In the microfilariae, there are 1,242 upregulated genes over- represented in genes with no assigned functional terms (Figure 4.3). In the filarial nematode Onchocerca volvulus, the overrepresentation of GPCRs in the microfilarial life stages is proposed to meet unique chemosensory requirements needed for the migration of the microfilariae to the bloodstream of its definitive host for uptake by a vector host

(120). Although not statistically significant (Fisher’s exact test, Bonferroni correction, p- value=0.7), we find 36 of the 127 (28.3%) GPCRs encoded by B. malayi in this microfilariae-specific module.

Vector stage L1 and L2. Of the B. malayi vector samples, the L1 and L2 nematode samples taken at 18 hpi and 4 dpi were most similar (Figure 4.2) with a cluster of 1,032 genes that were highly upregulated in the 18 hpi samples and slightly less upregulated in the 4 dpi samples (Figure 4.3), indicative of the acclimation of the nematode to the mosquito vector. There is an overrepresentation of genes involved in protein production, cleavage, and degradation, suggesting a large turnover of proteins associated with the

95

mammal to vector insect host transition. These includes ribosomal components and

proteins with translational roles, including those with translational initiation factor, tRNA

aminoacylation, RNA-binding, and RNA processing activity. This cluster includes the

α/β proteasomal subunits, which function in peptide cleavage and protein degradation,

and most of the LSM domain proteins, which are thought to have a role in mRNA

degradation and splicing (197) and rRNA stability (198). This suggests that as nematodes

transition from the mammal to the insect host, the dominant process becomes facilitating

rapid protein turnover.

Vector-derived L3. There are two modules that represent genes upregulated later in B.

malayi vector development. The larger module contains genes that are more highly

upregulated in the 8 dpi samples while the smaller module contains genes that are more

highly upregulated in the infective L3 samples.

The larger module of 763 genes upregulated in the 8 dpi samples is overrepresented in

ion transport proteins, ion channels, and integral membrane proteins. It also contains 13

of 43 (30.2%) of B. malayi subtype 2 immunoglobulin (Ig)-like fold containing proteins,

including four containing fibronectin type-III repeats, an extracellular membrane protein

thought to facilitate crosstalk between cell functions and the extracellular environment

(199).

The smaller module of 267 upregulated genes is over-represented in proteins without annotated functions. However, this module does contain nine G protein-coupled receptors that could serve as a chemokine receptor to direct fully developed L3 to the mosquito head in preparation for a blood meal or to direct them once they infect the vertebrate

96

definitive host. There are also three abundant larval transcripts (ALTs), a class of larva-

specific proteins, whose expression levels peak at the infective L3 stage and drop upon

entry into the definitive host. Although ALTs have unknown and likely varying

functions, they are promising vaccine targets for filariasis because they are larva-specific,

highly expressed, and have no mammalian homologs (200). Although this module

represents the infective state of B. malayi, it is dominated by uncharacterized proteins,

indicating the need for additional studies.

Rodent-derived L3, L4, and early adults. The mosquito-derived infective L3s and the

gerbil-recovered L3s are transcriptionally most similar according to our hierarchical

clustering and PCA, despite the nematodes being in remarkably different host

environments (Figure 4.2). However, we did not recover a module of genes in the

WGCNA analysis that are shared between the two (Figure 4.3).

There are 455 nematode genes upregulated 1, 2, 3, 4, 8, 20, and 24 dpi of the gerbil

samples, which represents a wide diversity of life stages and molts including the L3, L4,

and sexually immature adult life stages (Figure 4.3). The only overrepresented functional

term identified describes membrane proteins.

A module was recovered containing 153 genes upregulated at specifically 1 and 2 dpi of

the gerbil host, indicative of the transcriptional alterations during the transition to the

vertebrate host. However, no over-represented functional terms were observed.

At 4 dpi of the gerbil, the cuticle surrounding the head thickens in anticipation of the L3 to L4 molt at 8 dpi (182). There is a module containing 411 genes upregulated at 4 dpi that is over-represented in structural constituents of the nematode cuticle and collagen 97

Figure 4.4: Hierarchical clustering and PCA of the differentially expressed wBm genes

(A) For each sample, the z-score of the log2 TPM values of the 336 differentially expressed wBm genes were used to hierarchically cluster the wBm samples using 100 bootstraps with support indicated at each node. Samples represented by triangles were prepared using a rRNA-, poly(A)-depletion while samples represented by circles were prepared using a wBm-specific AgSS capture performed on total RNA. (B) A PCA was performed using the z-score of the log2 TPM values of the 336 differentially expressed wBm genes. The first and second principal components are plotted on the x- and y-axes, respectively. The variation represented by each of the principal components is indicated in parentheses next to the axes titles. Unlike the B. malayi hierarchical clustering and principal component analyses, the wBm samples do not form discrete clusters. triple helix repeats. There is also an over-representation of proteins containing a zona

98

Figure 4.5: wBm WGCNA expression modules and 6S RNA expression

(A) The heatmap shows the z-score of the log2 (TPM) values of the 336 differentially expressed wBm genes across the B. malayi lifecycle. The horizontal color bar and the key above the heatmap indicates the samples represented in each of the heatmap columns. The leftmost vertical color bar on the left of the heatmap represents the nine different WGCNA expression modules generated using the differentially expressed wBm genes. The inner vertical colored bar indicates whether the genes cluster with the main (grey) or inverse expression pattern (black) of the module. (B) The upper plot indicates the number of reads mapped to genes in each of the samples. The bottom plot indicates the TPM value of the 6S RNA compared with that of all other transcripts for all poly(A)-selected samples. The Agilent SureSelect samples are excluded from the plot because there were no probes designed to capture the non-protein coding RNAs such as the 6S RNA. Additionally, one replicate of 4 dpi underwent column purification, leading to the loss of small RNAs, and is not shown. pellucida domain. While best known for their role in sperm attachment to oocytes (201), zona pellucida domains can have other functions. Additionally, all five proteins 99 containing the EGF domain found in malarial parasite merozoite surface protein 1

(MSP1) are upregulated at this stage.

The third B. malayi molt occurs 8 dpi of the gerbil with the secretion of a new collagenous cuticle (182). This corresponds with transcription of collagen-triple helix repeat-containing proteins that are structural constituents of the cuticle.

Once L4 larvae have matured to early adults, at 20 dpi for the males and 24 dpi for the females, 283 genes are upregulated with an over-representation in proteins annotated as being found in the extracellular space. No expression modules differentiating the 20 dpi male and 24 dpi female samples were recovered, despite the development of gonads.

However, the transcriptome profiles of the 20 dpi male and 24 dpi female samples are sufficiently different that they cluster apart in the hierarchical clustering and PCA

(Figure 4.2). wBm transcriptome

The wBm transcriptome was assessed for all the same time points as B. malayi with the exception of the infective L3 life stage. Of the 839 annotated wBm protein-coding genes,

336 (40.0%) were identified to be differentially expressed in at least one B. malayi life stage. However, hierarchical clustering and PCA of the differentially expressed genes in each of the samples reveals poorly defined clusters (Figure 4.4), relative to those from B. malayi. The most discrete cluster appears to be the wBm transcriptome recovered from the vector samples relative to the mammalian samples. While these samples were prepared using an AgSS targeted capture, we previously identified that this did not impart a significant bias relative to its poly(A)-, rRNA-depleted counterpart (Chapter 2) (183).

100

The third largest wBm expression module describes 31 wBm genes upregulated in adult

male and female B. malayi (Figure 4.5A). This cluster of genes is significantly over-

represented in protein-folding, with 5/9 protein-folding genes, different than the prior

observation of an overrepresentation of chaperones in only adult female B. malayi (118).

Despite the physiological differences between adult male and female worms, there

appears to be no discernible difference in the wBm transcriptomes of males and females

in the PCA (Figure 4.5B).

The first two principal components account for 42.6% of the variation across the samples.

The first principal component was found to be largely represented by ribosomal

components while the second principal component had no significantly enriched

functional terms. However, most of the ribosomal components are from a single, very

large operon. Therefore, it is not clear if this result reflects the co-transcription of these genes in the operon, or if operons confound differential expression analyses that expect each measure of transcription to be independent.

6S RNA. Additionally, we observed that the expression of the 6S RNA, a noncoding

RNA whose expression correlates with bacterial replication rates (202-205), increases as

B. malayi matures from L3 to the adult and embryo life stages (Figure 4.5B). Upon the embryo maturing to microfilaria, the expression of the 6S RNA drops precipitously.

A. aegypti transcriptome

The transcriptomes of A. aegypti 18 hours, 4 days, and 8 days after B. malayi infection

were compared to the transcriptomes from time-matched controls that were fed

uninfected dog blood. Of the 17,313 protein-coding genes annotated in A. aegypti, 6,653

101

were found to be differentially expressed in at least one of the conditions. Hierarchical

clustering and PCA shows distinct clustering of infected and control samples (Figure

4.6). The 6,653 differentially expressed genes were clustered into eight expression modules using WGCNA (Figure 4.7).

Figure 4.6: Hierarchical clustering and PCA of the differentially expressed A. aegypti genes

(A) A hierarchical clustering analysis of the z-score of the log2 TPM values of the 6,653 differentially expressed A. aegypti genes, was used to cluster the A. aegypti samples. Post-blood feeding controls (hpbf, dpbf) were obtained and used as a control for the post-infection samples at the same time interval (hpi, dpi). Support values were calculated using 100 bootstraps and are indicated adjacent to the nodes. (b) A PCA was performed using the z-score of the log2 TPM values of the 6,653 differentially expressed A. aegypti genes with the variation indicated in parentheses next to the axes titles. Some clusters likely represent differential gene expression relevant to blood feeding. For instance, the third largest WGCNA cluster contains 727 A. aegypti genes upregulated at

18 hours in both infected and control mosquitoes (Figure 4.7). Over-represented functional terms identified in this module include signal recognition particle (SRP)- dependent protein targeting, GOLD domains, and endoplasmic reticulum localization with functions that target proteins for secretion (206, 207).

102

Figure 4.7: A. aegypti WGCNA expression modules

The heatmap shows the z-score of the log2 TPM values of the 6,653 differentially expressed A. aegypti genes. The horizontal color bar and the key above the heatmap indicates the samples represented in each of the heatmap columns. The leftmost vertical color bar on the left of the heatmap represents the nine different WGCNA expression modules. The inner vertical colored bar indicates whether the genes cluster with the main (grey) or inverse expression pattern (black) of the module.

18 hours post feeding. By 18 hpi, microfilariae ingested from an infected mammal have

103

invaded the mosquito thoracic muscles and begun to shorten to a distinct morphological

form commonly referred to as the L1 or sausage stage. There are 446 A. aegypti genes upregulated at this time point in only the infected mosquitoes (Figure 4.7). The most significantly over-represented functional terms involve proteins with oxidoreductase activity and with roles in ATP synthesis-coupled proton transport and the tricarboxylic

acid cycle. Additionally, we observe an over-representation in structural constituents of

the ribosome. Confirming previous observations, this enrichment of proton transport and

ATP synthase activity at 18 hpi is an indicator of metabolic disturbance in an early response to B. malayi infection (151).

There are 209 A. aegypti genes that are upregulated solely in the 18 h blood-fed control

samples. The only significantly over-represented functional terms in this cluster describe

insect-allergen-related proteins, a protein family found to be widespread in insects with

roles in nutrient uptake and detoxification (208), and vitellogenins, a family of secreted proteins with an important role in yolk formation, known as vitellogenesis (209).

Previous studies have found vitellogenin (210, 211) and insect allergen protein expression to be a typical post-feeding effect in mosquitoes (212, 213). The absence of vitellogenin and insect allergen-related protein upregulation in infected mosquitoes confirms previous observations that B. malayi has an adverse effect on vector fitness

(214, 215).

4 days post infection The 4 dpi samples mark the L1 to L2 molt of B. malayi in the thoracic muscle. Although no expression modules contained genes only upregulated at 4 days in infected mosquitoes, the largest A. aegypti expression module generated contains

104

2,601 genes upregulated specifically in the blood fed control samples (Figure 4.7). Over-

represented encoded proteins in this cluster include helicases, zinc fingers, and proteins

with bromodomain folds. Additionally, the cluster is over-represented in kinases,

phosphatases, and other proteins with roles in intracellular signal transduction. These

transcriptional changes and triggered signal pathways are absent in nematode-infected mosquitoes, possibly indicating that the continued development of B. malayi in the

mosquito is interfering with normal metabolic activity.

8 days post infection. The second largest module of A. aegypti differentially expressed genes includes 1,272 genes specifically upregulated at 8 days in infected A. aegypti,

coinciding with the L2 to L3 molt of B. malayi in the thoracic muscle (Figure 4.7). As

with previous studies, we observed an increase in α-crystallin heat shock proteins a

family of chaperones induced to cope with stressful conditions (151, 216), likely from the

continued growth of B. malayi. However, while a previous study observed the small heat

shock proteins to have a temporal expression pattern at 1, 6, and 8 dpi of B. malayi, we

only observed detectable upregulation of these heat shock proteins at 8 dpi (151, 216).

Additionally, proteasomal components, including serine proteases, (217) were identified

to be over-represented in this cluster of genes. No other expression modules describing

specific expression were identified in time-matched controls.

Post-blood feeding versus post-infection. In the fourth largest A. aegypti expression

module, 419 genes, were upregulated in only blood fed control mosquitoes while 170

genes were upregulated only in infected mosquitoes (Figure 4.7). While no functional

terms were over-represented in upregulated genes from infected mosquitoes, the controls

105 were over-represented in translational proteins and structural constituents of the cytosolic ribosome as opposed to the structural constituents of the mitochondrial ribosome that were upregulated in the 18 hpi mosquitoes.

B. malayi drug target identification and in vitro validation

Because current mass drug administration (MDA) programs for lymphatic filariasis are primarily microfilaricidal and have a limited efficacy on adult worms (174, 175), we sought to identify a potential adulticidal drug target using our RNA-Seq data set. Of the

18 bromodomain-containing proteins encoded by the B. malayi genome, 15 were identified in the largest B. malayi expression module, which contained genes upregulated in the adult female, embryo, and microfilariae life stages. Of the three remaining bromodomain-containing proteins, two were found to be upregulated in the adult female and embryo life stages while one was upregulated only in the microfilariae life stages. As such, all 18 bromodomain-containing proteins are associated with adult females, embryos, and microfilariae. Bromodomains interact with chromatin remodeling complexes to regulate transcription or to facilitate DNA repair protein accessibility (218).

The overabundance of these bromodomain regulatory proteins in the adult female, embryo, and microfilariae life stages could indicate an underlying need for large-scale chromatin remodeling during these specific life stages.

JQ1(+) is a bromodomain inhibitor that binds to members of the bromodomain and extra- terminal (BET) family of transcription factors (219). By competitively binding to BET proteins, JQ1(+) prevents BET proteins from binding to acetylated lysine residues on chromatin and recruiting transcription factors (220, 221). JQ1(+) has been identified to be

106

a promising cancer therapeutic, having been shown to prevent BRD4, a BET family

protein, from interacting with and recruiting Myc, a transcription factor involved in cell

proliferation that has been found to be constitutively active in several cancers, such as

acute myeloid leukemia and multiple myeloma (221-225).

Of the 18 bromodomain proteins encoded by the B. malayi genome, only two are members of the BET family of proteins and orthologs of the human and mouse BET proteins. Our transcriptional data indicates they are both upregulated in the adult female, embryo, and microfilariae life stages. Similarly, Caenorhabiditis elegans has three

reported BET orthologs: bet-1, bet-2, and F13C5.2. Although there were no observed

RNAi phenotypes for bet-2, knockdowns of both bet-1 and F13C5.2 were found to result

in embryonic lethality, larval arrest, locomotion alterations, and sterility (226, 227),

indicating the importance of the C. elegans BET proteins in reproduction. Additionally,

interfering with the activity of BET proteins has the potential to cause adult worm

lethality, with knockdowns of bet-1 causing adult C. elegans lethality and knockdowns of

F13C5.2 causing ruptures in the vulva of female C. elegans (226, 227). Due to the

potential importance of the BET proteins in B. malayi reproduction and sterility, given

the parallels seen in C. elegans, we tested the efficacy of JQ1(+) on B. malayi.

Adult female B. malayi were treated with JQ1(+) ranging from 1-100 µM to assess the effect of BET inhibition on worm viability. The efficacy of the drug on the adult female worm was assessed at 6 and 24 h post-treatment using motility scoring (228) and the colorimetric MTT viability assay (229). At both time points 10 and 100 µM JQ1(+) were found to cause complete loss of motility in adult female worms, while the negative

107 enantiomer JQ1(-) and ivermectin only causing complete loss of motility at 100 µM, and for ivermectin this was only observed at 24 hours (Figure 4.8AB). At 6 h, adult worms had decreased viability according to the MTT assay at 10 µM and 100 µM JQ1(+) relative to the DMSO vehicle control (Figure 4.8CD). At 24 h, 10 and 100 µM JQ1(+) resulted in significantly reduced viability relative to the DMSO vehicle control while there was a great deal of variation in the 100 µM ivermectin treatment. As such JQ1(+) has adulticidal effects on filarial nematodes.

Additionally, worm fecundity was assessed by measuring microfilariae shed from treated adult female worms. At 6 and 24 h, all concentrations of JQ1(+) resulted in reduced fecundity, providing evidence that even 1 µM of JQ1(+), the lowest tested concentration, may potentially sterilize adult female worms (Figure 4.8EF).

Discussion

This analysis of the B. malayi transcriptome reveals overarching transcriptional themes during critical life stages. The over-representation of kinases and phosphatases coupled with an under-representation of DNA-binding proteins in adult male B. malayi indicates that signaling likely occurs through male-specific kinase signaling cascades. Conversely, adult female and embryo B. malayi stages are over-represented in DNA-binding proteins such as zinc fingers and helicases. Adult females, embryos, and microfilariae need to undergo significant chromatin remodeling for further development.

Proteins associated with chromatin remodeling, including those that contain bromodomains, may be a prospective new drug target for the development of adulticidal therapeutics. Our results with the BET inhibitor JQ1(+) indicate that it is more lethal than

108

Figure 4.8: Effects of JQ1(+) treatment on adult female B. malayi The efficacy of JQ1(+) was assessed on adult female B. malayi at 6 h (A,C,E) and 24 h (B,D,F) using (AB) a motility scoring assay, (CD) the colorimetric MTT viability assay measured at 530 nm, and (EF) microfilariae counts. Each point is indicative of two replicates and the error bars indicate standard error of the mean. For each of the assays, the effect of 1, 2.5, 5, 7.5 ,10, and 100 µM JQ1(+) and its negative enantiomer JQ1(-) was assessed along with 100 µM ivermectin (a known antihelminthic drug) and DMSO (a vehicle control). Asterisks indicate significant differences (p < 0.05, Student’s unpaired t- test) between the JQ1(+) and JQ1(-) treatments.

ivermectin in adult female worms. Additionally, JQ1(+) treatment reduced microfilariae

output more than ivermectin, suggesting that JQ1(+) could be sterilizing the adult female

worms. Future work needs to focus on microfilaricidal effects as well as assessing the

efficacy of JQ1(+) in gerbils, measuring worm burden, development, and fecundity.

Although JQ1(+) was initially touted as a prospective cancer therapeutic, its low half-life

in sera (<0.9 h) limits its efficacy (230). Because of this, a number of BET inhibitors have been developed with longer half-lives and are currently in clinical trials as cancer therapeutics (221). Given their promise as a cancer therapeutic, we can anticipate more

109

BET inhibitor derivatives and more studies need to ensure safety (223). Derivatives that more specifically targeting filarial nematodes may be desirable.

In contrast to published multi-species transcriptomes of filarial Wolbachia endosymbionts across their invertebrate host’s life cycles (118, 231), we observed very few changes in the wBm transcriptome across our data set. Although the absence of the heme, riboflavin, and FAD biosynthetic pathways in the B. malayi genome has generated hypotheses that wBm provides these needed metabolites to B. malayi by upregulating endosymbiont gene expression (181), our data suggests that these pathways may be constitutively expressed in wBm. While there may be specific wBm genes that are differentially expressed, we see little indication of global regulation of gene expression beyond that of chaperones and ribosomal proteins. The PCA does not resolve the samples, and the WGCNA modules do not show clear delineation of gene functions or of likely co-transcribed genes as is expected with global gene regulation. We propose that rather than altering the expression of specific metabolic pathways across the B. malayi life cycle, wBm may be altering its rates of replication, transcription, and translation as evidenced by the changes in the 6S RNA. This may not be surprising, given that nematode Wolbachia are obligate intracellular bacteria with limited environmental variation. Despite the lack of global regulation of gene expression in wBm across the life cycle, it is possible that there is global regulation of gene expression within different populations of B. malayi cells (e.g. oocytes or spermatocytes).

Our data show that parasitism by B. malayi has severe effects on transcription related to normal homeostatic processes in A. aegypti. At 18 hpi proteins with roles in ATP-coupled

110

protein transport and mitochondrial translation were elevated, likely due to increased

energy demands by infected mosquitoes. This upregulation coincides with previous observations of the breakdown of mitochondria near the nematode and the depletion of microfiber glycogen reserves upon nematode infection of A. aegypti (232). Additionally,

the absence of upregulated vitellogenin and insect allergen-related proteins in the 18 hpi

samples suggests a decrease in A. aegypti fecundity upon the uptake of B. malayi (214,

215). At 4 dpi, coinciding with the L1 to L2 molt of B. malayi, there is a remarkable

upregulation of many genes in the control mosquitoes that were not present in the

infected mosquitoes. These include proteins with roles in transcriptional regulation, such

as zinc fingers and bromodomains, along with proteins with roles in signaling cascades,

such as kinases, phosphatases, and intracellular signal transduction proteins. These large

perturbations in the A. aegypti transcriptome suggest a large impact on vector host

viability by B. malayi.

Overall, this study has led to the discovery of BET inhibitors as potential therapeutics for

lymphatic filariasis. Our multi-species RNA-Seq analysis across the B. malayi life cycle

serves as a comprehensive data set that can enable the identification of other therapeutics

and address other questions regarding the interplay between the different organisms in the

B. malayi life cycle.

Methods

Sample preparation--Nematodes

Third stage larvae of Brugia malayi strain FR3 were isolated from infected mosquitoes 9-

16 days post infection (dpi) using the NIAID/NIH Filariasis Research Reagent Resource

111

Center (FR3) Research Protocol 8.4 (233). Larvae were isolated in RPMI + P/S, enumerated, and either flash frozen in liquid nitrogen for storage at -80°C or used for the infection of male Mongolian gerbils (Meriones unguiculatus; Charles River, Wilmington,

MA, USA). Gerbils were anesthetized with 5% isoflurane and infected by introducing

400 L3s into the peritoneal cavity using a butterfly catheter. Larvae, immature adults, and mature adults were recovered by peritoneal flush using a terminal procedure. Brugia malayi embryos were collected from live gravid females as previously described (117).

Generation of immature and mature microfilariae required the surgical transplantation of adult B. malayi into the peritoneal cavity of recipient gerbils (234). Microfilariae were isolated either through a peritoneal tap (235) or by terminal worm recovery. For terminal worm recoveries, gerbils were euthanized by CO2 asphyxiation followed by cervical

dislocation. The peritoneal cavities were then opened and soaked with RPMI + P/S. All

animal care and use protocols were approved by the UWO IACUC. Nematode

preparations were flash frozen in liquid nitrogen and stored at -80°C prior to RNA

isolation.

Sample preparation--Mosquitoes

The Aedes aegypti black-eyed Liverpool strain was obtained from FR3 and maintained in

the Biosafety Level 2 Insectary at the University of Wisconsin Oshkosh. Desiccated

mosquito eggs were hatched in deoxygenated water and the resulting larvae were

maintained on a slurry of ground TetraMin fish food (Blacksburg, VA) at 27°C with 80%

relative humidity. Female pupae were separated using a commercial larva pupa separator

(The John Hock Company, Gainesville, FL), placed in mesh-covered paper soup cartons,

112

and enclosed adults were fed with cotton pads soaked in sucrose solution. Females were

deprived of sucrose approximately twelve hours prior to blood feeding. Mosquitoes were

infected with the Brugia malayi FR3 strain by feeding microfilaremic cat blood (FR3) through Parafilm via a glass-jacketed artificial membrane feeder. When necessary microfilaremic cat blood was diluted with uninfected dog blood to achieve a suitable parasite density for infection (100-250 microfilariae/20 μL). Infected mosquitoes were collected at 18 hours post infection (hpi), 4 dpi, and 8 dpi. They were anesthetized with

CO2 then transferred to chilled microscope slides where the legs, wings, heads and abdomens were removed. The thoraces with and without developing B. malayi larvae

were flash frozen in liquid nitrogen and stored at -80 °C prior to RNA isolation. Vector-

derived Infective B. malayi L3 were isolated from whole mosquitoes in bulk using the

standard FR3 protocol (233).

RNA isolation

Mosquito thoraces were combined with TRIzol (Zymo Research, Irvine, CA) at a ratio of

1 mL TRIzol per 50-100 mg mosquito tissue, while nematode samples were processed using a 3:1 volume ratio of TRIzol to sample. For both preparations, 1 µL β-

mercaptoethanol was added for every 100 µL of sample. The tissues were homogenized

using a bead beater and a TissueLyser (Qiagen, Germantown, MD) at 50 Hz for 5 min.

The homogenate was then transferred to a new tube and centrifuged at 12,000g for 10

min at 4°C. After incubation at room temperature for 5 minutes, 0.2 mL chloroform was

added for every 1 mL TRIzol. The samples were shaken by hand for 15 seconds,

incubated at room temperature for 3 min, loaded into a pre-spun Phase Lock Gel heavy

tube (5Prime, Gaithersburg, MD), and centrifuged at 12,000g for 5 min at 4°C. The upper 113

phase was extracted, one volume of 100% ethanol was added, and then loaded onto a

PureLink RNA Mini column (Ambion, Austin, TX). The samples were processed

following manufacturer instructions, quantified using a Qubit fluorometer (Qiagen,

Germantown, MD) and Nanodrop spectrometer (Nanodrop, Wilmington, DE), and sent to

Institute for Genomic Sciences at the University of Maryland Baltimore for DNase

treatment and library preparation.

Library preparation, capture system design, and stranded RNA sequencing

Transcriptome sequencing was conducted as previously described (236). Briefly, whole transcriptome libraries without AgSS capture were constructed for sequencing on the

Illumina platform using the NEBNext Ultra Directional RNA Library Prep Kit (New

England Biolabs, Ipswich, MA, USA). For targeting eukaryotic mRNA, polyadenylated

RNA was isolated using the NEBNext Poly(A) mRNA magnetic isolation module. When

targeting bacterial mRNA, samples underwent rRNA- and poly(A)-reductions, as previously described (146, 149). SPRIselect reagent (Beckman Coulter Genomics,

Danvers, MA, USA) was used for cDNA purification between enzymatic reactions and size selection. The PCR amplification step was performed with primers containing a 7-nt index sequence. Libraries were evaluated using the GX touch capillary electrophoresis system (Perkin Elmer, Waltham, MA) and sequenced on an Illumina HiSeq2500, generating 100-bp paired end reads.

For the B. malayi and wBm AgSS capture designs, probes were designed using the

SureSelect DNA Advanced Design Wizard for B. malayi and wBm (Chapter 2) (183).

Probes were designed to capture every 120 bp for each coding sequence in both

114

organisms with no overlap between baits. DustMasker was used to identify and mask low

complexity regions of the genome from the probe design (154).

For samples treated with the wBm AgSS capture, pre-capture libraries were constructed

from 500-1000 ng of total RNA samples using the NEBNext Ultra Directional RNA

Library Prep kit (NEB, Ipswich, MA, USA). First strand cDNA was synthesized without

mRNA extraction to retain non-polyadenylated transcripts and fragmented at 94 oC for 8 min. After adaptor ligation, cDNA fragments were amplified with 10 cycles of PCR before capture. Wolbachia transcripts were captured from 200 ng of the amplified libraries using the Agilent SureSelectXT RNA (0.5-2 Mbp) bait library designed

specifically for wBm. Library-bait hybridization reactions were incubated at 65 °C for 24 h then bound to MyOne Streptavidin T1 dynabeads (Invitrogen, Carlsbad, CA, USA).

After multiple washes, bead-bound captured library fragments were amplified with 18

cycles of PCR. The libraries were loaded on an Illumina HiSeq4000, generating 151-bp

paired end reads.

For samples treated with the B. malayi AgSS capture, pre-capture libraries were constructed from 1000 ng of total RNA samples using the NEBNext Ultra Directional

RNA Library Prep kit (NEB #E7420, Ipswich, MA, USA). After adaptor ligation, cDNA fragments were amplified for 10 cycles of PCR before capture. B. malayi transcripts were captured from 200 ng of the amplified libraries using an Agilent SureSelectXT Custom

(12-24 Mbp) bait library designed specifically for B. malayi. Library-bait hybridization reactions were incubated at 65 °C for 24 h then bound to MyOne Streptavidin T1

Dynabeads (Invitrogen, Carlsbad, CA). After multiple rounds of washes, bead-bound

115

captured library fragments were amplified with 16 cycles of PCR. The libraries were

loaded on a HiSeq4000 generating 151-bp paired end reads.

Read alignments, feature counting, and differential expression analysis

For all samples used in the analysis, sequencing reads were mapped to the B. malayi

genome WS259 (www.wormbase.org) using the splice junction mapper TopHat v1.4 4

(155) and the wBm genome (70) using the Bowtie v0.12.9 aligner (157). All vector

samples were aligned to the A. aegypti Liverpool strain genome AaegL3.3

(www.vectorbase.org) using the TopHat v1.4 aligner. Feature counts for the B. malayi

and A. aegypti alignments were calculated using the union mode of HTSeq v0.5.3p9

(130) for all exon features. Feature counts for wBm were calculated using the prokaryotic

feature counter FADU (Chapter 3) (237). Using RNAmmer v1.2 (156), we identified five

protein-coding genes in the B. malayi WS259 annotation that overlapped with predicted

rRNAs: WBGene00228061, WBGene00268654, WBGene00268655, WBGene00268656,

and WBGene00268657. These five genes were excluded from differential expression

analyses.

Differentially expressed genes across all samples were identified for each organism

individually by processing the read counts for all protein-coding genes using edgeR

v3.20.1 (238). Within edgeR, read counts were normalized and filtered so that only genes

with a counts per million (CPM) value of >2 in at least two samples would be kept. The

qCML common dispersion and tagwise dispersions were estimated for the counts of each

organism and fit to a generalized linear model. Differentially expressed genes were

116 determined using a quasi-likelihood F-test and defined as genes with a p-value and FDR

< 0.05.

WGCNA module detection and functional term enrichments

Read counts were converted to transcript per million (TPM) values to normalize the gene expression across samples. For each organism individually, the TPM values for differentially expressed genes were processed using the R package WGCNA v1.61 (184) to identify expression modules across the B. malayi lifecycle. For the soft thresholds for each expression subset, values of 6, 6, and 7 were chosen for B. malayi, wBm, and A. aegypti, respectively. Modules were derived from each organism’s dataset by hierarchically clustering genes based on dissimilarity in a topological overlap matrix and using a dynamic tree cut at a height that encompasses 99% of the truncated height range in the observed dendrogram (minimum cluster size: 1). Closely related modules were merged using a merge eigengene dissimilarity threshold of 0.25. Each individual expression module was further divided into two clusters depending on whether the expression pattern of a gene has a higher Pearson correlation to the eigengene or the inverse eigengene (185).

InterPro descriptions and gene ontology (GO) terms for each of the protein-coding genes in the A. aegypti, B. malayi, and wBm genomes were obtained using InterProScan v5.22.61.0 (239). Over-represented and under-represented terms for each WGCNA module were defined as terms with a Fisher’s exact test p-value <0.05 after applying a

Bonferroni correction.

117

B. malayi response to JQ1(+)

Adult female B. malayi worms for testing JQ1(+) were supplied by TRS Laboratories

(Athens, GA, USA) in RPMI culture medium. The experiment was conducted under a

laminar flow hood sprayed with 70 % ethanol before use. Upon arrival, worms were

placed in pre-warmed 37 °C sterile RPMI 1640 medium (Hyclone, Logan, UT, USA) supplemented with glucose (5 g/L), 10% fetal bovine serum, and antibiotic-antimycotic-

glutamine solution containing 10,000 units penicillin, 10 mg streptomycin and 25 μg

Amphotericin B per mL (Sigma-Aldrich, MO, USA) and incubated for 3-4 h at 37 °C in

5% CO2.

After acclimation, worms were exposed to different drug concentrations obtained from

stock solutions of 10 mM IVM or 100 mM JQ1 dissolved in 100% DMSO. Worms were

transferred to 12-well plates (2 worms per well, 2 well replicates per treatment) with each

well containing 990 µL of the glucose, serum, and antibiotic supplemented RPMI

medium and 10 µL of either JQ1(+), JQ1(-), IVM and/or DMSO to obtain the final

concentration tested, such that all wells had a final concentration of 1% DMSO. Worms

were kept at 37 °C in 5 % CO2 and their viability was assessed at 6 and 24 h for each

treatment using worm motility (228), microfilariae release, and the MTT reduction assay

at an absorbance of 530 nm (229). The MTT reduction assay relies on the ability of living organisms to reduce tetrazolium MTT (3-(4, 5-dimethylthiazolyl-2)-2, 5- diphenyltetrazolium bromide) to formazan. Formazan is then dissolved in DMSO and its absorbance at 530 nm is assessed. Adult female B. malayi were treated with 1, 2.5, 5, 7.5,

10, and 100 µM of the JQ1(+) drug and JQ1(-), its chiral enantiomer, in vitro to assess the effect of BET inhibition on their viability. This was compared to the positive control 118

100 µM ivermectin, a known antihelmintic drug, and DMSO as a vehicle control.

Motility scores, microfilariae counts, and absorbance between JQ1(+)-treated worms were compared to DMSO using an unpaired Student’s t-test.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

The data set(s) supporting the results of this article are available in the Sequence Read

Archive (SRA) repository. The A. aegypti and B. malayi sequencing reads are available in

SRP068692 and the wBm sequencing reads are available in SRP068711.

Competing interests

The authors declare that they have no competing interests.

Funding

This project was funded by federal funds from the National Institute of Allergy and

Infectious Diseases, National Institutes of Health, Department of Health and Human

Services under grant number U19 AI110820. The funder had no rule in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

119

Authors’ Contributions

MC, LT, SL, JMF, MLM, and JCDH wrote portions of the manuscript. All authors read and edited the manuscript. LT and MLM generated nematode and mosquito RNA samples. SL, NK, and REB conducted molecular biology experiments. LS and LJT oversaw sequencing. REB managed the sequencing data. MC conducted the

transcriptomics analysis. DO and AM generated the gEAR instances to data access. MC

and REB generated figures. MC and SL conducted the in vitro drug studies. JMF, MLM,

and JCDH designed and oversaw the experiments.

Acknowledgements

We would like to thank Zachary Williams at UWO for generating mature and immature

microfilariae. We would like to thank Dr. Ronna Hertzano and Joshua Orvis at UMSOM

for their seminal work on the gEAR platform, which we co-opt to provide community

access to this transcriptome data.

120

Chapter 5: Draft genome sequence of the Wolbachia endosymbiont of Wuchereria bancrofti wWb 1

Abstract

The draft genome assembly of the Wolbachia endosymbiont of Wuchereria bancrofti

(wWb) consists of 1,060,850 bp in 100 contigs and contains 961 ORFs, with a single copy of the 5S rRNA, 16S rRNA and 23S rRNA and each of the 34 tRNA genes.

Phylogenetic core genome analyses show wWb to cluster with other strains in supergroup

D of the Wolbachia phylogeny, while being most closely related to the Wolbachia endosymbiont of Brugia malayi strain TRS (wBm). The wWb and wBm genomes share

779 orthologous clusters with wWb having 101 unclustered genes and wBm having 23 unclustered genes. The higher number of unclustered genes in the wWb genome likely

reflects the fragmentation of the draft genome.

Keywords

Wolbachia, lymphatic filariasis, nematode, endosymbiont, genomics, Wuchereria bancrofti

Introduction

Lymphatic filariasis afflicts 120 million individuals worldwide. Wuchereria bancrofti,

Brugia malayi and Brugia.timori∼ cause human lymphatic filariasis, with W. bancrofti

being responsible for >90% of cases (8). Most filarial nematodes have an obligate bacterial Wolbachia endosymbiont that is required for the proper development and

1 Chung M, Small ST, Serre D, Zimmerman PA, Dunning Hotopp JC. Draft genome sequence of the Wolbachia endosymbiont of Wuchereria bancrofti wWb. Pathog Dis. 2017;75(9):ftx115. 121

reproduction of the nematode (52). Within the group of Wolbachia endosymbionts

originating from lymphatic filarial worms, the only sequenced, full-length genome is that

of the Wolbachia endosymbiont of B. malayi (wBm) (70). While there is an existing sequenced genome available for the Wolbachia endosymbiont of W. bancrofti, that assembly consists of 763 contigs (93), which equates to 1 gene per contig. Here, we present an independently sequenced and improved draft ∼genome sequence of the

Wolbachia endosymbiont of Wuchereria bancrofti (wWb).

Materials and Methods

Genome sequencing, assembly and annotation

The wWb sequences used were obtained during whole-genome sequencing of

Wuchereria bancrofti, taken from Patient 0022 at the sampling location of Tau, Papua

New Guinea: GPS coordinates −3.666163, 142.766774 (240). Wolbachia contigs were

identified by aligning to the wBm genome using MUMmer v3.0 (241), discarding contigs

with <80% identity across <50% of their length (70). Reads mapping to these putative

Wolbachia contigs were identified using Bowtie2 (159), extracted (240) and used to

construct a new de novo assembly using SPAdes v3.6.2 (242). The process was repeated

iteratively until no further contigs were added to the assembly. The contigs from the de

novo assembly were then reordered using Mauve (243), with the wBm genome as the reference. The final assembly consists of 100 scaffolds >500 bp with a scaffold N50 of 19

998 bp. GLIMMER v3.02 (244) and the IGS Prokaryotic Annotation Pipeline (245) were used to annotate the wWb assembly.

122

DNA sequencing reads for BioProject PRJNA275548 have been deposited at NCBI SRA:

SRP056161. The whole-genome shotgun project for wWb has been deposited at

DDBJ/ENA/GenBank under the accession NJBR00000000. The version described in this

paper is version NJBR02000000. The corresponding whole-genome shotgun project for

W. bancrofti is available at DDBJ/EMBL/GenBank under the accession LAQH00000000.

Phylogenetic and comparative genomic analyses

Mugsy v1.2 (246) and MOTHUR v1.22 (247) were used to construct a core genome

alignment of wWb and 13 other Wolbachia strains, spanning members from five of the

Wolbachia supergroups (Table 5.1) (69, 70, 111, 248-255). RAxML v7.3 (256) was used to construct a maximum-likelihood phylogenetic tree (bootstrap number = 1000, substitution mode = GTRGAMMA, default for all other settings) from the core genome alignment. Similarly, a core genome alignment was constructed with wWb and its closest related Wolbachia strain wBm. NUCmer v3.06 and MUMmerplot v3.5 (241) were used to produce and visualize a synteny plot between wWb and wBm, respectively. Although

the wWb contigs were ordered and oriented to the wBm assembly, the mummer plot enables us to visualize any chromosomal rearrangements within a contig. However, rearrangements within the 100 gaps between the contigs cannot be visualized. For the comparative genome analyses, Mugsy clusters (246) were used to assign all proteins from wWb and wBm to orthologous clusters. A Sybil instance (257) was used to identify shared and unique genes between the two strains, along with pseudogenes in wWb.

123

Identification of lower confidence positions in wWb

Table 5.1: Wolbachia genomes used for phylogenetic analysis Genbank Strain Host Supergroup Accession / BioProject ID wAu Drosophila simulans A LK055284 Drosophila wRi A CP001391 melanogaster Drosophila wMel A AE017196 melanogaster wHa Drosophila simulans A CP003884 wNo Drosophila simulans B CP003883 wPip_Pel Culex quinquefasciatus B AM999887 Trichogramma wTpre B CM003641 pretiosum wOo Onchocerca ochengi C HE660029 wOv Onchocerca volvulus C HG810405 wDi_Pavia Dirofilaria immitis C PRJEB4154 wBm Brugia malayi D AE017321 Litomosoides wLs D PRJEB4155 sigmodontis wWb Wuchereria bancrofti D NJBR00000000 wCle Cimex lectularius F AP013028

Putative low-confidence positions in the wWb assembly were assessed using high- sequencing depth and high-sequence variation. To measure sequencing depth, reads were aligned to the wWb genome with Bowtie2 (159), PCR duplicates were removed with

PicardTools (http://broadinstitute.github.io/picard), and depth was measured using the depth function of SAMtools v1.1 (158). Regions with elevated sequencing depth were defined as all ≥50 bp stretches with a sequencing depth of ≥4 median absolute deviations from the major mode of the sequencing depth (43.72x) (Figure 5.1). To validate this

LGT cutoff method, the sequencing depth thresholds derived from previous studies (258-

124

260) were re-examined using this method and were found to be within 1% of the original

published cutoff (Figure 5.2).

Figure 5.1: Sequencing depth distribution of the wWb assembly. This histogram depicts the sequencing depth distribution of the wWb genome assembly with the x-axis indicating sequence depth and the y-axis depicting the number of genomic positions (x1000) at each depth. The red dashed line indicates the major mode sequencing depth (20x) while the blue dashed line indicates our sequencing depth cutoff of four median absolute deviations from the major mode sequencing depth (43.72x).All points to the right of the blue dashed line were marked as regions with abnormally high sequencing depth, indicative of potential LGT regions.

To assess regions with high-sequence variation, 5423 variant positions were identified as

having ≥20x sequencing depth with at least one alternative base call that consisted of

>5% of the reads at the position. In order to find regions of the genome with higher

sequence variation, the percentage of variant positions in 50-bp sliding windows was

calculated throughout the entire assembly. Regions with high-sequence variation were defined as 50-bp windows with >12.73% (4x average absolute deviations) variant positions (Figure 5.3). ANNOVAR (261) was used to assess possible frameshifts caused by variant base calls in the low-confidence regions.

125

Results and Discussions

This wWb draft genome consists of 1,060,850 bp with an average G + C content of

34.3% in 100 contigs (maximum length = 59 950 bp; average length = 10 609 bp; average sequencing depth = 35x, major mode sequencing depth = 20x). Using a 582,455-

bp core genome alignment from the wWb genome and 13 other Wolbachia genomes, with

members from 5 of the Wolbachia supergroups (Table 5.1), we created a maximum-

likelihood phylogenetic tree (Figure 5.4A) that places wWb within Wolbachia

supergroup D subset, with wLs and wBm, while being most closely related to wBm.

Figure 5.2: Validation of four median absolute deviations (MADs) from the major mode as a LGT cutoff. The LGT cutoff for sequencing depth was defined as four median absolute deviations (MADs) from the mode. Each histogram depicts the sequencing depth on the x-axis and the number of genomic positions (x100) at each sequencing depth is shown on the y-axis. The three dashed lines indicate the major mode (red), the sequencing depth cutoff used in the study (blue), and the cutoff of four median absolute deviations (MADs) from the mode (orange).In (a) the sequencing depth cutoffs differ by <1.2x (threshold is 0.7% greater in study), while in (b) the sequencing depth cutoffs differ by <500x (threshold is 0.8% less in study). Additionally, a core genome alignment between only wWb and wBm reveals a sequence

identity of 96.9%, differing by 31 907 SNPs in the 1,046,453 bp genome shared between

them, which comprises 98.2% and 96.9% of their total genome lengths, respectively.

Synteny between the 100 contigs of wWb and the wBm genome was assessed using 126

NUCmer v3.06 and visualized with MUMmerplot v3.5 (Figure 5.4B) (262). The synteny plot shows that the wWb contigs are largely syntenic to the wBm genome. However, given that Wolbachia endosymbionts have one circular chromosome and the assembly has 100 gaps, there is the potential for synteny changes in these gaps. Furthermore, synteny changes are more likely to occur between similar sequences in a genome, such as duplicated genes, which can result in gaps in the assembly. Therefore, synteny analysis in any draft genome has limitations.

Figure 5.3: Identification of high sequence variation regions. For each position in the wWb genome, the percentage of variant positions in the surrounding 50 bp was determined and plotted as a histogram. The x-axis of the histogram indicates, for each base pair, the percentage of variant positions in its surrounding 50 bp. The y-axis for the histogram denotes the number of genomic positions (x1000) at each variant position percentage. Regions with high sequence variation were defined as positions with a percentage of variant positions >4x average absolute deviations from the major mode (red) (12.73%).Using this cutoff, 3,144 genomic positions across 28 contigs were identified. The wWb genome contains 961 ORFs, one copy of each of the 5S rRNA, 16S rRNA and

23S rRNA as well as one copy of each of the 34 tRNA genes. Given that GLIMMER is known to inflate the number of small ORFs in a genome, we removed all ORFs <60 aa and all ORFs coding for hypothetical proteins <100 aa with no ortholog in wBm (263).

127

Figure 5.4: wWb phylogeny and synteny. A RAxML maximum-likelihood phylogenetic tree of 14 Wolbachia genomes was constructed based on a 582 455-bp core genome alignment using 1000 bootstraps. The five Wolbachia supergroups present in the core genome alignment are denoted by the circles (red, supergroup A; blue, supergroup B; green, supergroup C; orange, supergroup D; and violet, supergroup F). The wWb genome clusters with the genomes of other strains of supergroup D, wBm and wLs, while being most closely related to wBm. (B) Synteny between wWb and wBm was compared using NUCMER. Red lines with a slope of 1 are indicative of conserved regions between the two strains, while blue lines with a slope of –1 are indicative of inverted conserved regions. The black dotted horizontal lines represent the boundaries of each of the 100 contigs of wWb. The contigs of wWb cover the entirety of the wBm genome apart from the 100 small breaks between the wWb contigs. While only four small inversions were identified, it is important to consider that more such inversions may occur in the physical gaps between the 100 contigs.

Using Sybil (257) to visualize and interrogate orthologous proteins between wWb and

128

Figure 5.5: Circos plot of NUCmer linkages between W. bancrofti and wWb, wWb sequencing depth, and wWb SNPs and indels. The innermost ring illustrates the concatenated wWb contigs delineated by tick marks (orange) alongside the concatenated W. bancrofti contigs (black). The W. bancrofti contigs are scaled to 1/1000 the size of wWb contigs and are not delineated by tick marks for visualization purposes, given that there are 5105 W. bancrofti contigs. The orange links between the wWb and W. bancrofti genomes are indicative of genomic positions present in both the nematode and Wolbachia assemblies as determined using MUMmer. The second track, counting outward from the center, contains an inward-facing histogram that indicates the percentage of variant positions in 100 bp bins (blue). Areas with histogram bars that reach the light blue background are indicative of windows with a percentage of variant positions >4 average absolute deviations from the major mode (>12.73%). The third track, flanked by the two histograms, indicates low-confidence regions in the wWb genome, with black indicating regions that fulfill any of our low-confidence criteria and orange indicating normal regions. The fourth track, and outermost track, shows an outward-facing, log2-transformed sequencing depth histogram in 100 bp bins. All positions with <20x sequencing depth are depicted in white, while positions with ≥20x sequencing depth are depicted in orange. All histogram bins that have >43.72x sequencing depth (4x median absolute deviations from the major mode sequencing depth) are indicated by the light-orange background. wBm, the two genomes were found to share 779 orthologous clusters, with wWb having

129

Figure 5.6: Low confidence positions in the wWb genome identified using sequencing depth, sequence variation, and shared genomic regions between wWb and W. bancrofti. Low confidence regions in the wWb genome were assessed using three independent criteria: (1) high sequencing depth defined as >4x median absolute deviations from the major mode of the sequencing depth (43.72x), (2) high sequence variation defined as positions with >4x average absolute deviations from the major mode surrounding variant positions (12.73%), and (3) shared genomic regions between wWb and W. bancrofti assessed using NUCmer. In total, 12,119 positions were supported by two or more criteria with 738 positions being supported by all three criteria.

101 unclustered genes and wBm having 23 unclustered genes. While 20 of the 23 unclustered genes in wBm were identified as hypothetical proteins, the other 3 genes were found to code for a RadC-like DNA repair protein, an ankyrin repeat-containing protein and elongation factor Tu. All three of these latter genes are duplicated in wBm.

Since the wWb genome is incomplete and the library insert size is less than the length of these genes, the assembly is likely to have collapsed in these regions with identical genes being assembled together in one contig instead of separately. Therefore, these genes should not be considered unique to wBm, thus highlighting one of the many nuances of orthologous gene predictions in draft genomes. In the wWb genome, we identified 10

130 unique ORFs that coded for proteins ≥200 aa, including a bacterial type II and III secretion system protein, 3-dehydroquinate synthase and a pyridoxamine 5΄-phosphate synthase. However, differences in the annotation methods for the wBm and wWb genomes could negatively impact the calculation of orthologs between the two organisms. Additionally, the wWb genome has numerous pseudogenes that will need to be assessed in future research; these could be of interest or could be an artifact in the assembly from inclusion of reads from Wolbachia-Wuchereria lateral gene transfers

(LGTs), a Wolbachia sequencing dilemma (260).

Due to the widespread occurrence of Wolbachia-nematode LGT events and the possibility of collapsed repeats in the assembly, we sought to identify lower confidence regions in the wWb genome, where the sequence supports some sequence variation based on three criteria: sequencing depth, sequence variation and the presence of the sequence in the W. bancrofti assembly indicative of a putative LGT. Regions with abnormally high-sequencing depth were defined as ≥50 bp stretches with a sequencing depth of ≥4 median absolute deviation from the major mode of the sequencing depth (43.72x), while regions with high-sequence variation were defined as 50-bp windows with ≥12.73% (4x average absolute deviations from the major mode) variant positions. A total of 75,702 and 3,144 positions were identified using these criteria respectively, and an additional

26,832 positions were identified as shared between the wWb and W. bancrofti assemblies. Integrating all three criteria, a total of 92,821 low-confidence genome sequence positions (8.75% of the wWb genome) spanning 69 contigs were identified, with 12,119 positions being supported by two criteria and 738 positions being supported by all three (Figure 5.5, 5.6). Such regions could indicate (i) Wolbachia-Wuchereria 131

LGT, (ii) collapsed repeats, (iii) population-level variation in the endosymbiont since

multiple nematodes were sequenced or (iv) some combination thereof.

To determine whether or not alternative base calls in low-confidence regions could have possibly altered the consensus base call in the wWb assembly, we sought to identified

1974 variant positions with ≥20× sequencing depth and <90% of reads supporting the consensus base call. Of these positions, alternative base calls with >5% read support were identified and analyzed with ANNOVAR (261) to determine whether these alternative base calls resulted in the possibility of a frameshifted gene call. Using this method, alternative base calls can be differentiated from sequencing errors since this requires at least 1 read to support the alternative base call. Within these 1974 positions, 2234 variant calls were identified with 1891 being SNPs (993 transitions and 898 transversions) and

343 being indels/substitutions. A total of 1335 variants were found in genic regions, with

182 of the variants having the potential to generate a frameshift within gene calls (stop gains, stop losses, frameshift insertions and frameshift deletions) across 67 genes. We also identified 3449 variant positions located outside of the low-confidence regions.

Despite our ability to identify these variants, we have no means of determining the source of the sequence variation.

Summary

The sequencing and characterization of the wWb genome adds more insight on the evolutionary relationships between the different Wolbachia supergroups, specifically supergroup D. The addition of another supergroup D Wolbachia genome should aid in future studies delineating core Wolbachia supergroup D genome characteristics.

132

However, we continue to demonstrate that the presence of LGTs in the nematode genome has the potential to confound the accurate sequencing of Wolbachia endosymbiont genomes.

Funding

This work was funded by the National Institutes of Health (AI103263) to PAZ and the

National Institute of Allergy and Infectious Diseases (U19AI110820) to JCDH. STS received support through a T32 Postdoctoral Fellowship in Geographic Medicine and

Infectious Disease (AI007024).

Supplementary Data

Supplementary data are available at FEMSPD online at https://academic.oup.com/femspd/article- lookup/doi/10.1093/femspd/ftx115#supplementary-data.

Authors’ Contribution

STS, DS and PAZ conceived the study. STS sequenced and assembled the genome and edited the manuscript. MC annotated the genome, conducted analyses and generated figures. MC and JCDH drafted the manuscript. All authors read and approved the final manuscript.

Conflict of Interest. None declared.

133

Chapter 6: Using core genome alignments to assign bacterial species 1

Abstract

With the exponential increase in the number of bacterial taxa with genome sequence data, a new standardized method to assign species designations is needed that is consistent with classically obtained taxonomic analyses. This is particularly acute for unculturable, obligate intracellular bacteria with which classically defined methods, like DNA-DNA hybridization, cannot be used, such as those in the Rickettsiales. In this study, we generated nucleotide-based core genome alignments for a wide range of genera with classically defined species, as well as those within the Rickettsiales. We created a workflow that uses the length, sequence identity, and phylogenetic relationships inferred from core genome alignments to assign genus and species designations that recapitulate classically obtained results. Using this method, most classically defined bacterial genera have a core genome alignment that is ≥10% of the average input genome length. Both

Anaplasma and Neorickettsia fail to meet this criterion, indicating that the taxonomy of these genera should be reexamined. Consistently, genomes from organisms with the same species epithet have ≥96.8% identity of their core genome alignments. Additionally, these core genome alignments can be used to generate phylogenomic trees to identify monophyletic clades that define species and neighbor-network trees to assess recombination across different taxa. By these criteria, Wolbachia organisms are delineated into species different from the currently used supergroup designations, while

Rickettsia organisms are delineated into 9 distinct species, compared to the current 27

1 Chung M, Munro JB, Tettelin H, Dunning Hotopp JC. Using core genome alignments to assign bacterial species. mSystems. 2018;3(6):e00236-18. 134

species. By using core genome alignments to assign taxonomic designations, we aim to

provide a high-resolution, robust method to guide bacterial nomenclature that is aligned

with classically obtained results.

Importance

With the increasing availability of genome sequences, we sought to develop and apply a

robust, portable, and high-resolution method for the assignment of genera and species

designations that can recapitulate classically defined taxonomic designations. Using

cutoffs derived from the lengths and sequence identities of core genome alignments along

with phylogenetic analyses, we sought to evaluate or reevaluate genus- and species-level

designations for diverse taxa, with an emphasis on the order Rickettsiales, where species

designations have been applied inconsistently. Our results indicate that the Rickettsia

genus has an overabundance of species designations, that the current Anaplasma and

Neorickettsia genus designations are both too broad and need to be divided, and that there

are clear demarcations of Wolbachia species that do not align precisely with the existing supergroup designations.

Keywords

Anaplasma, Rickettsia, Rickettsiales, Wolbachia, bacterial taxonomy, core genome alignment, genomics, species concept

Introduction

While acknowledging the disdain that some scientists have for taxonomy, Stephen Jay

Gould frequently highlighted in his writings how classifications arising from a good taxonomy both reflect and direct our thinking, stating, “the way we order reflects the way 135

we think. Historical changes in classification are the fossilized indicators of conceptual

revolutions” (264). Historically, bacterial species delimitation relied on phenotypic,

morphologic, and chemotaxonomic characterizations (265-267). The 1960s saw the

introduction of molecular techniques in bacterial species delimitation through the use of

GC content (268), DNA-DNA hybridization (DDH) (269), and 16S rRNA sequencing

(270, 271). Currently, databases like SILVA (272) and Greengenes (273) use 16S rRNA sequencing to identify bacteria. However, 16S rRNA sequencing often fails to separate closely related taxa, and its utility for species-level identification is questionable (273-

275). Multi-locus sequence analysis (MLSA) has also been used to delineate species

(276), as has the phylogenetic analysis of both rRNA and protein-coding genes (266, 277,

278). Nongenomic mass spectrometry-based approaches, in which expressed proteins and peptides are characterized, provide complementary data to phenotypic and genomic species delimitations (279, 280) and are used in clinical microbiology laboratories.

However, DDH remains the “gold standard” of defining bacterial species (281, 282), despite the intensive labor involved and its inability to be applied to nonculturable organisms. A new genome-based bacterial species definition is attractive, given the increasing availability of bacterial genomes, rapid sequencing improvements with decreasing sequencing costs, and data standards and databases that enable data sharing.

Average nucleotide identity (ANI) and digital DDH (dDDH) were developed as genomic- era tools that allow for bacterial species classification with a high correlation to results obtained using wet lab DDH, while bypassing the associated difficulties (283-285). For

ANI calculations, the genome of the query organism is split into 1-kbp fragments, which are then searched against the whole genome of a reference organism. The average 136 sequence identity of all matches having >60% overall sequence identity over >70% of their length is defined as the ANI between the two organisms (281). dDDH uses the sequence similarity of conserved regions between two genomes of interest (241) to calculate genome-to-genome distances. These distances are converted to a dDDH value, which is intended to be analogous to DDH values obtained using traditional laboratory methods (241). There are three formulas for calculating dDDH values between two genomes using either (i) the length of all matches divided by the total genome length, (ii) the sum of all identities found in matches divided by the overall match length, and (iii) the sum of all identities found in matches divided by the total genome length, with the second formula being recommended for assigning species designations for draft genomes

(286, 287). However, neither ANI nor dDDH reports the total length of fragments that match the reference genome, and problems arise when only a small number of fragments are unknowingly used. Additionally, neither method generates an alignment that can be used for complementary phylogenetic analyses, instead relying solely on strict cutoffs to define species, potentially leading to the formation of paraphyletic taxa.

Some recent phylogenomic analyses have shifted toward using the core proteome, a concatenated alignment constructed using the amino acid sequences of genes shared between the organisms of interest (288). However, differences in annotation that affect gene calls can add an unnecessary variable when deriving evolutionary relationships.

Instead, we propose that such analyses should use nucleotide core genome alignments to infer phylogenetic relationships with a well-defined nucleotide identity threshold to identify where trees should be pruned to define species. Such an analysis would be enabled by genome aligners like Mauve (289) and Mugsy (246), which identify regions 137

of nucleotide identity that are shared between genomes, which are referred to as locally

collinear blocks (LCBs). For any given subset of genomes, a core genome alignment can

be generated by concatenating these LCBs and retaining only positions present in all

genomes. Using the length and sequence identity of the core genome alignment, paired

with a phylogeny generated from this alignment, genus- and species-level taxonomic

assignments can be developed. A core genome alignment-based method provides advantages to its protein-based counterpart in that it is of a higher resolution and independent of annotation, while also being transparent with respect to the data used in the calculations and very amenable to data sharing and deposition in data repositories.

The Rickettsiales are an order within the composed of obligate, intracellular bacteria where classic DNA-DNA hybridization is not possible and bacterial taxonomy is uneven, with each of the genera having its own criteria for assigning genus and species designations. Within the Rickettsiales, there are three major families, the

Anaplasmataceae, Midichloriaceae, and Rickettsiaceae, with an abundance of genomic data being available for genera within the Anaplasmataceae and Rickettsiaceae. However, as noted above, species definitions in these families are inconsistent, and organisms in the

Wolbachia genus lack community-supported species designations. While the Wolbachia community has vigorously discussed nomenclature at each of its biannual meetings over the past 20 years, thus far, the community has supported only supergroup designations.

Proposed ANI- and dDDH-informed Wolbachia species definitions and nomenclature

(68) lack support from the Wolbachia community for a number of reasons (see, e.g., reference (290)).

138

The last reorganization of the Rickettsiaceae taxonomy occurred in 2001 (291), a time when there were <300 sequenced bacterial genomes (292), a limited number of Rickettsia genomes (293, 294), and no Anaplasmataceae genomes (69, 295-297). As of 2014, there were >14,000 bacterial genomes sequenced, and with this increase in available genomic information, more-informed decisions can be made with regard to taxonomic classification (292). By using core genome alignments, we can condense whole genomes into positions shared between the input genomes and use sequence identity to infer phylogenomic relationships. We can identify sequence identity thresholds for these alignments that are consistent with classically defined bacterial species and can be overlaid on phylogenetic trees generated from the same alignment. This combined approach, which we present in this study, enables a phylogenetically informed whole- genome approach to bacterial taxonomy. We apply this approach to organisms in the

Rickettsiales, including Rickettsia, Orientia, Ehrlichia, Neoehrlichia, Anaplasma,

Neorickettsia, and Wolbachia.

Results

Advantages of nucleotide alignments over protein alignments for bacterial species

analyses

While core protein alignments are increasingly used for phylogenetics (298, 299), a core

nucleotide alignment has more phylogenetically informative positions in the absence of

substitution saturation (300), yielding a greater potential for phylogenetic signal.

Nucleotide-based analyses outperform amino acid-based analyses at all time scales in

terms of resolution, branch support, and congruence with independent evidence (301-

303). Nucleotide and protein core genome alignments were constructed for 10 complete 139

Table 6.1: Ten complete Wolbachia genomes used for PhyloPhlAn and nucleotide and protein core genome analyses GenBank/RefSeq Genome Accession or BioProject ID Wolbachia endosymbiont of Drosophila melanogaster (wMel) AE017196.1 Wolbachia endosymbiont of Drosophila simulans (wRi) CP001391.1 Wolbachia endosymbiont of Drosophila simulans (wHa) CP003884.1 Wolbachia endosymbiont of Drosophila simulans (wAu) LK055284.1 Wolbachia endosymbiont of Drosophila simulans (wNo) CP003883.1 Wolbachia endosymbiont of Onchocerca volvulus (wOv) HG810405.1 Wolbachia endosymbiont of Onchocerca ochengi (wOo) HE660029.1 Wolbachia endosymbiont of Brugia malayi (wBm) AE017321.1 Wolbachia endosymbiont of Folsomia candida (wFol) CP015510.1 Wolbachia endosymbiont of Cimex lectularius (wCle) AP013028.1

Wolbachia genomes (Table 6.1), and the resulting phylogenies were compared.There are

a number of whole-genome aligners that can be used to generate nucleotide core genome

alignments (for an overview, see reference (304)). We generated core genome alignments

for the 10 Wolbachia genomes using Mauve and Mugsy, currently two of the most

commonly used whole-genome aligners (246, 289). Despite the different algorithms used

by each of the two programs, the lengths of the generated core genome alignments are

similar, with the Mauve and Mugsy core genome alignments being 682,949 and

579,495 bp in length, respectively. However, because Mugsy has been found to be better

at handling larger genome data sets (246), Mugsy was used for all subsequent core genome alignment analyses discussed. Despite the presence of large-scale genome rearrangements present between the 10 Wolbachia genomes (Figure 6.1), the resulting

alignment is 45.9% of the average input genome length, indicating that collinear genomes

are not necessarily required to construct core genome alignments. The lack of synteny

140

results merely in more and smaller local colinear blocks. Using the same 10 Wolbachia

genomes and the available annotation at NCBI RefSeq, a core protein alignment was

generated using FastOrtho, a reimplementation of the tool OrthoMCL (305). Despite the

presence of different annotation methods used for gene calls, we were able to generate a

protein alignment of 77,868 positions from the concatenated alignments of 152 shared

genes between the 10 genomes using the available protein amino acid sequences. In total,

the core protein alignment contained 16,241 parsimony-informative positions compared

to the 124,074 such positions observed with the core nucleotide alignment, indicating a

10-fold increase in potentially informative positions (Figure 6.2A). When maximum-

likelihood (ML) trees from the core protein alignment and the core nucleotide alignment

are compared, the two trees are quite similar in topology and branch length. However, the

nodes on the ML tree generated using the core nucleotide alignment consistently have

higher bootstrap support values.

Similarly, we compared the Wolbachia core nucleotide alignment to a protein-based

alignment generated using PhyloPhlAn (298). PhyloPhlAn uses a set of 400 of the most

conserved proteins across diverse bacterial taxa to infer phylogenetic relationships.

Across the 10 complete Wolbachia genomes, 176 of these 400 genes were identified to be

present in each of the genomes. A concatenated amino acid alignment of these 176 genes

yields an alignment with 2,877 positions, of which 2,447 are parsimony informative,

indicating a 50-fold decrease in the number of parsimony-informative positions from that of the core nucleotide alignment and an 8-fold decrease relative to that of the core protein alignment. As with the core protein alignment, the ML trees constructed using the

Wolbachia core nucleotide alignment and PhyloPhlAn have similar topologies and

141

Figure 6.1: Synteny between the 10 complete Wolbachia genomes. Syntenies between the 10 complete Wolbachia genomes were compared and visualized using the Artemis Comparative Tool. The red ribbons indicate conserved regions of >3 kbp between two genomes, while the blue ribbons indicate >3-kbp inverted conserved regions. Wolbachia supergroups A (●), B (▲), C (■), D (⬟), E (⬢), and F (⯃) are represented in the genome subset. Shapes of the same color indicate that the multiple genomes are of the same species as determined using our relative branch lengths, with the core nucleotide alignment having higher support values on its nodes (Figure 6.2B). The one difference in clustering occurs within the supergroup 142

A Wolbachia, in which wRi clusters with wHa in the core nucleotide alignment, while

wRi clusters with wMel and wAu in the PhyloPhlAn ML tree. The wRi and wHa branch

is supported by a bootstrap value of 100 in the core nucleotide alignment ML tree, while

the wRi, wMel, and wAu branch is supported by a bootstrap value of 98.9 in the

PhyloPhlAn ML tree.

Collectively, these results demonstrate that while nucleotide and protein alignments give

roughly the same result, there are key differences. Core nucleotide alignments have

almost an order of magnitude more parsimony-informative positions, and this results in

stronger branch support values in the corresponding phylogenies.

Can core genome alignments be constructed from bacteria within a genus?

To compare differences between existing classically defined species designations by this

method of using sequence identity thresholds with whole-genome phylogenies, core

genome alignments were constructed for seven bacterial genera representative of six

different bacterial taxonomic classes (Table 6.2). For these seven genera, which include

Arcobacter, Caulobacter, Erwinia, Neisseria, Polaribacter, Ralstonia, and Thermus, each

of the generated core genome alignments were found to be ≥0.33 Mbp in size,

representative of ≥13.6% of the average input genome size (Table 6.2). For this diverse

group of genera, the core genome alignments comprise a large portion of their input

genomes, indicating that this technique is applicable to a wide range of classically defined bacterial taxa. Additionally, core genome alignments were successfully constructed from the genomes of four of the six Rickettsiales genera, including 69

143

Figure 6.2: Comparison of phylogenomic trees generated using the core nucleotide alignments versus protein-based alignments for 10 complete Wolbachia genomes. A maximum-likelihood phylogenomic tree generated from the core genome alignment (CGA) of 10 complete Wolbachia genomes was compared to a core protein alignment (CPA) containing 152 genes present in only one copy (A) and an alignment generated using PhyloPhlAn, containing 176 conserved proteins (B). Wolbachia supergroups A (●), B (▲), C (■), D (⬟), E (⬢), and F (⯃) are represented in the genome subset. Shapes of the same color indicate that the multiple genomes are of the same species as defined using our determined CGASI cutoff of ≥96.8%. When comparing the ML trees generated using the core nucleotide and core protein alignments, the trees are largely similar in both topology and branch length. In the comparison between the ML trees generated using the core nucleotide alignment and PhyloPhlAn, the trees are similar except for the relationship of wRi, which is sister to wHa in the core nucleotide alignment tree and sister to wAu + wMel in the PhyloPhlAn tree. Despite differences in clustering, the core nucleotide alignment ML tree consistently has higher bootstrap values than its core protein alignment or PhyloPhlAn counterpart.

Rickettsia genomes, 3 Orientia genomes, 16 Ehrlichia genomes, and 23 Wolbachia genomes (Table 6.2). Each of these four core genome alignments are >0.18 Mbp in length and contain >10% of the average size of the input genomes.

144

- 85.7% 85.7% 81.6% 80.1% 98.9% 90.6% 81.7% 96.3% 84.0% 77.2% 80.1% 78.9% 82.5% 78.8% 80.5% 77.9% 82.2% 80.9% CGASI Minimum

- 55 31 1,423 1,661 2,339 LCBs 36,667 12,101 33,010 14,193 14,146 34,089 80,702 17,290 64,633 20,905 27,112 13,251 31,062 Identified

- 2.3% 8.9% 1.4% 42.4% 14.8% 43.9% 18.9% 13.6% 14.7% 22.5% 48.0% 47.5% 87.4% 39.8% 83.3% 65.3% 32.4% 22.8% Percentage Percentage Composition Core Genome Genome Core

- 0.56 0.18 0.54 0.43 0.60 0.33 0.80 1.10 0.97 0.02 0.76 0.49 0.11 0.02 1.25 0.77 1.61 1.24 Alignment Size (Mbp) Core Genome Genome Core

1.5 1.32 2.27 4.42 2.24 3.55 2.29 1.22 1.23 2.04 1.27 0.87 0.87 1.23 1.24 1.39 1.18 4.97 5.44 Average Average Genome Size (Mbp)

- - 1 1 3 2 4 5 3 1 2 4 3 27 12 10 10 11 11 Species Species Established Represented

3 1 4 3 69 16 17 30 20 10 23 22 44 26 22 66 24 21 19 Genomes Genomes Analyzed

Class Deinococci Flavobacteria Betaproteobacteria Alphaproteobacteria Alphaproteobacteria Alphaproteobacteria Alphaproteobacteria Alphaproteobacteria Alphaproteobacteria Alphaproteobacteria Alphaproteobacteria

Family Thermaceae Rickettsiaceae Rickettsiaceae Anaplasmataceae Anaplasmataceae Anaplasmataceae Anaplasmataceae Anaplasmataceae Caulobacteraceae Burkholderiaceae Flavobacteriaceae Campylobacteraceae

and Core genome alignmentCore statistics

N. w Ppe :

.2

Ehrlichia

Genus With A. phagocytophilum marginale A. centraleA. Excluding helminthoeca Oregon Excluding

Table 6 Rickettsia Orientia Ehrlichia Neoehrlichia Anaplasma Neorickettsia Wolbachia Arcobacter Caulobacter Erwinia Neisseria Polaribacter Ralstonia Thermus

Substitution saturation 145

A common caveat in nucleotide alignments is the presence of substitution saturation, an

issue in which the phylogenetic distances reported by nucleotide alignments are

underestimated due to an overabundance of reverse mutations within the organism subset

(300). Compared to protein-based methods, substitution saturation more heavily impacts

nucleotide-based phylogenetic distance measurements due to the higher rate of

substitutions per position. However, for each core genome alignment, when the

uncorrected pairwise genetic distances are plotted against the model-corrected distances, linear relationships are observed for all alignments (r2 > 0.995), indicating that

substitution saturation does not hinder the ability of the core genome alignments to

represent evolutionary relationships (306) (Figure 6.3). Given the fact that the analyzed genera span bacterial taxonomic classes, including Alphaproteobacteria,

Betaproteobacteria, Gammaproteobacteria, Epsilonproteobacteria, Flavobacteriia, and

Deinococci (Table 6.2), we expect this result to be broadly applicable to bacterial core genome alignments.

Assessment of genus designations

We found our initial core genome alignments for Anaplasma and Neorickettsia to be considerably shorter than other core genome alignments, both being 20 kbp and accounting for ≤2.3% of the average input genome sizes. The small size∼ of the core genome alignment indicates that the Anaplasma and Neorickettsia genome subsets each

contain a diverse set of genomes that are likely not of the same genus. Using input

genomes from different genera to construct a core genome alignment yields an alignment

146

of an insufficient size to accurately represent the evolutionary distances between the input

genomes. For example, when the genome of Neoehrlichia lotoris is supplemented with

the genomes used to create the 0.49-Mbp Ehrlichia core genome alignment, the resultant

core genome alignment is 0.11 Mbp, representing only 8.9% of the average input genome

size, compared to the prior 39.8%. Therefore, we used subsets of species to test whether

the Anaplasma and Neorickettsia genera are too broadly defined. A core genome

alignment generated using only the 20 A. phagocytophilum genomes in the 30 Anaplasma

genome set is 1.25 Mbp and represents 83.3% of the average input genome size, while a

core genome alignment of the remaining 10 Anaplasma genomes is 0.77 Mbp, 65.3% of

the average genome input size. This result suggests that the Anaplasma genus should be

split into two separate genera. Similarly, when the genome of Neorickettsia helminthoeca

Oregon is excluded from the Neorickettsia core genome alignment, a 0.77-Mbp

Neorickettsia core genome alignment is generated, 87.4% of the average input genome size, suggesting that N. helminthoeca Oregon is not of the same genus as the other three

Neorickettsia genomes. For the remainder of this paper, these genus reclassifications are used. Given these collective results, we recommend that the genus classification level can be defined as a group of genomes that together will yield a core genome alignment that represents ≥10% of the average input genome sizes.

Identifying a CGASI for species delineation

Within the Rickettsia, Orientia, Ehrlichia, Anaplasma, Neorickettsia, Wolbachia,

Arcobacter, Caulobacter, Erwinia, Neisseria, Polaribacter, Ralstonia, and Thermus

genome subsets, ANI, dDDH, and core genome alignment sequence identity (CGASI)

147

Figure 6.3: Assessing substitution saturation for core genome alignments. For each of the 14 core genome alignments that comprise ≥10% of the average input genome size, the uncorrected genetic distance between each of the members was plotted against the Tn69-model corrected genetic distance. The red line represents the best-fit line for each data set, while the black dotted line represents the identity line (y = x). In all cases, the relationship between the two distances are linear (r2 > 0.995), indicating little substitution saturation in the core genome alignments. values were calculated for 7,264 intragenus pairwise genome comparisons, of which 601

148

Figure 6.4: ANI, dDDH, and CGASI correlation analysis. CGASI, ANI, and dDDH values were calculated for 7,264 intragenus pairwise comparisons of genomes for Rickettsia, Orientia, Ehrlichia, Anaplasma, Neorickettsia, Wolbachia, Caulobacter, Erwinia, Neisseria, Polaribacter, Ralstonia, and Thermus. (A) The CGASI and ANI values for the intragenus comparisons follow a second-degree polynomial model (r2 = 0.977), with the ANI species cutoff of ≥95% being equivalent to a CGASI of 96.8%, indicated by the blue dashed box. (B) The CGASI and dDDH values for all pairwise comparisons follow a third-degree polynomial model (r2 = 0.978), with the dDDH species cutoff of ≥70% being equivalent to a CGASI of 97.6%, indicated by the red dashed box. (C) To identify the optimal CGASI cutoff to use when classifying species, for each increment of the CGASI species cutoff plotted on the x axis, the percentage of intraspecies and interspecies comparisons correctly assigned was determined based on classically defined species designations. The ideal cutoff should maximize the prediction of classically defined species for both interspecies and intraspecies comparisons. The ANI-equivalent CGASI species cutoff is represented by the blue dashed line, while the dDDH-equivalent CGASI species cutoff is represented by the red dashed line. are between members of the same species. The ANI and CGASI follow a second-degree polynomial relationship (r2 = 0.977) (Figure 6.4A). Using this model, the ANI species 149

cutoff of ≥95% is analogous to a CGASI cutoff of ≥96.8%. dDDH and CGASI follow a

third-degree polynomial model (r2 = 0.978), with a dDDH of 70% being equivalent to a

CGASI of 97.6%, indicating that the dDDH species cutoff is generally more stringent

than the ANI species cutoff (Figure 6.4B).

The ideal CGASI threshold for species delineation would maximize the prediction of

classically defined species, neither creating nor destroying the majority of the classically

defined species. To examine this, all 7,264 intragenus pairwise comparisons were

classified as either intraspecies or interspecies. Intraspecies comparisons are comparisons

between genomes with the same classically defined species designation, while

interspecies comparisons are between genomes within the same genus but with different

classically defined species designations. Every possible CGASI threshold value was then

tested for the ability to recapitulate these classically defined taxonomic classifications

(Figure 6.4C). In all cases, an abnormally high number of interspecies Rickettsia

comparisons were found above both the established ANI and the dDDH species

thresholds, consistent with previous observations that guidelines for establishing novel

Rickettsia species are too lax (307), and as such they were excluded from this specific analysis. Below a CGASI of 97%, classically defined species begin to be separated, while organisms classically defined as different species begin to be collapsed. This coincides with the above-calculated ANI-equivalent threshold but differs from the above-calculated dDDH equivalent (Figure 6.4C). The dDDH-equivalent CGASI threshold of 97.6% failed to predict the classically defined taxa from 100 intraspecies comparisons while the

ANI-equivalent CGASI threshold of 96.8% failed to predict the classically defined taxa from 41 intraspecies comparisons. 150

Given these results, we selected the ANI-equivalent CGASI value of ≥96.8% to further analyze these taxa. Overall, there were 41 pairwise comparisons of organisms across diverse taxa, including Arcobacter, Caulobacter, Neisseria, Orientia, Ralstonia, and

Thermus, that are classically defined as the same species but would be classified as distinct species when assessed using our suggested CGASI threshold. Additionally, there were 782 pairwise comparisons of organisms classically defined as different species that this threshold suggests should be the same species, of which only 10 were not in the genus Rickettsia, instead belonging to the Arcobacter and Caulobacter genera.

Rickettsiaceae phylogenomic analyses

Rickettsia

The Rickettsiaceae family includes two genera, the Rickettsia and the Orientia, and while both genera are obligate intracellular bacteria, Rickettsia genomes have undergone more reductive evolution, having a genome size ranging from 1.1 to 1.5 Mbp (308) compared to the 2.0- to 2.2-Mbp size of the Orientia genomes (309). Of the Rickettsiales, the

Rickettsia genus contains the greatest number of sequenced genomes and named species, containing 69 genome assemblies in ≤100 contigs representing 27 unique species.

Rickettsia genomes are currently classified based on the Fournier criteria, an MLST approach established in 2003 based on the sequence similarity of five conserved genes: the 16S rRNA gene, citrate synthase gene (gltA), and three surface-exposed protein antigen genes (ompA, ompB, and gene D) (310). To be considered a Rickettsia species, an isolate must have a sequence identity of ≥98.1% to the 16S rRNA and ≥86.5% to gltA in at least one preexisting Rickettsia species. Within the Rickettsia, using ompA, ompB, and gene D sequence identities, the Fournier criteria also support the further classification of 151

Rickettsia species into three groups: the group, the spotted fever group (SFG), and

the ancestral group (310). However, the Fournier criteria have not yet been amended to

classify the more recently established transitional Rickettsia group (311), indicating a need to update the Rickettsia taxonomic scheme.

A total of 69 Rickettsia genomes representative of 27 different established species were used for ANI, dDDH, and core genome alignment analyses. Regardless of the method used, a major reclassification is justified (Figure 6.5). A core genome alignment constructed using the 69 Rickettsia species genomes yielded a core genome alignment size of 0.56 Mbp, 42.4% of the average lengths of the input Rickettsia genomes (Table

6.2). For∼ 42 of the 44 SFG Rickettsia genomes, the CGASI between any two genomes is

≥98.2%, well within the proposed CGASI species cutoff of ≥96.8% (Figure 6.5), while the CGASI is ≤97.2% in the ancestral and typhus groups. If a CGASI cutoff of 96.8% is used to reclassify the Rickettsia species, all but two of the SFG Rickettsia genomes would be classified as the same species (Figure 6.5), with the two remaining SFG Rickettsia genomes, IrR Munich and Rickettsia sp. strain Humboldt, being designated as the same species. This is consistent with ANI results as well, while dDDH yields conflicting results (Figure 6.5). For the transitional group Rickettsia, and would be collapsed into a single species due to having

CGASI values of 97.2% in a comparison with one another. Similarly, Rickettsia asembonensis, , and Rickettsia hoogstraalii, all classified as transitional group Rickettsia, would be collapsed into another species, all having CGASI values of

97.2% in comparisons with one another. This is consistent with a phylogenomic tree

152

Figure 6.5: Analysis of the ANI, dDDH, and CGASI values of 69 Rickettsia genomes. Across 69 Rickettsia genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each pairwise genome comparison. The shape next to each Rickettsia genome represents whether the genome originates from an ancestral (⚫), transitional (⬟), typhus group (▲), or spotted fever group (■) Rickettsia species, while the colors of the shapes on the axes represent species designations as determined by a CGASI cutoff of ≥96.8%. (C) An ML phylogenomic tree with 1,000 bootstraps was generated using the core genome alignment. (D) The relationships in the green box in panel C cannot be adequately visualized at the necessary scale, so they are illustrated separately with a different scale. For both trees, red branches represent branches with <100 bootstrap support. generated from the Rickettsia core genome alignments, where the SFG Rickettsia g

153

Figure 6.6: Analysis of the ANI, dDDH, and CGASI values of three Orientia genomes. For 3 genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each genome comparison and color-coded to illustrate the results with respect to cutoffs of an ANI of ≥95% and a dDDH value of ≥70%. Circles of the same colors next to the names of each genome indicate members of the same species as defined by a CGASI of ≥96.8%. genomes have far less sequence divergence than the rest of the Rickettsia genomes

(Figure 6.5).

154

Orientia

The organisms within Orientia have no standardized criteria to define novel species.

There are far fewer high-quality Orientia genomes than Rickettsia genomes, which is due partly to the large number of repeat elements found in Orientia genomes, with the genome of Orientia tsutsugamushi being the most highly repetitive sequenced bacterial genome to date, with 42% of its genome being comprised of short repetitive sequences and transposable elements∼ (312).

The Orientia core genome alignment was constructed using three O. tsutsugamushi genomes and is 0.97 Mbp in size, 47.6% of the average input genome size, with CGASI values ranging from 96.3 to 97.2%∼ (Figure 6.6). Reclassifying the Orientia genomes using a CGASI species cutoff of 96.8% would result in O. tsutsugamushi Gillliam and O. tsutsugamushi Ikeda being classified as species separate from O. tsutsugamushi Boryong

(Figure 6.6). In this case, this reclassification would not be consistent with recommendations from using ANI or dDDH. We suspect that ANI is strongly influenced by the large number of repeats in the genome due to ANI calculations being based off the sequence identity of 1-kbp query genome fragments. In comparison, we do not anticipate that whole-genome alignments would be confounded by the repeats. While the LCBs may be fragmented by the repeats, creating smaller syntenic blocks, the nonphylogenetically informative repeats are eliminated from an LCB-based analysis.

155

Figure 6.7: Analysis of the ANI, dDDH, and CGASI values of 16 Ehrlichia genomes. For 16 Ehrlichia genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each pairwise genome comparison. The colors of the shapes next to each Ehrlichia genome represent species designations as determined by a CGASI cutoff of ≥96.8%. (C) An ML phylogenomic tree with 1,000 bootstraps was generated using the core genome, with red branches representing branches with <100 bootstrap support.

Anaplasmataceae phylogenomic analyses

156

Ehrlichia

Within the Anaplasmataceae, species designations are frequently assigned based on

sequence similarity and clustering patterns from phylogenetic analyses generated using

the sequences of genes such as 16S rRNA, groEL, and gltA (313). For example, the

species designations for Ehrlichia khabarensis and Ehrlichia ornithorhynchi are justified

based on having a lower sequence similarity for 16S rRNA, groEL, and gltA (313-315).

A core genome alignment constructed using 16 Ehrlichia genomes, representative of four defined species, yields a 0.49-Mbp alignment, which equates to 39.8% of the average

Ehrlichia genome size (1.25 Mbp) (Table 6.2). Using a CGASI species cutoff of 96.8%, the Ehrlichia chaffeensis and Ehrlichia ruminantium genomes were recovered as monophyletic and well-supported species, which is consistent with ANI and dDDH results (Figure 6.7). The two Ehrlichia muris and Ehrlichia sp. strain Wisconsin_h genomes have CGASI values of >97.8%, indicating that the three genomes represent one species, which is consistent with ANI but not dDDH results (Figure 6.7). The genomes of Ehrlichia sp. strain HF and E. canis Jake do not have CGASI values of ≥96.8% with any other species, confirming their status as individual species, identical to the results found with ANI and dDDH (Figure 6.7).

Anaplasma

A novel Anaplasma species is currently defined based on phylogenetic analyses involving 16S rRNA, gltA, and groEL, with a new species having a lower sequence identity and a divergent phylogenetic position relative to established Anaplasma species

157

Figure 6.8: Analysis of the ANI, dDDH, and CGASI values of 30 Anaplasma genomes. (A) For 30 Anaplasma genomes, the ANI and dDDH values were calculated for each genome comparison and color-coded to illustrate the results with respect to ANI cutoffs of ≥95% and dDDH cutoffs of ≥70%. (B) When we attempted to construct a core genome alignment using all 30 Anaplasma genomes, only a 20-kbp alignment was generated, accounting for <1% of the average Anaplasma genome size. Therefore, CGASI values were calculated after the Anaplasma genomes were split into two subsets containing 20 A. phagocytophilum genomes and the remaining 10 Anaplasma genomes. In all panels, the shape next to each genome denotes genus designations as determined by the size of their core genome alignments, while the color of the shape denotes the species as defined by a CGASI cutoff of ≥96.8%.

(316-319). A core genome alignment constructed using 30 Anaplasma genomes yielded a

20-kbp alignment, 1.4% of the average input genome size. As noted before, such low

158 values are indicative of more than one genus being represented in the taxa that were included in the analysis. Thus, CGASI analyses for Anaplasma species were done on two

Anaplasma genome subsets, one containing the 20 Anaplasma phagocytophilum genomes and the other containing 9 Anaplasma marginale genomes and 1 Anaplasma centrale genome (Table 6.2).

A 1.25-Mbp core genome alignment, consisting of 39.8% of the average input genome size, was constructed using the 20 A. phagocytophilum genomes. All 20 genomes have

CGASI values of ≥96.8%, supporting their designation as members of a single species

(Figure 6.8). This is supported by ANI, but dDDH again yields conflicting results

(Figure 6.8). The core genome alignment generated using the remaining 10 Anaplasma genomes yields a 0.77-Mbp core genome alignment, 65.3% of the average input genome size. While the A. marginale genomes all have CGASI values of ≥96.8% in comparisons with one another, the Anaplasma centrale genome has CGASI values ranging from 90.7 to 91.0% compared to the nine A. marginale genomes, supporting A. centrale as a separate species from A. marginale, consistent with existing taxonomy and with the ANI and dDDH species cutoffs (Figure 6.8).

Neorickettsia

The Neorickettsia genus contains four genome assemblies: Neorickettsia helminthoeca

Oregon, Illinois, Miyayama, and

Neorickettsia sp. strain 179522. The genus was first established in 1954 with the discovery of N. helminthoeca (320). In 2001, N. risticii and N. sennetsu, both initially classified as Ehrlichia strains, were added to the Neorickettsia based on phylogenetic

159

analyses of 16S rRNA and groESL (291). A core genome alignment constructed using all four Neorickettsia genomes yields a 20-kbp alignment, 2.3% of the average input genome size. When excluding the genome of N. helminthoeca Oregon, the three remaining

Neorickettsia genomes form a core genome alignment of 0.76 Mbp in size, 87.4% of the

average input genome size. The sequence identity of the core genome alignment indicates

that N. risticii Illinois, N. sennetsu Miyayama, and Neorickettsia sp. 179522 are three

distinct species within the same genus, while the length of the core genome alignment

indicates that N. helminthoeca Oregon is of a separate genus (Figure 6.9). When

assessing the Neorickettsia species designations using ANI and dDDH cutoffs, the four

Neorickettsia genomes can only be determined to be different species, as the two

techniques are unable to delineate phylogenomic relationships at the genus level.

Wolbachia

The current Wolbachia classification system lacks traditional species designations and

instead groups organisms by supergroup designations using an MLST system consisting

of 450- to 500-bp internal fragments of five genes: gatB, coxA, hcpA, ftsZ, and fbpA (66).

A core genome alignment generated using 23 Wolbachia genomes yields a 0.18-Mbp

alignment, amounting to 14.8% of the average input genome size.

Among filarial Wolbachia supergroups C and D, the CGASI cutoff of 96.8% would split

each of the traditionally recognized supergroups into two groups each, which is also

supported by ANI and dDDH. While wOo and wOv would be the same species, wDi

Pavia should be considered a different species. Similarly, wBm and wWb would be

160

Figure 6.9: Analysis of the ANI, dDDH, and CGASI values of 4 Neorickettsia genomes. (A) For the 4 Neorickettsia genomes, the ANI and dDDH values were calculated for each genome comparison and color-coded to illustrate the results with respect to cutoffs of an ANI of ≥95% and a dDDH value of ≥70%. (B) CGASI values were calculated and are illustrated using a core genome alignment that could only be constructed using 3 of the Neorickettsia genomes, excluding N. helminthoeca Oregon. Circles of the sample colors next to the names of each genome indicate members of the same species as defined by a CGASI of ≥96.8% considered the same species, while wLs should be designated a separate species. The wCle, wFol, and wPpe endosymbionts from supergroups E, F, and L, respectively, would all be considered distinct species using CGASI, ANI, or dDDH.

161

The CGASI values between the supergroup A Wolbachia organisms, apart from wInc

SM, have CGASI values of ≥96.8 (Figure 6.10). The genome of wInc SM has CGASI

values ranging from 94.6% to 95.9% compared to other supergroup A Wolbachia

genomes, indicating that if species assignments were dependent solely on a CGASI

cutoff, then wInc SM would be a different species. Because >10% of the positions in the

wInc SM genome are ambiguous nucleotide positions, this leads to a large penalty in

sequence identity scores, as seen with CGASI, ANI, and DDH (Figure 6.10). However, a

phylogenomic tree generated using the Wolbachia core genome alignment indicates that

wInc SM is nested within the other supergroup A Wolbachia taxa, a clade with 100%

bootstrap support (Figure 6.10). Similarly, the results between CGASI, ANI, and dDDH

in supergroup B are discordant. Five of the six supergroup B Wolbachia genomes, all

except for wTpre, have CGASI values of ≥96.8% compared to one of the other five, indicating the five genomes are likely of the same species. However, despite wTpre being considered a different species if the CGASI cutoff of 96.8% is used, a phylogenomic tree constructed using the core genome alignment shows wTpre to be nested within the supergroup B Wolbachia organisms (Figure 6.10).

In both cases, if wInc SM or wTpre were designated distinct species from the other

supergroup A and B Wolbachia, respectively, paraphyletic clades would be created. This

indicates the importance of using a phylogenomic analysis in addition to a threshold. In

this case, the core genome alignment can easily be used with robust phylogenetic

162

Figure 6.10: Analysis of the ANI, dDDH, and CGASI values of 23 Wolbachia genomes. Analysis of the ANI, dDDH, and CGASI values of 23 Wolbachia genomes. For 23 Wolbachia genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each pairwise genome comparison. The shape next to each Wolbachia genome represents supergroup A (⚫), B (▲), C (■), D (⬟), E (⬢), F (⯃), and L (◆) designations, while the color of the shape indicates species as defined by a CGASI cutoff of ≥96.8%. (C) An ML phylogenomic tree was generated using the Wolbachia core genome alignment constructed using 23 Wolbachia genomes, with red branches representing branches with <100 bootstrap support. The species designations in supergroup B show the CGASI-designated species clusters of wPip_Pel, wPip_JHB, wAus, and wStri to be polyphyletic.

163

Figure 6.11: Analysis of the ANI, dDDH, and CGASI values of 66 Neisseria genomes. For the 66 Neisseria genomes, the ANI and dDDH values (A) and CGASI values (B) were calculated for each genome comparison and color-coded to illustrate the results with respect to cutoffs of an ANI of ≥95% and a dDDH value of ≥70%. (C) An ML phylogenomic tree was generated using 1,000 bootstraps, with the 0.33-Mbp Neisseria core genome alignment. Circles of the sample colors next to the names of each genome indicate members of the same species as defined by a CGASI of ≥96.8 algorithms to generate both network trees and ML trees. For both cases, despite wInc SM and wTpre having <96.8% CGASI, the phylogenies reveal that these organisms should be

164

considered the same species as the other supergroup A and B Wolbachia species,

respectively.

Other taxa

The power of an approach that combines phylogenetics with a nucleotide identity

threshold is similarly observed with Neisseria. A 0.33-Mbp core genome alignment was generated from 66 Neisseria genomes, accounting for 14.7% of the average input genome size. The core genome alignment largely supports the classically defined species, including , , Neisseria weaveri, and

Neisseria lactamica, although one N. lactamica isolate, N. lactamica 338.rep1_NLAC,

appears to be inaccurately assigned (Figure 6.11). The CGASI values suggest that each

Neisseria elongata isolate is a distinct species, as are the Neisseria flavescens isolates.

Like what was observed in Wolbachia species, a paraphyletic clade is observed when using a CGASI cutoff of 96.8%, with the genome of Neisseria sp. strain HMSC061B04

being nested within a clade of two Neisseria mucosa genomes, N. lactamica

338.rep1_NLAC, and several unnamed Neisseria taxa while not having a CGASI of

≥96.8% compared to any other Neisseria genome (Figure 6.11).

Discussion

Workflow for the taxonomic assignment of new genomes at the genus and species levels

These results highlight that while sequence identity cutoffs derived using nucleotide- based pairwise comparisons are important when delineating species, they should be coupled with an analysis of phylogenetic trees. Yet, phylogenetic trees alone are

165

inadequate, as they cannot identify the level of branching where a species should be

defined. However, the two results can be quite complementary when applied to core

genome alignments and together make for a robust approach.

Given our results, we propose a workflow using three criteria to guide the taxonomic

assignment of novel genomes at the genus and species levels (Figure 6.12). Starting with a novel, sequenced query genome, the most closely related genus of the query genome should be identified using homology-based search algorithms (e.g., BLASTN searches of

16S rRNA). Once identified, a set of trusted genomes within the genus should be combined with this query genome to generate a core genome alignment using a tool such as Mauve (289) or Mugsy (246). These trusted genomes would be the type genomes, analogous to type strains. Unlike type strains, they are digitally recorded and thus highly unlikely to be lost. However, multiple organisms with different genome sequences could now serve in the capacity of type strains, eliminating the possibility that a single unusual type strain could unduly influence the taxonomy.

From the core genome alignment, the genus assignment would be considered validated if the length of the core genome alignment is ≥10% of the average input genome size.

Subsequently, a pairwise sequence identity matrix is calculated from the core genome alignment, identifying all genomes with ≥96.8% sequence identity. Finally, phylogenomic trees are generated with the core genome alignment and overlaid with the

CGASI results. A query genome is deemed a novel species if it is <96.8% identical to the genomes of the type strains and the clades remain monophyletic. Importantly, whole-

166

Figure 6.12: Workflow for the taxonomic assignment of a novel genome at the genus and species levels. The proposed workflow for assigning genus- and species-level designations using core genome alignments is based on three criteria: (i) the length of the core genome alignment, (ii) the sequence identity of the core genome alignment, and (iii) phylogenomic analyses. Using a query genome and a set of trusted genomes from a single genus, a core genome alignment is generated. The length of the core genome alignment is the first criterion that is used as the genus-level cutoff, with a core genome alignment size of ≥10% of the average input genome size indicating that all genomes within the subset are of the same genus. Provided that all genomes in the subset are of the same genus, the second criterion is the sequence identity of the core genome alignment, with genomes sharing ≥96.8% similarity being designated the same species. The third and final criterion uses a phylogenomic tree generated using the core genome alignment to check for paraphyletic clades. If the query genome has <96.8% core genome alignment sequence identity with any other genome and its designation as a new species does not form a paraphyletic clade, the genome should be considered a novel species. genome sequences are not needed to assign an existing genus and species designation to a new isolate, merely to describe new genera or species. Community-derived species- specific MLST schemes may be well suited for assigning isolates to already-named and - described species.

167

The genomics and data era may necessitate increased flexibility or a modification of

the rules in bacterial nomenclature

The International Code of Nomenclature of Prokaryotes (321) is a frequently updated governance that includes 65 rules that regulate bacterial nomenclature, with the laudable goal of maintaining stability in bacterial names. These rules are designed to assess the correctness of bacterial names, as well as procedures for creating new names, but are agnostic to the methods or criteria used to delineate genera or species (321).

However, modern methods (i.e., genome sequencing, bioinformatics, data sharing, and isolate repositories) have the potential to profoundly and negatively affect bacterial nomenclature in ways where the current code may need to adjust and adapt. Historically, isolates were frequently discovered, described, and named by the same individual or group of individuals. Conflicts arose over time (e.g., two named species turned out to be the same species), and many of the rules focus on resolving those inevitable conflicts.

One principle in the code for resolving conflicts relies on using the first published name.

However, with modern methods, it would be fairly simple to conduct a comprehensive bioinformatics analysis and be the first to publish names of organisms, without any input from the relevant research communities, including individuals who have spent their scientific careers working on unnamed organisms. In addition, with extensive genomic data, there have been increased efforts to harmonize taxonomy that will lead to large changes in bacterial nomenclature. While choosing the first published name may be effective at resolving small conflicts, it may not be ideal for larger taxonomic changes that may be necessary and warranted. In these cases, the code may need to prioritize large collaborative community-supported efforts that include all relevant stakeholders to 168 reform the nomenclature in a meaningful and important way, particularly efforts that reflect the recent changes in the ways that we think about bacteria and are reflective of the modern conceptual revolutions to which Stephen Jay Gould referred.

Furthermore, a complete genome sequence fully describes an organism, including its phenotypic potential. As such, a genome-informed taxonomic scheme, particularly one that relies on groups of genome type strains, as described here, may render the

“Candidatus” designation obsolete. “Candidatus” was originally proposed as a category or rank given to a potential new taxon that has only sequence data and a morphology determined through microscopy with a nucleic acid probe (322). At that time, this precluded the naming of a species with only a single gene sequence, like 16S rRNA, which could be a chimera. While “Candidatus” had an important role at that time and prior to the widespread use of whole-genome sequencing, it has led to extensive confusion among researchers, particularly those new to fields with organisms that use this naming scheme. It also places an undue burden on researchers that study obligate host- associated organisms that are often very highly characterized but unamenable to growth on laboratory media and long-term storage in culture collections. Additionally, among endosymbionts and other obligate intracellular organisms, it has been applied inconsistently, compounding this confusion. Therefore, it seems reasonable that its use should be limited to only organisms from which a complete genome sequence cannot be obtained consistently from the same host organism or host cell line.

169

Limitations

A current issue of whole-genome aligners, such as Mauve or Mugsy, is the inability for

the software to scale, being able to computationally handle only subsets of at most 80

genomes. While the aligner ParSNP in the Harvest suite can generate a core genome∼

alignment from a larger subset of genomes, its use is limited to intraspecies comparisons

(304), making it currently unamenable to this approach. However, for heavily sequenced

genera, such as the Rickettsia genus, future species assignments for novel genomes do not

require constructing a core genome alignment from every available genome. Instead, we

recommend that for each genus, the relevant experts in the community establish and

curate a set of trusted genomes with at least one representative of each named species that

should be used for constructing core genome alignments. After an initial assessment,

refinement could be made by using a core genome alignment with many more genomes

from closely related species. As an example, in the case of Rickettsia species, the initial alignment would include representative genomes from each of the ancestral, typhus, spotted fever, and transitional groups. If the analysis indicated that the query was closest to typhus group genomes, a subsequent alignment would include a curated subset of genomes only from the typhus group. The first core genome alignment would serve to classify a genome into a subgroup, while the second would be used to assign a species designation.

Conclusions

As noted with the development of MLST (323), sequence data have the advantage of being incredibly standardized and portable. While MLST methods allow for insight at the subspecies level of taxonomic classifications and are heavily relied upon during 170 infectious disease outbreaks, they do not provide sufficient resolution to define species.

Increased resolution is required for taxonomy and can be obtained through the construction of whole-genome alignments that maximize the number of evolutionarily informative positions.

By generating core genome alignments for different genera, we sought to identify and develop a universal, high-resolution method for species classification. Using the core genome alignment, a set of genomes can be assessed based on the length, sequence identity, and a phylogenomic tree of the same core genome alignment. The length of the core genome alignment, which reflects the ability of the genomes to be aligned, is informative in delineating genera. Meanwhile, the pairwise comparisons of core genome alignment sequence identity can be used to delineate species, which can in turn be further refined with a phylogenetic analysis to resolve paraphyletic clades in the taxonomy.

Through this work, we have identified criteria that reconstruct classical species definitions using a method that is transparent and portable. In the course of this work, we have identified modifications that need to be made to the species and genus designations of a number of organisms, particularly within the Rickettsiales. While we have identified these instances, we recommend that any changes in nomenclature be addressed by collaborative teams of experts in the respective communities.

Materials and Methods

Synteny plots

For the 10 complete Wolbachia genomes, NCBI BLAST v2.2.26 with the blastn algorithm was used to obtain the coordinates of the links for each of the pairwise genome

171

comparisons. Synteny plots were generated and visualized using the Artemis

Comparative Tool viewer (324, 325).

Core genome alignments

The exact commands for generating a core genome alignment from a directory containing all of the fasta files are provided in a bash script (see Text S1 in https://msystems.asm.org/content/3/6/e00236-18/figures-only#fig-data-supplementary-

materials); the local paths for the bin directors for Mugsy and mothur need to be specified

in the script. More specifically, genomes used in the taxonomic analyses were

downloaded from NCBI’s GenBank (326). OrthoANIu v1.2 (327) and USEARCH

v6.1.544 (328) were run with default settings and used for ANI calculations. GGDC v2.1

(ggdc.dsmz.de) (287) paired with the recommended BLAST+ alignment tool (329) was

used to calculate dDDH values. For analyses in this paper, all dDDH calculations were

performed using dDDH formula 2 due to the usage of draft genomes in taxonomic

analyses (286). For each of the genome subsets, core genome alignments were generated

using Mugsy v1.2 (246), with default settings, or Mauve v2.4 (289), with the progressive

Mauve algorithm, retaining only LCBs of >30 bp in length that are present in all

genomes. mothur v1.22 was used to process each of the core genome alignments to

remove positions not present in all organisms in the genome subset (247). Sequence

identity matrices for the core genome alignments were created using BioEdit v7.2.5

(brownlab.mbio.ncsu.edu/JWB/papers/1999Hall1.pdf). ML phylogenomic trees with

1,000 bootstraps were calculated for each core genome alignment using IQ-TREE v1.6.2

(330) paired with ModelFinder (331) to select the best model of evolution and UFBoot2

(332) for fast bootstrap approximation. Trees were visualized and annotated using iTOL 172

v4.1.1 (333). Construction of neighbor-network trees was done using the R packages ape

(334) and phangorn (335).

Core protein alignments

Orthologs between complete genomes of the same species were determined using

FastOrtho, a reimplementation of OrthoMCL (305) that identifies orthologs using all-by-

all BLAST searches, run with default settings. The amino acid sequences of proteins

present in all organisms in only one copy were aligned with MAAFT v7.313 FFT-NS-2, using default settings (336). For every protein alignment, the best model of evolution was identified using ModelFinder (331), and phylogenomic trees were constructed using an

edge-proportional partition model (337) with IQ-TREE v1.6.2 (330) and UFBoot2 (332) for fast bootstrap approximation. PhyloPhlAn v0.99 was run using default settings as an alternative method of assessing protein-based phylogenies (298). Comparative analyses of core protein and core nucleotide alignment trees were performed using the R packages ape (334) and phangorn (335).

Authors’ Contributions

MC, JBM, and JCDH wrote portions of the manuscript. MC, HT, and JCDH conceived algorithm. MC conducted bioinformatics analyses and generated figures. JCDH designed and oversaw analyses. All authors read and edited the manuscript.

Acknowledgements

We thank Joseph J. Gillespie for helpful discussions and encouragement.

This work was funded by the National Institute of Allergy and Infectious Diseases (grant

U19AI110820). 173

Chapter 7: Considerations for adopting species designations in the genus Wolbachia

Abstract

The importance of bacterial species designations lies in how they inform how we view relationships between two different organisms. In the absence of species designations, it becomes difficult to properly evaluate relationships between different bacterial strains.

While there have been numerous methods devised for the assignment of species, the obligate, intracellular bacterial genus Wolbachia has avoided assigning species designations. DNA-DNA hybridization, thought to be the gold standard for bacterial species assignment, is unable to be performed on unculturable bacteria, while other techniques, such as 16S rRNA sequencing and multi-locus sequence typing, provided

inadequate resolution to divide the Wolbachia into species. These methods fail to capture the relevant phenotypic traits used to differentiate Wolbachia strains, namely cytoplasmic incompatibility. In its place, the Wolbachia community has elected to use supergroup designations, which are not intended to be analogous to species designations. The advancement of sequencing technologies and whole-genome based phylogenetic techniques has provided an opportunity to revisit species designations in the Wolbachia genus. Here, we discuss the necessary steps needed to assign species designations to the genus Wolbachia that are in line with the rest of bacterial taxonomy.

Introduction

Wolbachia is a bacterial genus that includes obligate, intracellular bacteria thought to infect >1 million different insect species and many filarial nematode species worldwide

(64). Unlike other bacterial genera the Wolbachia community has yet to adopt species

174

designations owing to their lack of culturability. Instead, the community uses sequence

data to classify these endosymbionts into supergroups that are not intended to be

analogous to species designations. However, current advances in sequencing and

phylogenomics tools provide the opportunity to identify species boundaries and name

species.

In 1924, Wolbachia endosymbionts were discovered with the microscopic observation of

Rickettsia-like micro-organisms in the gonads of the mosquito Culex pipiens (338).

Twelve years later, this isolate was classified in the novel genus Wolbachia based off its

unique pleomorphic characteristics, namely that its cells appeared as rings, curved, or

bent rods (339). The next reported classifications of bacteria in the genus Wolbachia

occurred in 1956 and 1961 during which Wolbachia melophagi (now ) and

Wolbachia persica (now Francisella) were respectively described and classified based on

their similar phenotypic characteristics to W. pipientis (340, 341).

Beginning in 1977, the use of the 16S rRNA gene as a phylogenetic marker (342-344)

enabled a more objective taxonomic classification of bacteria (345). The 16S rRNA sequences of W. melophagi and W. persica led to their reclassifications into the

Bartonella and Francisella genera, respectively. However, as more Wolbachia strains

were sequenced, the slow evolutionary rate of the 16S rRNA gene made resolving species

relationships difficult; the 16S rRNA failed to differentiate Wolbachia strains based on

their cytoplasmic incompatibility (CI) phenotypes (63, 346). Phylogenetic analyses using

single-copy genes with a higher rate of evolution, such as ftsZ (347) and dnaA (348),

175

were done to obtain a higher resolution phylogeny of the Wolbachia that could possibly

aid in defining Wolbachia past the genus level.

In 1998, the wsp gene was first used as a tool for typing Wolbachia strains (63, 349).

Cited as having a higher evolutionary rate than the 16S rRNA, ftsZ, and dnaA, the

increased resolution of wsp was able to divide the Wolbachia into groups that

differentiated CI phenotypes (63). Over time, a classification system arose based on the

wsp gene whereby Wolbachia were subdivided into supergroups. Using this system, each

Wolbachia supergroup was defined by a set of supergroup-specific wsp primers.

Wolbachia supergroups were further subdivided into groups, in which members of the

same group would have ≥97.5% wsp sequence identity (63). However, the use of wsp is

confounded by significant intragenus recombination (350-352) and differing rates of

selection between arthropod and nematode Wolbachia (353, 354). Intragenus

recombination is widespread in Wolbachia endosymbionts, being observed even in

housekeeping genes, including ftsZ and dnaA (355).

In the early 2000s, multi-gene phylogenetic approaches were increasingly used to type

bacterial genera, including the Wolbachia (66, 356, 357). In 2006, a Wolbachia multilocus sequencing typing (MLST) system was proposed (66) using sequences from five conserved Wolbachia genes (ftsZ, gatB, coxA, hcpA, and fbpA), along with wsp as an optional strain marker. The MLST system minimizes the single-locus recombination biases observed by constructing single gene phylogenies while providing a reliable method to discriminate Wolbachia strains. However, the Wolbachia MLST system is unable to differentiate between closely related strains due to the conserved nature of the

176

single-copy MLST genes (67). Additionally, the MLST system was designed at a time

when the only complete Wolbachia genomes available were of wMel and wBm (66-70).

While all these tools are able to type strains, species designations remained elusive and were contentious within the Wolbachia community, being discussed at each of the biennial international Wolbachia conferences. However, with the rapid improvements made in sequencing technologies paired with its decreasing costs, whole genome approaches are becoming increasingly used in bacterial taxonomy. At the 2018 Tenth

Annual Wolbachia Conference in Salem, MA, USA, it was decided by vote that the time is right for a community-supported taxonomy enabled by recent developments in using genomics to define taxonomy.

Historically, DNA-DNA hybridization (DDH) has been the “gold standard” for bacterial species classification (282, 358). However, DDH is not possible for obligate-host associated bacteria. Using whole genome data, average nucleotide identity (ANI), digital

DNA-DNA hybridization (dDDH), and core genome alignment sequence identity

(CGASI) are three techniques that determine species cutoffs intended to be analogous to the DDH species cutoffs (Chapter 6) (281, 283, 359, 360). There have been numerous proposals to adopt these genomic approaches as the new standard for defining bacterial species due to the limitations and labor involved with DDH and the higher resolution conferred with whole genome approaches (278, 359, 361).

The shift in bacterial taxonomy from using DNA-DNA hybridization to whole-genome based approaches provides an opportune time to establish species names and bring

Wolbachia taxonomy in line with the rest of bacterial taxonomy. However, there are

177

several important considerations needed for adopting the classical taxonomic

nomenclature for the Wolbachia.

Methods for protein and whole-genome based phylogenetic analyses

While numerous studies have analyzed Wolbachia strains using a set of core proteins, no

specific species cutoffs have been drawn using core protein-based methods. PhyloPhlAn

(298) and the Genome Taxonomy Database Toolkit (GTDB-Tk) (362) are two taxonomic tools designed to classify bacterial genomes using sets of 400 and 120 conserved genes, respectively. When applied to ten complete Wolbachia genomes from supergroups A-F,

PhyloPhlAn was able to construct a 2,877 aa alignment using 176 of its 400 query proteins (Chapter 6) (360). The PhyloPhlAn taxonomic imputation pipeline was able to classify all genomes as Wolbachia but did not differentiate them as separate species, likely due to all Wolbachia in the PhyloPhlAn tree of life being classified as the same species. Using the same set of Wolbachia genomes, GTDB-Tk could construct a 5,035 aa alignment using 106 of its 120 query genes. Taxonomic ranks are defined in GTDB-Tk using a parameter called the relative evolutionary divergence (RED). All extant taxa are assigned a RED value of 1 while the least common ancestor is assigned a RED of 0. For all internal nodes between the two, RED values are extrapolated according to lineage- specific rates of evolution. Taxonomic ranks are then assigned as the median RED value for all taxa within a specific rank (362). While GTDB-Tk classifies all ten genomes as

Wolbachia, the species designations are more unclear with only wOo and wOv being assigned as the same species and the remaining eight Wolbachia being given no species designations. Additionally, the Genome Taxonomy Database (GTDB) repository

(gtdb.ecogenomic.org) does not assign species designations for 37 of the 46 Wolbachia 178

genomes taken from the NCBI Assembly database. The lack of species designations

assigned by GTDB and the GTDB-Tk indicates an inability to reliably assign species

designations using the method (362).

There are three major nucleotide-based whole genome classification methods: ANI,

dDDH, and CGASI (Chapter 6) (283, 359, 360, 363). ANI is calculated from splitting a

query genome into 1 kbp fragments, which are then searched against a reference genome.

The ANI between two genomes is the average sequence identity between all matches

with >60% sequence identity over >70% of their length. For dDDH calculations, high

scoring segment pairs (286) or maximally unique matches (364) are identified between

two genomes. While there are several formulas available for dDDH calculations, most

commonly for comparisons between draft genomes, dDDH values are calculated by

dividing the sum of all identities by the overall match length (281, 283). However, ANI and dDDH set species cutoffs without further phylogenetic tree-based analyses, such that the formation of paraphyletic clades is not routinely evaluated.

More recently, core genome alignments have been rising to the forefront of bacterial

taxonomy. A core genome alignment is a concatenated alignment that contains only sets

of orthologous sequences present in all organisms of interest (246, 304) while core

protein analyses are constructed using the amino acid sequences of the genes shared

between the organisms of interest (288, 365, 366). Nucleotide core genomes are

advantageous to protein-based analyses in that they account for a greater number of

phylogenetically-informative positions and are independent of annotation (301, 302).

However, core genome alignments are limited to taxonomic analyses at the genus level,

179 while protein analyses can be performed between organisms at any taxonomic level (298,

300, 362). The sequence identity of the core genome alignment (CGASI) has been proposed as a method for taxonomic classification at the genus and species levels

(Chapter 6) (360). While ANI and dDDH are unable to classify bacteria above the species level, CGASI offers a method for genus-level designations and is easily amenable to including an additional phylogenetic tree-based analysis to avoid the formation of paraphyletic clades (Chapter 6) (360).

Two separate analyses of the Wolbachia using the ANI and dDDH species cutoffs found that strains from different supergroups had ANI and dDDH values below the species cutoff, confirming previous analyses that the Wolbachia genus contained multiple species. Within supergroups, the studies published by Ramirez-Puebla et al. and Chung et al. indicated that wDi and wLs should be designated as separate species from the other supergroup C and D strains, respectively (Chapter 6) (68, 360). Similarly, Ramirez-

Puebla et al. identified that the Wolbachia endosymbionts of Diaphorina citri, wDacA and wDacB, were separate species from the other members of supergroups A and B, respectively, using ANI and dDDH cutoffs (68). While the Chung et al. study did not include wDacA or wDacB due to the fragmented nature of both genomes, wInc SM and wTpre met the ANI, dDDH, and CGASI cutoffs indicating that they were each a separate species from the other supergroup A and B Wolbachia, respectively. Both studies have shown that if Wolbachia species were determined strictly using sequence identity cutoffs, supergroups A and B would be split into multiple species. However, the CGASI method proposed by Chung et al. supplements the CGASI cutoff with a phylogenetic tree-based analysis that corrects for paraphyletic clades (Figure 7.1) (Chapter 6) (360). 180

Ideally, the method used to assign Wolbachia species should be both high resolution and

scalable to classify novel genomes. Of the described methods, ANI and dDDH provide the most simple and accessible cutoffs for delimiting species based on entire genomes, having web-based tools that allow for ANI or dDDH values to be calculated between any

two genomes (286, 327, 367, 368). However, tree-independent approaches such as ANI

and dDDH are more likely to generate paraphyletic clades, due to species cutoffs being

determined using only a sequence identity cutoff. Core genome alignments provide an

improved resolution compared to both the ANI and dDDH approaches while also

implementing a tree-based analysis for species classification. In doing so, core genome

alignments maintain the high resolution conferred by being a nucleotide-based analysis

while avoiding the pitfalls associated with tree-independent analyses (Chapter 6) (360).

The main issue of core genome alignments is its inability to scale (369). The compute time required for whole-genome alignment based methods exponentially increases as more genomes are added due to a reliance on all-versus-all progressive alignments (246,

370). Protein-based approaches such as PhyloPhlAn and GTDB-Tk can better scale but forego the increased resolution conferred by nucleotide-based approaches. Because the alignment used for inferring phylogenetic relationships is smaller, the resolution of the alignment decreases but the method can handle a larger subset of genomes. Additionally, protein-based approaches allow for taxonomic classification at all ranks while nucleotide- based approaches still rely on 16S rRNA screening and similar techniques for the initial placement of a genome into a genus.

181

Figure 7.1: Neighbor-network tree of 23 Wolbachia genomes A neighbor-network tree was generated using the core genome alignment of 23 Wolbachia genomes assembled in one piece. The shape next to each Wolbachia genome represents supergroup A (⚫), B (▲), C (■), D (⬟), E (⬢), F (⯃), and L (◆) designations, while the color of the shape indicates species as defined by a core genome alignment sequence identity (CGASI) cutoff of ≥96.8%. Paraphyletic clades are formed in supergroups A and B with wInc_SM and wTpre if the species cutoff used is strictly based off sequence identity.

The International Code of Nomenclature

For the scientific naming of prokaryotes, the International Code of Nomenclature of

Prokaryotes (ICNP) was established in 1980 (321). While the ICNP is agnostic of the

182 method used to delimit taxa, it provides a resource to assess the correctness of the names applied. Presently, the only valid Wolbachia species epithet accepted by the ICNP and published in the List of Prokaryotic Names with Standing in Nomenclature (LPSN) is W. pipientis (www.bacterio.net) (371). According to the ICNP guidelines, assigning novel taxa names requires that:

(1) The name is published in the International Journal of Systematic Bacteriology

(IJSB) or the International Journal of Systematic and Evolutionary Microbiology

(IJSEM).

(2) The publication of the name is accompanied by a description of the taxon or

by a reference to a previous effectively published description of the taxon.

(a) The new name or new combination should be clearly stated and indicated as

such (i.e. fam. nov., gen. nov., sp. nov., comb. nov., etc.).

(b) The derivation (etymology) of a new name (and if necessary of a new

combination) must be given.

(c) The properties of the taxon being described must be given directly after (a)

and (b). This may include reference to tables or figures in the same publication, or

reference to previously effectively published work.

(d) All information contained in (c) should be accessible.

(3) The type of the taxon must be designated.

183

The Candidatus epithet

The provisional status Candidatus was initially proposed to record the properties of

putative taxa for which molecular sequences are available but could not be isolated in a

pure culture. The term was initially proposed due to the uncertainty that a specific nucleic

acid sequence belonged to a unique organism in the environment. Without a pure culture,

it was possible that a nucleic acid sequence could arise from sequencing/amplification

errors or from formations of chimeric sequences (372). According to the ICNP, the

Candidatus name is a preliminary name and has no standing in prokaryotic literature.

For researchers that study unculturable obligate organisms that are highly characterized, this poses an added difficulty. Additionally, most environmentally-occurring bacteria are resistant to cultivation and efforts to improve isolation and culture methods for these organisms is both time-consuming and onerous (373). The improvements in sequencing technologies and assembly tools allows whole genomes to be constructed with higher confidence. Because the genome of an organism is thought to represent its phenotypic potential, isolating and culturing these bacteria becomes obsolete for taxonomic and phylogenetic-based analyses. While the presence of a pure culture does indicate that the genome sequence is reliably originating from a specific organism, the ability to reproduce a whole genome sequence repeatedly from the same sample is a similar indicator that the sequence is originating from a single isolate. Therefore, we propose that the use of

Candidatus should instead be limited to organisms in which a complete genome sequence cannot be consistently obtained from the same host organism or cell line.

184

Naming Wolbachia species

Ramirez-Puebla et al. recently published a study designating delimiting Wolbachia

species based on ANI, dDDH, synteny, G+C content, and protein-based analyses. In total,

eight Candidatus names were assigned to isolates spanning supergroups A-F (68).

Recognized species names were not proposed because type material was not designated.

While their methods for assigning species were largely sound, the specific species names chosen are contentious in the Wolbachia community, particularly that several of the chosen names provide misleading descriptions for their clades (290). This is compounded by a lack of involvement of the active Wolbachia community that has been discussing the taxonomy for over two decades.

The ICNP prioritizes the first established names regarding naming conflicts. While historically, isolates were discovered, described, and named by the same individual or groups of individuals, the Wolbachia are an anomaly in that no species names are established due to the community-wide preference of the supergroup classification system until now. The presence of large sequencing data repositories makes it so that whole genome-based bioinformatics analyses to delimit and publish names to Wolbachia species can be performed by any person with a computer without any community input.

While for the individual naming of isolates, priority can be given to the earlier name to resolve smaller conflicts, we believe that large shifts in taxonomic nomenclature should be decided by large, collaborative community-supported efforts.

185

The designation of Wolbachia type material

For each named species, the type strain serves as a designated strain whose name is

associated with the species as whole. The designation of type strains requires the

deposition of type strains as pure cultures in two culture collections in different countries.

Although previous iterations of the ICNP indicating that a description, preserved

specimen, or illustration could serve as the type material for unculturable organisms,

deposition in culture collections is required as of 2001. However, there have been

proposals to use genome sequences as the type material for strains (374-378). These proposals state that as new methods are developed, they should be able to serve as the type material if they unambiguously identify their representative taxa group (375, 376,

378). Provided that an organism’s genome sequence provides the highest resolution material with which to delineate species, we believe that genome sequences are sufficient as type material.

Of the 16 Wolbachia supergroups, isolates from supergroup A (379-383) and supergroup

B (383, 384) are able to be cultured and maintained in mosquito cell lines. For Wolbachia which can be cultured, we would recommend the deposition of the type strain in addition to supplying the whole genome sequence as the type material. Although designation of the type strain is given priority to the earlier strain, we believe that the initial designation should prioritize strains that have available genome sequences and can be cultured and deposited into culture collections.

186

Genome quality standards for use as type material

For largely unculturable organisms such as the Wolbachia, the taxonomic classification

system is highly dependent on the quality of genome assemblies. To maintain the

integrity of a classification system based on genomes, we believe that complete genome

sequences must be used for type material in order to define a novel Wolbachia species.

The quality of the complete genome sequence must also be held to a minimum standard.

Wolbachia genomes are widespread in lateral gene transfer events to the host, in which large portions of the genome of their Wolbachia endosymbiont are integrated into host genomes (385, 386). During assembly, reads originating from the host genome can be mistakenly used in the assembly of the Wolbachia genome (387). As an example, the

currently deposited wAna, Wolbachia endosymbiont of Drosophila ananassae, is likely

composed entirely of lateral gene transfer events from its host (385, 388), making it unsuitable for use in phylogenetic analyses. Additionally, multiple strains of Wolbachia from different supergroups have been found within the same host, further confounding the assembly of individual Wolbachia genomes (290, 389). However, there are few tools to assess contaminations at the assembly level from the host, with contaminations from other Wolbachia being even harder to detect (Chapter 5) (390). While these issues may

eventually be resolved with high fidelity, long read sequencing technologies, at this point

in time contamination should be addressed at the sequencing level using targeted

enrichments for specific Wolbachia strains (258).

Future Directions

Transitioning the Wolbachia to the traditional species designations used by other bacterial genera is no small task and will be done in collaboration with members of 187

Wolbachia research community who have elected to participate. We anticipate first

designating wPip as the type strain for the genus Wolbachia and the species Wolbachia

pipientis, which will represent all supergroup B Wolbachia strains. Subsequently, we will

designated a species epithet for wMel, which will be the type strain for supergroup A

Wolbachia strains. However, the scarcity of bacterial culture collections able to maintain

other Wolbachia species requires that we seek an exemption from the International

Committee on Systematics of Prokaryote such that storage in only one culture collection

is required. This is how we will seek species designations for the Wolbachia

endosymbionts of the filarial nematodes Brugia malayi and Dirofilaria immitis. For

unculturable Wolbachia strains, genome sequences have been proposed as sufficient type

material for the species designations (375). The genome sequence of an organism represents its entire phenotypic potential and, provided that it is shown to be absent of erroneous regions originating from chimeric reads or lateral gene transfer regions, we believe genome sequences to be sufficient enough to serve as type material for obligate intracellular bacteria where the host organism provides ongoing selection for a specific strain. As an example, the filarial nematode Onchocerca volvulus cannot be cultured in vitro and thus, its Wolbachia endosymbiont wOo can never be stored in a culture

collection. However, the complete genome sequence of wOo is published and available.

This may require a rule change that we have already begun discussing with a member of

the International Committee on Systematics of Prokaryotes such that for unculturable

organisms, a complete genome sequence can be used as type material.

Species designations frame how we consider relationships between different strains of

bacteria. The supergroup designations as they are currently used in the Wolbachia genus, 188 while not intended to be analogous to species designations, function in a similar vein.

Despite there being a hesitance to designate Wolbachia species in the past, the current advances in sequencing technologies and whole genome-based taxonomic methods provides an opportune time to assign Wolbachia species designations. While we have made strides to adapt species designations to the Wolbachia taxonomy, we intend to keep the supergroup designations due to their extensive use in the nomenclature.

Authors’ Contributions

M.C. and JCDH wrote and edited the manuscript.

189

Chapter 8: Conclusions

Considerations for multi-species transcriptomics experiments

Since its development in 2008, RNA-Seq been increasing used to examine the underlying transcriptional signatures responsible for the emergence of specific phenotypes (391-

393). As sequencing technologies continued to improve, it became possible to analyze the transcriptomes of multiple species in a single system simultaneously. For such studies, the term dual-species transcriptomics was coined to describe transcriptomics studies that

simultaneously assessed the transcriptional profiles of multiple organisms from a single

sample (144). Compared to typical transcriptomic studies, dual species transcriptomics

studies are technically challenging due to the varying proportions of reads between the

different organisms. In most biologically-relevant infection models studying host-

pathogen interactions with eukaryotic and prokaryotic organisms, the total RNA content

of the eukaryote outnumbers the prokaryote 10-fold (144). As such, one would have to generate at least 10-fold more sequencing reads in order to get adequate coverage of the lower abundance member to study these host-pathogen interactions.

For our transcriptomics analysis, we analyzed the transcriptome of three organisms simultaneously across the B. malayi life cycle: B. malayi, its Wolbachia endosymbiont wBm, and its vector host A. aegypti. The technical challenges in this study stemmed primarily from the vector samples, in which the sequenced reads were ~80% A. aegypti,

~20% B. malayi, and <0.0001% wBm. As such, the Wolbachia transcriptome in the

vector life stages has never been assessed before our study. Using the Agilent SureSelect

(AgSS) RNA-Seq capture system, we were able to extract a meaningful number of B.

190

malayi and wBm reads from all samples to conduct meaningful statistical analyses

(Chapter 2, Chapter 4). Additionally, we showed that the AgSS capture system did not

significantly bias the recovered transcriptome from the target organism relative to

standard enrichments such as the poly(A)-enrichment or poly(A)-, rRNA depletion

(Chapter 2).

Our analysis of the AgSS capture platform serves as a proof of concept in its use in multi- species transcriptomics analyses. By also validating the use of the AgSS platform in host- pathogen systems involving the fungal pathogen Aspergillus fumigatus, we showed the potential of the AgSS in facilitating the transcriptomics studies of complex multi-species systems. In addition to being able to extract reads from a secondary or tertiary organism in a sample, the AgSS platform can also be used for the detection of low abundance

transcripts. From our studies, we believe that the AgSS platform and similar RNA-Seq

capture technologies are necessary for enriching samples in multi-species transcriptomics

experiments, especially those in which secondary organisms are present in low

abundance (Chapter 2).

Prokaryotic-based quantification tools for RNA-Seq analyses

For our transcriptome analysis of the Wolbachia endosymbiont of B. malayi, wBm, our

initial results using the RNA-Seq quantification tool HTSeq indicated that a large number of multi-gene fragments were not being quantified, but rather marked as ambiguous and discarded. These discarded multi-gene fragments included those that spanned genes that were transcribed together in an operon, leading to these operonic genes being under-

191

counted. We developed FADU as a means to derive meaningful quantification from these

ambiguous multi-gene fragments to better quantify operonic genes (Chapter 3).

Ideally, RNA-Seq quantification tools use full-length transcripts to quantify transcripts, but due to the vast number of prokaryotic organisms studied, determining the full-length transcript sequences for each of these organisms is impossible. Using FADU, we assigned proportional read counts for multi-gene fragments based off how much of gene is overlapped by a multi-gene fragment. While this allowed us to better quantify operonic genes, particularly those of a smaller size, this also led to a small proportion of counts erroneously assigned. As an example, counts from fragments that originated from the untranslated region of one gene would be incorrectly assigned to an adjacent gene whose coding sequence overlapped with the untranslated region. However, without full length transcript annotations it becomes impossible to differentiate cases where two genes are transcribed together from those that are simply in close proximity.

As methods for de novo transcriptomics assemblies continue to improve, tools such as kallisto or Salmon, which assign ambiguous fragments using an expectation- maximization algorithm, become more ideal. However, de novo transcriptome assembies will continue to have problems with coding-dense genomes, being unable to differentiate reads in regions where genes overlaps. As an alternative solution, the improvement of transcript and operon prediction technologies can also allow for the better quantification of prokaryotic genes in transcriptomics analyses. To this end, we are currently developing

FADU-O, a software designed for the prediction of operonic structures in prokaryotic genomes. By identifying where certain positions in the genome are enriched for the start

192

and ends of reads, we can identify full length transcripts from read data for more accurate

quantification in prokaryotic transcriptomics experiments.

The comprehensive transcriptome of lymphatic filariasis in the jird model

Using the AgSS platform (Chapter 2) and FADU (Chapter 3), we were able to assess the

transcriptome of B. malayi, its Wolbachia endosymbiont wBm, and its laboratory A. aegypti across the entire B. malayi life cycle (Chapter 4). For each organism, we identified sets of expression modules correlated with specific times in the B. malayi life cycle. To further examine the processes occurring in each of these life stages, a functional term enrichment analysis was conducted on each of the expression modules. In A. aegypti, we observed an increase in energetics demand alongside an increasing stress response that corresponds with the development of B. malayi from its L1 through its L3

life stages. In B. malayi we identified an upregulation of molting proteins that coincided with each of the known molting periods along with an an upregulation of kinases and phosphatases in adult males and DNA-binding and chromatin-remodeling genes in adult female B. malayi. For wBm we identified very few enriched functional terms corresponding to each of its expression modules. However, we did observe the upregulation of the 6S RNA, a small RNA regulator of transcription, as B. malayi matured to its adult life stages (Chapter 4).

Our transcriptomics data set is a comprehensive resource that can be used for the interrogation of potential drug targets, biomarkers, and vaccine candidates. By combining our data set with the published secretomics data sets available, in the future, we can identify proteins that are secreted and heavily expressed in the adult or microfilariae life

193

stages to be used as potentially improved biomarkers. For interrogating potential vaccine

candidates, we were able to obtain two independent expression modules containing genes

upregulated in the infective L3 larvae life stage and microfilariae, respectively. In the

future, by querying these sets of genes for surface expression, potential vaccine

candidates can be identified for testing.

By further interrogating our transcriptomics data set we can better understand the interplay between the three organisms in the jird infection model. For B. malayi and A. aegypti, our analysis was conducted at the gene level rather than the transcript level.

Going forward, we plan on de novo assembling the reads generated from our data set in order to determine which specific B. malayi and A. aegypti splice variants are present in each of the samples. Additionally, we have another transcriptomics data set designed to assess the transcription of B. malayi lateral gene transfer events across the life cycle. For this data set, total RNA from each of the samples was enriched for polyadenylated transcripts and then captured using Agilent SureSelect probes designed using wBm genes.

By doing this, we enriched specifically for expressed, polyadenylated B. malayi mRNAs

with homology to wBm genes.

BET inhibitors as a potential therapeutic for lymphatic filariasis

We assessed the adult life stages for potential drug targets using a functional term enrichment analysis for each of our recovered expression modules indicating differential expression in the adult life stages. We identified an enrichment for chromatin-remodeling bromodomain proteins in adult female B. malayi. These proteins included the only two members of the bromodomain and extra-terminal (BET) family of transcription factors

194

encoded by B. malayi. Because BET inhibitors are currently being tested as cancer

therapeutics, BET inhibitors such as JQ1(+) were available and could be easily

repurposed for testing in vitro. When used to treat adult female B. malayi, JQ1(+) was

observed to have macrofilaricidal activity and cause sterility at a lower concentration that

ivermectin, one of the current prescribed therapeutics for lymphatic filariasis (Chapter 4).

However, despite our promising results. it is important to understand the limitations of

our in vitro studies using JQ1(+). Plans are currently in place to test the effect of JQ1(+)

in the jird infection mode for B. malayi. Additionally, the low half-life of JQ1(+) has been previously shown to limit its efficacy in humans. However, BET inhibitors with longer half-lives have been developed and are currently being tested in human cancer clinical trials. We plan to test the efficacy of these other BET inhibitors on adult female worms in vitro and potentially in vivo.

We also plan to identify the mechanism for BET inhibitors in B. malayi. In humans, BET inhibitors have been shown to competitively bind to BET proteins to prevent them from binding acetylated lysine residues on histones, which inhibits chromatin remodeling.

While the B. malayi BET proteins have the same two bromodomains identified in the human ortholog used to bind the acetylated lysine residues and BET inhibitors, this interaction has not been proven in B. malayi. Because in vivo protein-protein interaction assays have not been developed in filarial nematodes, we could explore the interaction with BET inhibitors against the B. malayi BET proteins by expressing them in a yeast expression vector, purifying them, and monitoring their interaction with BET inhibitors using techniques such as isothermal calorimetry. Using these models, we would not only

195

be able to confirm whether the BET inhibitors were interacting with the B. malayi BET

proteins but also identify which BET inhibitors have a higher affinity for binding to the

B. malayi BET proteins relative to human BET proteins.

In the case that the in vitro JQ1(+) results do not translate in vivo, we can use our

transcriptomics data set to identify other potential drug targets for testing. While we

focused on targeting the chromatin-remodeling proteins upregulated in adult female B. malayi, we also observed male-specific kinase signaling cascades as well as life stage specific G-protein coupled receptors that could potentially be inhibited to induce adult nematode death. Additionally, using our transcriptome data, we can identify several

expressed transcribed lateral gene transfers in B. malayi, termed nuwts (nuclear

Wolbachia transfers) (259). Transcribed nuwts represent an attractive therapeutic target

for lymphatic filariasis due to their presence in both B. malayi and its Wolbachia

endosymbiont but not the vertebrate host. If these nuwts are functional and present in

both wBm and B. malayi, the growth and development of both organisms can be hampered by targeting just a single gene while being less likely to have any harmful

effect to the host.

The state of the Wolbachia taxonomy

The current Wolbachia taxonomic scheme uses supergroup designations in place of the

species designations used by other bacterial genera. Despite this, the supergroup

designations are confusing to those unfamiliar to the field, with supergroups not

intending to be analogous to the standard species designations used in other bacterial

genera. While in the past, the Wolbachia community has been hesitant to embrace species

196

designations due to the lack of resolution obtained with species delimiting methods such

as 16S rRNA sequencing, we presented a method to assign species using core genome

alignments at the Wolbachia Conference 2018 in Salem, MA. Using our method, we re-

visited the taxonomic schemes of every genera within the Rickettsiales and recommended

alternative species designations and in some cases, genus designations. In particular, we

showed that the supergroup designations currently assigned to the Wolbachia need to be revised (Chapter 6).

We have been coordinating a community-wide effort in restructuring the Wolbachia taxonomy to replace the existing supergroups with species designations. To officially assign species designations, we have been following the guidelines set by the Code of

Nomenclature written by the International Committee on Systematics of Prokaryotes.

While the code is agnostic of the method for species delimitation, the code does outline several requirements for the establishment of a new species. The designation of a novel species requires deposition into two culture collections in different hemispheres of the world. Because the Wolbachia are intracellular bacteria, we have been reaching out to culture collections to see whether they would store intracellular Wolbachia in mosquito

or nematode cell lines. Additionally, we plan on establishing minimum criteria needed for

using genome sequences as type material, centered on ensuring that the Wolbachia

genomes used as type material are free of lateral gene transfer sequences (Chapter 5, 7).

Once all the criteria have been met, we plan on publishing manuscript detailing each of

the different re-classified Wolbachia species (Chapter 7).

197

Additionally, our work has also shown that Rickettsia species are too liberally assigned

while the Anaplasma and Neorickettsia should be each split into two separate genera. We plan to reach out to each of these communities individually in regard to our proposed taxonomic changes.

Authors’ Contributions

M.C. wrote the manuscript. M.C. and J.C.D.H. edited the manuscript.

198

References

1. Stepek G, Buttle DJ, Duce IR, Behnke JM. 2006. Human gastrointestinal nematode infections: are new control methods required? Int J Exp Pathol 87:325- 41. 2. Disease GBD, Injury I, Prevalence C. 2016. Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990-2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet 388:1545-1602. 3. Grote A, Lustigman S, Ghedin E. 2017. Lessons from the genomes and transcriptomes of filarial nematodes. Mol Biochem Parasitol 215:23-29. 4. Taylor MJ, Hoerauf A, Bockarie M. 2010. Lymphatic filariasis and onchocerciasis. Lancet 376:1175-85. 5. Devaney E. 2006. Thermoregulation in the life cycle of nematodes. Int J Parasitol 36:641-9. 6. Aoki Y, Fujimaki Y, Tada I. 2011. Basic studies on filaria and filariasis. Trop Med Health 39:51-5. 7. Nutman TB. 2013. Insights into the pathogenesis of disease in human lymphatic filariasis. Lymphat Res Biol 11:144-8. 8. WHO. 2017. Global programme to eliminate lymphatic filariasis: progress report, 2016. Wkly Epidemiol Rec 92:594-607. 9. Pani SP, Srividya A. 1995. Clinical manifestations of bancroftian filariasis with special reference to lymphoedema grading. Indian J Med Res 102:114-8. 10. Dreyer G, Medeiros Z, Netto MJ, Leal NC, de Castro LG, Piessens WF. 1999. Acute attacks in the extremities of persons living in an area endemic for bancroftian filariasis: differentiation of two syndromes. Trans R Soc Trop Med Hyg 93:413-7. 11. Lustigman S, Makepeace BL, Klei TR, Babayan SA, Hotez P, Abraham D, Bottazzi ME. 2018. Onchocerca volvulus: The Road from Basic Biology to a Vaccine. Trends Parasitol 34:64-79. 12. Cano J, Basanez MG, O'Hanlon SJ, Tekle AH, Wanji S, Zoure HG, Rebollo MP, Pullan RL. 2018. Identifying co-endemic areas for major filarial infections in sub- Saharan Africa: seeking synergies and preventing severe adverse events during mass drug administration campaigns. Parasit Vectors 11:70. 13. Mediannikov O, Ranque S. 2018. Mansonellosis, the most neglected human filariasis. New Microbes New Infect 26:S19-S22. 14. Ta-Tang TH, Crainey JL, Post RJ, Luz SL, Rubio JM. 2018. Mansonellosis: current perspectives. Res Rep Trop Med 9:9-24. 15. Akue JP, Eyang-Assengone ER, Dieki R. 2018. Loa loa infection detection using biomarkers: current perspectives. Res Rep Trop Med 9:43-48. 16. Leggat PA, Melrose W. 2005. Lymphatic filariasis: disease outbreaks in military deployments from World War II. Mil Med 170:585-9. 17. Kamgno J, Pion SD, Mackenzie CD, Thylefors B, Boussinesq M. 2009. Loa loa microfilarial periodicity in ivermectin-treated patients: comparison between those

199

developing and those free of serious adverse events. Am J Trop Med Hyg 81:1056-61. 18. Asio SM, Simonsen PE, Onapa AW. 2009. Analysis of the 24-h microfilarial periodicity of Mansonella perstans. Parasitol Res 104:945-8. 19. Pichon G. 1983. Crypto-periodicity in Mansonella ozzardi. Trans R Soc Trop Med Hyg 77:331-3. 20. Weil GJ, Lammie PJ, Weiss N. 1997. The ICT Filariasis Test: A rapid-format antigen test for diagnosis of bancroftian filariasis. Parasitol Today 13:401-4. 21. Dieye Y, Storey HL, Barrett KL, Gerth-Guyette E, Di Giorgio L, Golden A, Faulx D, Kalnoky M, Ndiaye MKN, Sy N, Mane M, Faye B, Sarr M, Dioukhane EM, Peck RB, Guinot P, de Los Santos T. 2017. Feasibility of utilizing the SD BIOLINE Onchocerciasis IgG4 rapid test in onchocerciasis surveillance in Senegal. PLoS Negl Trop Dis 11:e0005884. 22. Dolo H, Coulibaly YI, Dembele B, Guindo B, Coulibaly SY, Dicko I, Doumbia SS, Dembele M, Traore MO, Goita S, Dolo M, Soumaoro L, Coulibaly ME, Diallo AA, Diarra D, Zhang Y, Colebunders R, Nutman TB. 2019. Integrated seroprevalence-based assessment of Wuchereria bancrofti and Onchocerca volvulus in two lymphatic filariasis evaluation units of Mali with the SD Bioline Onchocerciasis/LF IgG4 Rapid Test. PLoS Negl Trop Dis 13:e0007064. 23. Burbelo PD, Ramanathan R, Klion AD, Iadarola MJ, Nutman TB. 2008. Rapid, novel, specific, high-throughput assay for diagnosis of Loa loa infection. J Clin Microbiol 46:2298-304. 24. Pedram B, Pasquetto V, Drame PM, Ji Y, Gonzalez-Moa MJ, Baldwin RK, Nutman TB, Biamonte MA. 2017. A novel rapid test for detecting antibody responses to Loa loa infections. PLoS Negl Trop Dis 11:e0005741. 25. Nuchprayoon S. 2009. DNA-based diagnosis of lymphatic filariasis. Southeast Asian J Trop Med Public Health 40:904-13. 26. Zimmerman PA, Guderian RH, Aruajo E, Elson L, Phadke P, Kubofcik J, Nutman TB. 1994. Polymerase chain reaction-based diagnosis of Onchocerca volvulus infection: improved detection of patients with onchocerciasis. J Infect Dis 169:686-9. 27. Toure FS, Bain O, Nerrienet E, Millet P, Wahl G, Toure Y, Doumbo O, Nicolas L, Georges AJ, McReynolds LA, Egwang TG. 1997. Detection of Loa loa- specific DNA in blood from occult-infected individuals. Exp Parasitol 86:163-70. 28. Medeiros JF, Almeida TA, Silva LB, Rubio JM, Crainey JL, Pessoa FA, Luz SL. 2015. A field trial of a PCR-based Mansonella ozzardi diagnosis assay detects high-levels of submicroscopic M. ozzardi infections in both venous blood samples and FTA card dried blood spots. Parasit Vectors 8:280. 29. Tang TH, Lopez-Velez R, Lanza M, Shelley AJ, Rubio JM, Luz SL. 2010. Nested PCR to detect and distinguish the sympatric filarial species Onchocerca volvulus, Mansonella ozzardi and Mansonella perstans in the Amazon Region. Mem Inst Oswaldo Cruz 105:823-8. 30. Poole CB, Li Z, Alhassan A, Guelig D, Diesburg S, Tanner NA, Zhang Y, Evans TC, Jr., LaBarre P, Wanji S, Burton RA, Carlow CK. 2017. Colorimetric tests for diagnosis of filarial infection and vector surveillance using non-instrumented

200

nucleic acid loop-mediated isothermal amplification (NINA-LAMP). PLoS One 12:e0169011. 31. Poole CB, Tanner NA, Zhang Y, Evans TC, Jr., Carlow CK. 2012. Diagnosis of brugian filariasis by loop-mediated isothermal amplification. PLoS Negl Trop Dis 6:e1948. 32. Rebollo MP, Bockarie MJ. 2017. Can Lymphatic Filariasis Be Eliminated by 2020? Trends Parasitol 33:83-92. 33. Zaky WI, Tomaino FR, Pilotte N, Laney SJ, Williams SA. 2018. Backpack PCR: A point-of-collection diagnostic platform for the rapid detection of Brugia parasites in mosquitoes. PLoS Negl Trop Dis 12:e0006962. 34. Dorkenoo MA, de Souza DK, Apetogbo Y, Oboussoumi K, Yehadji D, Tchalim M, Etassoli S, Koudou B, Ketoh GK, Sodahlon Y, Bockarie MJ, Boakye DA. 2018. Molecular xenomonitoring for post-validation surveillance of lymphatic filariasis in Togo: no evidence for active transmission. Parasit Vectors 11:52. 35. Morris CP, Evans H, Larsen SE, Mitre E. 2013. A comprehensive, model-based review of vaccine and repeat infection trials for filariasis. Clin Microbiol Rev 26:381-421. 36. Wong MM, Fredericks HJ, Ramachandran CP. 1969. Studies on immunization against Brugia malayi infection in the rhesus monkey. Bull World Health Organ 40:493-501. 37. Luder CG, Soboslay PT, Prince AM, Greene BM, Lucius R, Schulz-Key H. 1993. Experimental onchocerciasis in chimpanzees: cellular responses and antigen recognition after immunization and challenge with Onchocerca volvulus infective third-stage larvae. Parasitology 107 ( Pt 1):87-97. 38. Akue JP, Morelli A, Moukagni R, Moukana H, Blampain AG. 2003. Parasitological and immunological effects induced by immunization of Mandrillus sphinx against the human filarial Loa loa using infective stage larvae irradiated at 40 Krad. Parasite 10:263-8. 39. Horton J. 2000. Albendazole: a review of anthelmintic efficacy and safety in humans. Parasitology 121 Suppl:S113-32. 40. Gyapong JO, Kumaraswami V, Biswas G, Ottesen EA. 2005. Treatment strategies underpinning the global programme to eliminate lymphatic filariasis. Expert Opin Pharmacother 6:179-200. 41. Geary TG, Moreno Y. 2012. Macrocyclic lactone anthelmintics: spectrum of activity and mechanism of action. Curr Pharm Biotechnol 13:866-72. 42. Geary TG, Woo K, McCarthy JS, Mackenzie CD, Horton J, Prichard RK, de Silva NR, Olliaro PL, Lazdins-Helds JK, Engels DA, Bundy DA. 2010. Unresolved issues in anthelmintic pharmacology for helminthiases of humans. Int J Parasitol 40:1-13. 43. Peixoto CA, Silva BS. 2014. Anti-inflammatory effects of diethylcarbamazine: a review. Eur J Pharmacol 734:35-41. 44. Kwarteng A, Ahuno ST, Akoto FO. 2016. Killing filarial nematode parasites: role of treatment options and host immune response. Infect Dis Poverty 5:86. 45. Cline BL, Hernandez JL, Mather FJ, Bartholomew R, De Maza SN, Rodulfo S, Welborn CA, Eberhard ML, Convit J. 1992. Albendazole in the treatment of

201

onchocerciasis: double-blind clinical trial in Venezuela. Am J Trop Med Hyg 47:512-20. 46. Awadzi K, Edwards G, Duke BO, Opoku NO, Attah SK, Addy ET, Ardrey AE, Quartey BT. 2003. The co-administration of ivermectin and albendazole--safety, pharmacokinetics and efficacy against Onchocerca volvulus. Ann Trop Med Parasitol 97:165-78. 47. Cross HF, Haarbrink M, Egerton G, Yazdanbakhsh M, Taylor MJ. 2001. Severe reactions to filarial chemotherapy and release of Wolbachia endosymbionts into blood. Lancet 358:1873-5. 48. Keiser PB, Reynolds SM, Awadzi K, Ottesen EA, Taylor MJ, Nutman TB. 2002. Bacterial endosymbionts of Onchocerca volvulus in the pathogenesis of posttreatment reactions. J Infect Dis 185:805-11. 49. Herrick JA, Legrand F, Gounoue R, Nchinda G, Montavon C, Bopda J, Tchana SM, Ondigui BE, Nguluwe K, Fay MP, Makiya M, Metenou S, Nutman TB, Kamgno J, Klion AD. 2017. Posttreatment Reactions After Single-Dose Diethylcarbamazine or Ivermectin in Subjects With Loa loa Infection. Clin Infect Dis 64:1017-1025. 50. Taylor HR. 1989. Ivermectin treatment of onchocerciasis. Aust N Z J Ophthalmol 17:435-8. 51. Kamgno J, Pion SD, Chesnais CB, Bakalar MH, D'Ambrosio MV, Mackenzie CD, Nana-Djeunga HC, Gounoue-Kamkumo R, Njitchouang GR, Nwane P, Tchatchueng-Mbouga JB, Wanji S, Stolk WA, Fletcher DA, Klion AD, Nutman TB, Boussinesq M. 2017. A Test-and-Not-Treat Strategy for Onchocerciasis in Loa loa-Endemic Areas. N Engl J Med 377:2044-2052. 52. Taylor MJ, Bandi C, Hoerauf A. 2005. Wolbachia bacterial endosymbionts of filarial nematodes. Adv Parasitol 60:245-84. 53. Taylor MJ, Makunde WH, McGarry HF, Turner JD, Mand S, Hoerauf A. 2005. Macrofilaricidal activity after doxycycline treatment of Wuchereria bancrofti: a double-blind, randomised placebo-controlled trial. Lancet 365:2116-21. 54. Mand S, Pfarr K, Sahoo PK, Satapathy AK, Specht S, Klarmann U, Debrah AY, Ravindran B, Hoerauf A. 2009. Macrofilaricidal activity and amelioration of lymphatic pathology in bancroftian filariasis after 3 weeks of doxycycline followed by single-dose diethylcarbamazine. Am J Trop Med Hyg 81:702-11. 55. Hoerauf A, Mand S, Adjei O, Fleischer B, Buttner DW. 2001. Depletion of wolbachia endobacteria in Onchocerca volvulus by doxycycline and microfilaridermia after ivermectin treatment. Lancet 357:1415-6. 56. Hoerauf A, Mand S, Volkmann L, Buttner M, Marfo-Debrekyei Y, Taylor M, Adjei O, Buttner DW. 2003. Doxycycline in the treatment of human onchocerciasis: Kinetics of Wolbachia endobacteria reduction and of inhibition of embryogenesis in female Onchocerca worms. Microbes Infect 5:261-73. 57. Hoerauf A, Specht S, Buttner M, Pfarr K, Mand S, Fimmers R, Marfo-Debrekyei Y, Konadu P, Debrah AY, Bandi C, Brattig N, Albers A, Larbi J, Batsa L, Taylor MJ, Adjei O, Buttner DW. 2008. Wolbachia endobacteria depletion by doxycycline as antifilarial therapy has macrofilaricidal activity in onchocerciasis: a randomized placebo-controlled study. Med Microbiol Immunol 197:295-311.

202

58. Clare RH, Bardelle C, Harper P, Hong WD, Borjesson U, Johnston KL, Collier M, Myhill L, Cassidy A, Plant D, Plant H, Clark R, Cook DAN, Steven A, Archer J, McGillan P, Charoensutthivarakul S, Bibby J, Sharma R, Nixon GL, Slatko BE, Cantin L, Wu B, Turner J, Ford L, Rich K, Wigglesworth M, Berry NG, O'Neill PM, Taylor MJ, Ward SA. 2019. Industrial scale high-throughput screening delivers multiple fast acting macrofilaricides. Nat Commun 10:11. 59. Hong WD, Benayoud F, Nixon GL, Ford L, Johnston KL, Clare RH, Cassidy A, Cook DAN, Siu A, Shiotani M, Webborn PJH, Kavanagh S, Aljayyoussi G, Murphy E, Steven A, Archer J, Struever D, Frohberger SJ, Ehrens A, Hubner MP, Hoerauf A, Roberts AP, Hubbard ATM, Tate EW, Serwa RA, Leung SC, Qie L, Berry NG, Gusovsky F, Hemingway J, Turner JD, Taylor MJ, Ward SA, O'Neill PM. 2019. AWZ1066S, a highly specific anti-Wolbachia drug candidate for a short-course treatment of filariasis. Proc Natl Acad Sci U S A 116:1414-1419. 60. von Geldern TW, Morton HE, Clark RF, Brown BS, Johnston KL, Ford L, Specht S, Carr RA, Stolarik DF, Ma J, Rieser MJ, Struever D, Frohberger SJ, Koschel M, Ehrens A, Turner JD, Hubner MP, Hoerauf A, Taylor MJ, Ward SA, Marsh K, Kempf DJ. 2019. Discovery of ABBV-4083, a novel analog of Tylosin A that has potent anti-Wolbachia and anti-filarial activity. PLoS Negl Trop Dis 13:e0007159. 61. Jacobs RT, Lunde CS, Freund YR, Hernandez V, Li X, Xia Y, Carter DS, Berry PW, Halladay J, Rock F, Stefanakis R, Easom E, Plattner JJ, Ford L, Johnston KL, Cook DAN, Clare R, Cassidy A, Myhill L, Tyrer H, Gamble J, Guimaraes AF, Steven A, Lenz F, Ehrens A, Frohberger SJ, Koschel M, Hoerauf A, Hubner MP, McNamara CW, Bakowski MA, Turner JD, Taylor MJ, Ward SA. 2019. Boron-Pleuromutilins as Anti- Wolbachia Agents with Potential for Treatment of Onchocerciasis and Lymphatic Filariasis. J Med Chem 62:2521-2540. 62. Taylor MJ, von Geldern TW, Ford L, Hubner MP, Marsh K, Johnston KL, Sjoberg HT, Specht S, Pionnier N, Tyrer HE, Clare RH, Cook DAN, Murphy E, Steven A, Archer J, Bloemker D, Lenz F, Koschel M, Ehrens A, Metuge HM, Chunda VC, Ndongmo Chounna PW, Njouendou AJ, Fombad FF, Carr R, Morton HE, Aljayyoussi G, Hoerauf A, Wanji S, Kempf DJ, Turner JD, Ward SA. 2019. Preclinical development of an oral anti-Wolbachia macrolide drug for the treatment of lymphatic filariasis and onchocerciasis. Sci Transl Med 11. 63. Zhou W, Rousset F, O'Neil S. 1998. Phylogeny and PCR-based classification of Wolbachia strains using wsp gene sequences. Proc Biol Sci 265:509-15. 64. Werren JH, Baldo L, Clark ME. 2008. Wolbachia: master manipulators of invertebrate biology. Nat Rev Microbiol 6:741-51. 65. Lefoulon E, Bain O, Makepeace BL, d'Haese C, Uni S, Martin C, Gavotte L. 2016. Breakdown of coevolution between symbiotic bacteria Wolbachia and their filarial hosts. PeerJ 4:e1840. 66. Baldo L, Dunning Hotopp JC, Jolley KA, Bordenstein SR, Biber SA, Choudhury RR, Hayashi C, Maiden MC, Tettelin H, Werren JH. 2006. Multilocus sequence typing system for the endosymbiont Wolbachia pipientis. Appl Environ Microbiol 72:7098-110.

203

67. Bleidorn C, Gerth M. 2018. A critical re-evaluation of multilocus sequence typing (MLST) efforts in Wolbachia. FEMS Microbiol Ecol 94. 68. Ramirez-Puebla ST, Servin-Garciduenas LE, Ormeno-Orrillo E, Vera-Ponce de Leon A, Rosenblueth M, Delaye L, Martinez J, Martinez-Romero E. 2015. Species in Wolbachia? Proposal for the designation of 'Candidatus Wolbachia bourtzisii', 'Candidatus Wolbachia onchocercicola', 'Candidatus Wolbachia blaxteri', 'Candidatus Wolbachia brugii', 'Candidatus Wolbachia taylori', 'Candidatus Wolbachia collembolicola' and 'Candidatus Wolbachia multihospitum' for the different species within Wolbachia supergroups. Syst Appl Microbiol 38:390-9. 69. Wu M, Sun LV, Vamathevan J, Riegler M, Deboy R, Brownlie JC, McGraw EA, Martin W, Esser C, Ahmadinejad N, Wiegand C, Madupu R, Beanan MJ, Brinkac LM, Daugherty SC, Durkin AS, Kolonay JF, Nelson WC, Mohamoud Y, Lee P, Berry K, Young MB, Utterback T, Weidman J, Nierman WC, Paulsen IT, Nelson KE, Tettelin H, O'Neill SL, Eisen JA. 2004. Phylogenomics of the reproductive parasite Wolbachia pipientis wMel: a streamlined genome overrun by mobile genetic elements. PLoS Biol 2:E69. 70. Foster J, Ganatra M, Kamal I, Ware J, Makarova K, Ivanova N, Bhattacharyya A, Kapatral V, Kumar S, Posfai J, Vincze T, Ingram J, Moran L, Lapidus A, Omelchenko M, Kyrpides N, Ghedin E, Wang S, Goltsman E, Joukov V, Ostrovskaya O, Tsukerman K, Mazur M, Comb D, Koonin E, Slatko B. 2005. The Wolbachia genome of Brugia malayi: endosymbiont evolution within a human pathogenic nematode. PLoS Biol 3:e121. 71. Engelstadter J, Telschow A. 2009. Cytoplasmic incompatibility and host population structure. Heredity (Edinb) 103:196-207. 72. Cordaux R, Bouchon D, Greve P. 2011. The impact of endosymbionts on the evolution of host sex-determination mechanisms. Trends Genet 27:332-41. 73. Vavre F, Fleury F, Varaldi J, Fouillet P, Bouletreau M. 2000. Evidence for female mortality in Wolbachia-mediated cytoplasmic incompatibility in haplodiploid insects: epidemiologic and evolutionary consequences. Evolution 54:191-200. 74. Tram U, Sullivan W. 2002. Role of delayed nuclear envelope breakdown and mitosis in Wolbachia-induced cytoplasmic incompatibility. Science 296:1124-6. 75. Shropshire JD, On J, Layton EM, Zhou H, Bordenstein SR. 2018. One prophage WO gene rescues cytoplasmic incompatibility in Drosophila melanogaster. Proc Natl Acad Sci U S A 115:4987-4991. 76. LePage DP, Metcalf JA, Bordenstein SR, On J, Perlmutter JI, Shropshire JD, Layton EM, Funkhouser-Jones LJ, Beckmann JF, Bordenstein SR. 2017. Prophage WO genes recapitulate and enhance Wolbachia-induced cytoplasmic incompatibility. Nature 543:243-247. 77. Groenenboom MA, Hogeweg P. 2002. Space and the persistence of male-killing endosymbionts in insect populations. Proc Biol Sci 269:2509-18. 78. Bonte D, Hovestadt T, Poethke HJ. 2008. Male-killing endosymbionts: influence of environmental conditions on persistence of host metapopulation. BMC Evol Biol 8:243.

204

79. Kageyama D, Narita S, Watanabe M. 2012. Insect Sex Determination Manipulated by Their Endosymbionts: Incidences, Mechanisms and Implications. Insects 3:161-99. 80. Narita S, Kageyama D, Nomura M, Fukatsu T. 2007. Unexpected mechanism of symbiont-induced reversal of insect sex: feminizing Wolbachia continuously acts on the butterfly Eurema hecabe during larval development. Appl Environ Microbiol 73:4332-41. 81. Adachi-Hagimori T, Miura K, Stouthamer R. 2008. A new cytogenetic mechanism for bacterial endosymbiont-induced parthenogenesis in Hymenoptera. Proc Biol Sci 275:2667-73. 82. Brownlie JC, Cass BN, Riegler M, Witsenburg JJ, Iturbe-Ormaetxe I, McGraw EA, O'Neill SL. 2009. Evidence for metabolic provisioning by a common invertebrate endosymbiont, Wolbachia pipientis, during periods of nutritional stress. PLoS Pathog 5:e1000368. 83. Slatko BE, Taylor MJ, Foster JM. 2010. The Wolbachia endosymbiont as an anti- filarial nematode target. Symbiosis 51:55-65. 84. Bandi C, Anderson TJ, Genchi C, Blaxter ML. 1998. Phylogeny of Wolbachia in filarial nematodes. Proc Biol Sci 265:2407-13. 85. Brown AM, Wasala SK, Howe DK, Peetz AB, Zasada IA, Denver DR. 2016. Genomic evidence for plant-parasitic nematodes as the earliest Wolbachia hosts. Sci Rep 6:34955. 86. McGarry HF, Egerton GL, Taylor MJ. 2004. Population dynamics of Wolbachia bacterial endosymbionts in Brugia malayi. Mol Biochem Parasitol 135:57-67. 87. Taylor MJ, Voronin D, Johnston KL, Ford L. 2013. Wolbachia filarial interactions. Cell Microbiol 15:520-6. 88. Hoerauf A, Nissen-Pahle K, Schmetz C, Henkle-Duhrsen K, Blaxter ML, Buttner DW, Gallin MY, Al-Qaoud KM, Lucius R, Fleischer B. 1999. Tetracycline therapy targets intracellular bacteria in the filarial nematode Litomosoides sigmodontis and results in filarial infertility. J Clin Invest 103:11-8. 89. Bandi C, McCall JW, Genchi C, Corona S, Venco L, Sacchi L. 1999. Effects of tetracycline on the filarial worms Brugia pahangi and Dirofilaria immitis and their bacterial endosymbionts Wolbachia. Int J Parasitol 29:357-64. 90. Landmann F, Foster JM, Slatko B, Sullivan W. 2010. Asymmetric Wolbachia segregation during early Brugia malayi embryogenesis determines its distribution in adult host tissues. PLoS Negl Trop Dis 4:e758. 91. Foray V, Perez-Jimenez MM, Fattouh N, Landmann F. 2018. Wolbachia Control Stem Cell Behavior and Stimulate Germline Proliferation in Filarial Nematodes. Dev Cell 45:198-211 e3. 92. Casiraghi M, Bain O, Guerrero R, Martin C, Pocacqua V, Gardner SL, Franceschi A, Bandi C. 2004. Mapping the presence of Wolbachia pipientis on the phylogeny of filarial nematodes: evidence for symbiont loss during evolution. Int J Parasitol 34:191-203. 93. Desjardins CA, Cerqueira GC, Goldberg JM, Dunning Hotopp JC, Haas BJ, Zucker J, Ribeiro JM, Saif S, Levin JZ, Fan L, Zeng Q, Russ C, Wortman JR,

205

Fink DL, Birren BW, Nutman TB. 2013. Genomics of Loa loa, a Wolbachia-free filarial parasite of humans. Nat Genet 45:495-500. 94. McNulty SN, Foster JM, Mitreva M, Dunning Hotopp JC, Martin J, Fischer K, Wu B, Davis PJ, Kumar S, Brattig NW, Slatko BE, Weil GJ, Fischer PU. 2010. Endosymbiont DNA in endobacteria-free filarial nematodes indicates ancient horizontal genetic transfer. PLoS One 5:e11029. 95. Joubert DA, Walker T, Carrington LB, De Bruyne JT, Kien DH, Hoang Nle T, Chau NV, Iturbe-Ormaetxe I, Simmons CP, O'Neill SL. 2016. Establishment of a Wolbachia Superinfection in Aedes aegypti Mosquitoes as a Potential Approach for Future Resistance Management. PLoS Pathog 12:e1005434. 96. Hoffmann AA, Iturbe-Ormaetxe I, Callahan AG, Phillips BL, Billington K, Axford JK, Montgomery B, Turley AP, O'Neill SL. 2014. Stability of the wMel Wolbachia Infection following invasion into Aedes aegypti populations. PLoS Negl Trop Dis 8:e3115. 97. Flores HA, O'Neill SL. 2018. Controlling vector-borne diseases by releasing modified mosquitoes. Nat Rev Microbiol 16:508-518. 98. Moreira LA, Iturbe-Ormaetxe I, Jeffery JA, Lu G, Pyke AT, Hedges LM, Rocha BC, Hall-Mendelin S, Day A, Riegler M, Hugo LE, Johnson KN, Kay BH, McGraw EA, van den Hurk AF, Ryan PA, O'Neill SL. 2009. A Wolbachia symbiont in Aedes aegypti limits infection with dengue, Chikungunya, and Plasmodium. Cell 139:1268-78. 99. Thomas S, Verma J, Woolfit M, O'Neill SL. 2018. Wolbachia-mediated virus blocking in mosquito cells is dependent on XRN1-mediated viral RNA degradation and influenced by viral replication rate. PLoS Pathog 14:e1006879. 100. Carrington LB, Tran BCN, Le NTH, Luong TTH, Nguyen TT, Nguyen PT, Nguyen CVV, Nguyen HTC, Vu TT, Vo LT, Le DT, Vu NT, Nguyen GT, Luu HQ, Dang AD, Hurst TP, O'Neill SL, Tran VT, Kien DTH, Nguyen NM, Wolbers M, Wills B, Simmons CP. 2018. Field- and clinically derived estimates of Wolbachia-mediated blocking of dengue virus transmission potential in Aedes aegypti mosquitoes. Proc Natl Acad Sci U S A 115:361-366. 101. Aliota MT, Peinado SA, Velez ID, Osorio JE. 2016. The wMel strain of Wolbachia Reduces Transmission of Zika virus by Aedes aegypti. Sci Rep 6:28792. 102. Aliota MT, Walker EC, Uribe Yepes A, Velez ID, Christensen BM, Osorio JE. 2016. The wMel Strain of Wolbachia Reduces Transmission of Chikungunya Virus in Aedes aegypti. PLoS Negl Trop Dis 10:e0004677. 103. Dutra HL, Rocha MN, Dias FB, Mansur SB, Caragata EP, Moreira LA. 2016. Wolbachia Blocks Currently Circulating Zika Virus Isolates in Brazilian Aedes aegypti Mosquitoes. Cell Host Microbe 19:771-4. 104. O'Neill SL, Ryan PA, Turley AP, Wilson G, Retzki K, Iturbe-Ormaetxe I, Dong Y, Kenny N, Paton CJ, Ritchie SA, Brown-Kenyon J, Stanford D, Wittmeier N, Anders KL, Simmons CP. 2018. Scaled deployment of Wolbachia to protect the community from dengue and other Aedes transmitted arboviruses. Gates Open Res 2:36.

206

105. Schmidt TL, Barton NH, Rasic G, Turley AP, Montgomery BL, Iturbe-Ormaetxe I, Cook PE, Ryan PA, Ritchie SA, Hoffmann AA, O'Neill SL, Turelli M. 2017. Local introduction and heterogeneous spatial spread of dengue-suppressing Wolbachia through an urban population of Aedes aegypti. PLoS Biol 15:e2001894. 106. Indriani C, Ahmad RA, Wiratama BS, Arguni E, Supriyati E, Sasmono RT, Kisworini FY, Ryan PA, O'Neill SL, Simmons CP, Utarini A, Anders KL. 2018. Baseline Characterization of Dengue Epidemiology in Yogyakarta City, Indonesia, before a Randomized Controlled Trial of Wolbachia for Arboviral Disease Control. Am J Trop Med Hyg 99:1299-1307. 107. Specht S, Mand S, Marfo-Debrekyei Y, Debrah AY, Konadu P, Adjei O, Buttner DW, Hoerauf A. 2008. Efficacy of 2- and 4-week rifampicin treatment on the Wolbachia of Onchocerca volvulus. Parasitol Res 103:1303-9. 108. Supali T, Djuardi Y, Pfarr KM, Wibowo H, Taylor MJ, Hoerauf A, Houwing- Duistermaat JJ, Yazdanbakhsh M, Sartono E. 2008. Doxycycline treatment of Brugia malayi-infected persons reduces microfilaremia and adverse reactions after diethylcarbamazine and albendazole treatment. Clin Infect Dis 46:1385-93. 109. Ghedin E, Wang S, Spiro D, Caler E, Zhao Q, Crabtree J, Allen JE, Delcher AL, Guiliano DB, Miranda-Saavedra D, Angiuoli SV, Creasy T, Amedeo P, Haas B, El-Sayed NM, Wortman JR, Feldblyum T, Tallon L, Schatz M, Shumway M, Koo H, Salzberg SL, Schobel S, Pertea M, Pop M, White O, Barton GJ, Carlow CK, Crawford MJ, Daub J, Dimmic MW, Estes CF, Foster JM, Ganatra M, Gregory WF, Johnson NM, Jin J, Komuniecki R, Korf I, Kumar S, Laney S, Li BW, Li W, Lindblom TH, Lustigman S, Ma D, Maina CV, Martin DM, McCarter JP, McReynolds L, et al. 2007. Draft genome of the filarial nematode parasite Brugia malayi. Science 317:1756-60. 110. O'Connell EM, Bennuru S, Steel C, Dolan MA, Nutman TB. 2015. Targeting Filarial Abl-like Kinases: Orally Available, Food and Drug Administration- Approved Tyrosine Kinase Inhibitors Are Microfilaricidal and Macrofilaricidal. J Infect Dis 212:684-93. 111. Cotton JA, Bennuru S, Grote A, Harsha B, Tracey A, Beech R, Doyle SR, Dunn M, Hotopp JC, Holroyd N, Kikuchi T, Lambert O, Mhashilkar A, Mutowo P, Nursimulu N, Ribeiro JM, Rogers MB, Stanley E, Swapna LS, Tsai IJ, Unnasch TR, Voronin D, Parkinson J, Nutman TB, Ghedin E, Berriman M, Lustigman S. 2016. The genome of Onchocerca volvulus, agent of river blindness. Nat Microbiol 2:16216. 112. Crowther GJ, Shanmugam D, Carmona SJ, Doyle MA, Hertz-Fowler C, Berriman M, Nwaka S, Ralph SA, Roos DS, Van Voorhis WC, Aguero F. 2010. Identification of attractive drug targets in neglected-disease pathogens using an in silico approach. PLoS Negl Trop Dis 4:e804. 113. Bennuru S, O'Connell EM, Drame PM, Nutman TB. 2018. Mining Filarial Genomes for Diagnostic and Therapeutic Targets. Trends Parasitol 34:80-90. 114. Kumar S, Chaudhary K, Foster JM, Novelli JF, Zhang Y, Wang S, Spiro D, Ghedin E, Carlow CK. 2007. Mining predicted essential genes of Brugia malayi for nematode drug targets. PLoS One 2:e1189.

207

115. Raverdy S, Foster JM, Roopenian E, Carlow CK. 2008. The Wolbachia endosymbiont of Brugia malayi has an active pyruvate phosphate dikinase. Mol Biochem Parasitol 160:163-6. 116. Foster JM, Raverdy S, Ganatra MB, Colussi PA, Taron CH, Carlow CK. 2009. The Wolbachia endosymbiont of Brugia malayi has an active phosphoglycerate mutase: a candidate target for anti-filarial therapies. Parasitol Res 104:1047-52. 117. Choi YJ, Ghedin E, Berriman M, McQuillan J, Holroyd N, Mayhew GF, Christensen BM, Michalski ML. 2011. A deep sequencing approach to comparatively analyze the transcriptome of lifecycle stages of the filarial worm, Brugia malayi. PLoS Negl Trop Dis 5:e1409. 118. Grote A, Voronin D, Ding T, Twaddle A, Unnasch TR, Lustigman S, Ghedin E. 2017. Defining Brugia malayi and Wolbachia symbiosis by stage-specific dual RNA-seq. PLoS Negl Trop Dis 11:e0005357. 119. Taylor CM, Martin J, Rao RU, Powell K, Abubucker S, Mitreva M. 2013. Using existing drugs as leads for broad spectrum anthelmintics targeting protein kinases. PLoS Pathog 9:e1003149. 120. Bennuru S, Cotton JA, Ribeiro JM, Grote A, Harsha B, Holroyd N, Mhashilkar A, Molina DM, Randall AZ, Shandling AD, Unnasch TR, Ghedin E, Berriman M, Lustigman S, Nutman TB. 2016. Stage-Specific Transcriptome and Proteome Analyses of the Filarial Parasite Onchocerca volvulus and Its Wolbachia Endosymbiont. MBio 7. 121. Babayan SA, Allen JE, Taylor DW. 2012. Future prospects and challenges of vaccines against filariasis. Parasite Immunol 34:243-53. 122. Menon R, Gasser RB, Mitreva M, Ranganathan S. 2012. An analysis of the transcriptome of Teladorsagia circumcincta: its biological and biotechnological implications. BMC Genomics 13 Suppl 7:S10. 123. Zhan B, Arumugam S, Kennedy MW, Tricoche N, Lian LY, Asojo OA, Bennuru S, Bottazzi ME, Hotez PJ, Lustigman S, Klei TR. 2018. Ligand binding properties of two Brugia malayi fatty acid and retinol (FAR) binding proteins and their vaccine efficacies against challenge infection in gerbils. PLoS Negl Trop Dis 12:e0006772. 124. Morris CP, Bennuru S, Kropp LE, Zweben JA, Meng Z, Taylor RT, Chan K, Veenstra TD, Nutman TB, Mitre E. 2015. A Proteomic Analysis of the Body Wall, Digestive Tract, and Reproductive Tract of Brugia malayi. PLoS Negl Trop Dis 9:e0004054. 125. Bennuru S, Lustigman S, Abraham D, Nutman TB. 2017. Metabolite profiling of infection-associated metabolic markers of onchocerciasis. Mol Biochem Parasitol 215:58-69. 126. Drame PM, Bennuru S, Nutman TB. 2017. Discovery of Specific Antigens That Can Predict Microfilarial Intensity in Loa loa Infection. J Clin Microbiol 55:2671- 2678. 127. Drame PM, Meng Z, Bennuru S, Herrick JA, Veenstra TD, Nutman TB. 2016. Identification and Validation of Loa loa Microfilaria-Specific Biomarkers: a Rational Design Approach Using Proteomics and Novel Immunoassays. MBio 7:e02132-15.

208

128. Li BW, Rush AC, Weil GJ. 2014. High level expression of a glutamate-gated chloride channel gene in reproductive tissues of Brugia malayi may explain the sterilizing effect of ivermectin on filarial worms. Int J Parasitol Drugs Drug Resist 4:71-6. 129. Schwab AE, Churcher TS, Schwab AJ, Basanez MG, Prichard RK. 2007. An analysis of the population genetics of potential multi-drug resistance in Wuchereria bancrofti due to combination chemotherapy. Parasitology 134:1025- 40. 130. Anders S, Pyl PT, Huber W. 2015. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics 31:166-9. 131. Liao Y, Smyth GK, Shi W. 2014. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923-30. 132. Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34:525-7. 133. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. 2017. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417- 419. 134. Li B, Dewey CN. 2011. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12:323. 135. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. 2000. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 97:6652-7. 136. Naidoo S, Visser EA, Zwart L, Toit YD, Bhadauria V, Shuey LS. 2017. Dual RNA-Sequencing to Elucidate the Plant-Pathogen Duel. Curr Issues Mol Biol 27:127-142. 137. Niemiec MJ, Grumaz C, Ermert D, Desel C, Shankar M, Lopes JP, Mills IG, Stevens P, Sohn K, Urban CF. 2017. Dual transcriptome of the immediate neutrophil and Candida albicans interplay. BMC Genomics 18:696. 138. Li P, Xu Z, Sun X, Yin Y, Fan Y, Zhao J, Mao X, Huang J, Yang F, Zhu L. 2017. Transcript profiling of the immunological interactions between pleuropneumoniae serotype 7 and the host by dual RNA-seq. BMC Microbiol 17:193. 139. Bruno VM, Shetty AC, Yano J, Fidel PL, Jr., Noverr MC, Peters BM. 2015. Transcriptomic analysis of vulvovaginal candidiasis identifies a role for the NLRP3 inflammasome. MBio 6. 140. Humphrys MS, Creasy T, Sun Y, Shetty AC, Chibucos MC, Drabek EF, Fraser CM, Farooq U, Sengamalay N, Ott S, Shou H, Bavoil PM, Mahurkar A, Myers GS. 2013. Simultaneous transcriptional profiling of bacteria and their host cells. PLoS One 8:e80597. 141. Hennessy RC, Glaring MA, Olsson S, Stougaard P. 2017. Transcriptomic profiling of microbe-microbe interactions reveals the specific response of the biocontrol strain P. fluorescens In5 to the phytopathogen Rhizoctonia solani. BMC Res Notes 10:376.

209

142. Bing X, Attardo GM, Vigneron A, Aksoy E, Scolari F, Malacrida A, Weiss BL, Aksoy S. 2017. Unravelling the relationship between the tsetse fly and its obligate symbiont Wigglesworthia: transcriptomic and metabolomic landscapes reveal highly integrated physiological networks. Proc Biol Sci 284. 143. Camilios-Neto D, Bonato P, Wassem R, Tadra-Sfeir MZ, Brusamarello-Santos LC, Valdameri G, Donatti L, Faoro H, Weiss VA, Chubatsu LS, Pedrosa FO, Souza EM. 2014. Dual RNA-seq transcriptional analysis of wheat roots colonized by Azospirillum brasilense reveals up-regulation of nutrient acquisition and cell cycle genes. BMC Genomics 15:378. 144. Westermann AJ, Gorski SA, Vogel J. 2012. Dual RNA-seq of pathogen and host. Nat Rev Microbiol 10:618-30. 145. Westermann AJ, Barquist L, Vogel J. 2017. Resolving host-pathogen interactions by dual RNA-seq. PLoS Pathog 13:e1006033. 146. Kumar N, Lin M, Zhao X, Ott S, Santana-Cruz I, Daugherty S, Rikihisa Y, Sadzewicz L, Tallon LJ, Fraser CM, Dunning Hotopp JC. 2016. Efficient Enrichment of Bacterial mRNA from Host-Bacteria Total RNA Samples. Sci Rep 6:34850. 147. Zhao W, He X, Hoadley KA, Parker JS, Hayes DN, Perou CM. 2014. Comparison of RNA-Seq by poly (A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling. BMC Genomics 15:419. 148. O'Neil D, Glowatz H, Schlumpberger M. 2013. Ribosomal RNA depletion for efficient use of RNA-seq capacity. Curr Protoc Mol Biol Chapter 4:Unit 4 19. 149. Kumar N, Creasy T, Sun Y, Flowers M, Tallon LJ, Dunning Hotopp JC. 2012. Efficient subtraction of insect rRNA prior to transcriptome analysis of Wolbachia- Drosophila lateral gene transfer. BMC Res Notes 5:230. 150. Amorim-Vaz S, Tran Vdu T, Pradervand S, Pagni M, Coste AT, Sanglard D. 2015. RNA Enrichment Method for Quantitative Transcriptional Analysis of Pathogens In Vivo Applied to the Fungus Candida albicans. MBio 6:e00942-15. 151. Choi YJ, Aliota MT, Mayhew GF, Erickson SM, Christensen BM. 2014. Dual RNA-seq of parasite and host reveals gene expression dynamics during filarial worm-mosquito interactions. PLoS Negl Trop Dis 8:e2905. 152. Sheppard DC, Rieg G, Chiang LY, Filler SG, Edwards JE, Jr., Ibrahim AS. 2004. Novel inhalational murine model of invasive pulmonary aspergillosis. Antimicrob Agents Chemother 48:1908-11. 153. Chiang LY, Sheppard DC, Gravelat FN, Patterson TF, Filler SG. 2008. Aspergillus fumigatus stimulates leukocyte adhesion molecules and cytokine production by endothelial cells in vitro and during invasive pulmonary disease. Infect Immun 76:3429-38. 154. Morgulis A, Gertz EM, Schaffer AA, Agarwala R. 2006. A fast and symmetric DUST implementation to mask low-complexity DNA sequences. J Comput Biol 13:1028-40. 155. Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105-11.

210

156. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. 2007. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res 35:3100-8. 157. Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory- efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25. 158. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078-9. 159. Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods 9:357-9. 160. Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics 25:1754-60. 161. Kim D, Langmead B, Salzberg SL. 2015. HISAT: a fast spliced aligner with low memory requirements. Nat Methods 12:357-60. 162. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28:511-5. 163. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. 2011. Improving RNA- Seq expression estimates by correcting for fragment bias. Genome Biol 12:R22. 164. Roberts A, Pimentel H, Trapnell C, Pachter L. 2011. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics 27:2325-9. 165. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. 2013. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol 31:46-53. 166. Roberts A, Pachter L. 2013. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10:71-3. 167. Si Y, Liu P, Li P, Brutnell TP. 2014. Model-based clustering for RNA-seq data. Bioinformatics 30:197-205. 168. Garalde DR, Snell EA, Jachimowicz D, Sipos B, Lloyd JH, Bruce M, Pantic N, Admassu T, James P, Warland A, Jordan M, Ciccone J, Serra S, Keenan J, Martin S, McNeill L, Wallace EJ, Jayasinghe L, Wright C, Blasco J, Young S, Brocklebank D, Juul S, Clarke J, Heron AJ, Turner DJ. 2018. Highly parallel direct RNA sequencing on an array of nanopores. Nat Methods 15:201-206. 169. Bezanson J, Edelman A, Karpinski S, Shah VB. 2017. Julia: A Fresh Approach to Numerical Computing. SIAM Review 59:65-98. 170. Wynd S, Melrose WD, Durrheim DN, Carron J, Gyapong M. 2007. Understanding the community impact of lymphatic filariasis: a review of the sociocultural literature. Bull World Health Organ 85:493-8. 171. Ottesen EA. 2006. Lymphatic filariasis: Treatment, control and elimination. Adv Parasitol 61:395-441. 172. Adinarayanan S, Critchley J, Das PK, Gelband H. 2007. Diethylcarbamazine (DEC)-medicated salt for community-based control of lymphatic filariasis. Cochrane Database Syst Rev doi:10.1002/14651858.CD003758.pub2:CD003758.

211

173. Campbell WC. 1991. Ivermectin as an antiparasitic agent for use in humans. Annu Rev Microbiol 45:445-74. 174. Ballesteros C, Tritten L, O'Neill M, Burkman E, Zaky WI, Xia J, Moorhead A, Williams SA, Geary TG. 2016. The Effects of Ivermectin on Brugia malayi Females In Vitro: A Transcriptomic Approach. PLoS Negl Trop Dis 10:e0004929. 175. Noroes J, Dreyer G, Santos A, Mendes VG, Medeiros Z, Addiss D. 1997. Assessment of the efficacy of diethylcarbamazine on adult Wuchereria bancrofti in vivo. Trans R Soc Trop Med Hyg 91:78-81. 176. Derua YA, Kisinza WN, Simonsen PE. 2018. Lymphatic filariasis control in Tanzania: infection, disease perceptions and drug uptake patterns in an endemic community after multiple rounds of mass drug administration. Parasit Vectors 11:429. 177. Johnston KL, Ford L, Taylor MJ. 2014. Overcoming the challenges of drug discovery for neglected tropical diseases: the A.WOL experience. J Biomol Screen 19:335-43. 178. Turner JD, Sharma R, Al Jayoussi G, Tyrer HE, Gamble J, Hayward L, Priestley RS, Murphy EA, Davies J, Waterhouse D, Cook DAN, Clare RH, Cassidy A, Steven A, Johnston KL, McCall J, Ford L, Hemingway J, Ward SA, Taylor MJ. 2017. Albendazole and antibiotics synergize to deliver short-course anti- Wolbachia curative treatments in preclinical models of filariasis. Proc Natl Acad Sci U S A 114:E9712-E9721. 179. Taylor MJ, Hoerauf A, Townson S, Slatko BE, Ward SA. 2014. Anti-Wolbachia drug discovery and development: safe macrofilaricides for onchocerciasis and lymphatic filariasis. Parasitology 141:119-27. 180. Holman AG, Davis PJ, Foster JM, Carlow CK, Kumar S. 2009. Computational prediction of essential genes in an unculturable endosymbiotic bacterium, Wolbachia of Brugia malayi. BMC Microbiol 9:243. 181. Wu B, Novelli J, Foster J, Vaisvila R, Conway L, Ingram J, Ganatra M, Rao AU, Hamza I, Slatko B. 2009. The heme biosynthetic pathway of the obligate Wolbachia endosymbiont of Brugia malayi as a potential anti-filarial drug target. PLoS Negl Trop Dis 3:e475. 182. Mutafchiev Y, Bain O, Williams Z, McCall JW, Michalski ML. 2014. Intraperitoneal development of the filarial nematode Brugia malayi in the Mongolian jird (Meriones unguiculatus). Parasitol Res 113:1827-35. 183. Chung M, Teigen L, Liu H, Libro S, Shetty A, Kumar N, Zhao X, Bromley RE, Tallon LJ, Sadzewicz L, Fraser CM, Rasko DA, Filler SG, Foster JM, Michalski ML, Bruno VM, Dunning Hotopp JC. 2018. Targeted enrichment outperforms other enrichment techniques and enables more multi-species RNA-Seq analyses. Sci Rep 8:13377. 184. Langfelder P, Horvath S. 2008. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559. 185. Langfelder P, Horvath S. 2007. Eigengene networks for studying the relationships between co-expression modules. BMC Syst Biol 1:54.

212

186. Stogios PJ, Prive GG. 2004. The BACK domain in BTB-kelch proteins. Trends Biochem Sci 29:634-7. 187. Wang Q, Burles K, Couturier B, Randall CM, Shisler J, Barry M. 2014. Ectromelia virus encodes a BTB/kelch protein, EVM150, that inhibits NF-kappaB signaling. J Virol 88:4853-65. 188. Chen HY, Chen RH. 2016. Cullin 3 Ubiquitin Ligases in Cancer Biology: Functions and Therapeutic Implications. Front Oncol 6:113. 189. Pintard L, Willems A, Peter M. 2004. Cullin-based ubiquitin ligases: Cul3-BTB complexes join the family. EMBO J 23:1681-7. 190. Linder P. 2006. Dead-box proteins: a family affair--active and passive players in RNP-remodeling. Nucleic Acids Res 34:4168-80. 191. Durr H, Flaus A, Owen-Hughes T, Hopfner KP. 2006. Snf2 family ATPases and DExx box helicases: differences and unifying concepts from high-resolution crystal structures. Nucleic Acids Res 34:4160-7. 192. Chiu LY, Gong F, Miller KM. 2017. Bromodomain proteins: repairing DNA damage within chromatin. Philos Trans R Soc Lond B Biol Sci 372. 193. Cassandri M, Smirnov A, Novelli F, Pitolli C, Agostini M, Malewicz M, Melino G, Raschella G. 2017. Zinc-finger proteins in health and disease. Cell Death Discov 3:17071. 194. Boyer LA, Latek RR, Peterson CL. 2004. The SANT domain: a unique histone- tail-binding module? Nat Rev Mol Cell Biol 5:158-63. 195. Soza-Ried C, Hess I, Netuschil N, Schorpp M, Boehm T. 2010. Essential role of c-myb in definitive hematopoiesis is evolutionarily conserved. Proc Natl Acad Sci U S A 107:17304-8. 196. Burglin TR, Affolter M. 2016. Homeodomain proteins: an update. Chromosoma 125:497-521. 197. He W, Parker R. 2000. Functions of Lsm proteins in mRNA degradation and splicing. Curr Opin Cell Biol 12:346-50. 198. Kufel J, Allmang C, Petfalski E, Beggs J, Tollervey D. 2003. Lsm Proteins are required for normal processing and stability of ribosomal RNAs. J Biol Chem 278:2147-56. 199. Petersen TE, Thogersen HC, Skorstengaard K, Vibe-Pedersen K, Sahl P, Sottrup- Jensen L, Magnusson S. 1983. Partial primary structure of bovine plasma fibronectin: three types of internal homology. Proc Natl Acad Sci U S A 80:137- 41. 200. Gregory WF, Atmadja AK, Allen JE, Maizels RM. 2000. The abundant larval transcript-1 and -2 genes of Brugia malayi encode stage-specific candidate vaccine antigens for filariasis. Infect Immun 68:4174-9. 201. Wassarman P, Chen J, Cohen N, Litscher E, Liu C, Qi H, Williams Z. 1999. Structure and function of the mammalian egg zona pellucida. J Exp Zool 285:251- 8. 202. Wassarman KM. 2007. 6S RNA: a small RNA regulator of transcription. Curr Opin Microbiol 10:164-8.

213

203. Darby AC, Gill AC, Armstrong SD, Hartley CS, Xia D, Wastling JM, Makepeace BL. 2014. Integrated transcriptomic and proteomic analysis of the global response of Wolbachia to doxycycline-induced stress. ISME J 8:925-37. 204. Faucher SP, Friedlander G, Livny J, Margalit H, Shuman HA. 2010. 6S RNA optimizes intracellular multiplication. Proc Natl Acad Sci U S A 107:7533-8. 205. Wassarman KM. 2018. 6S RNA, a Global Regulator of Transcription. Microbiol Spectr 6. 206. Akopian D, Shen K, Zhang X, Shan SO. 2013. Signal recognition particle: an essential protein-targeting machine. Annu Rev Biochem 82:693-721. 207. Anantharaman V, Aravind L. 2002. The GOLD domain, a novel protein module involved in Golgi function and secretion. Genome Biol 3:research0023. 208. Fischer HM, Wheat CW, Heckel DG, Vogel H. 2008. Evolutionary origins of a novel host plant detoxification gene in butterflies. Mol Biol Evol 25:809-20. 209. Finn RN. 2007. Vertebrate yolk complexes and the functional implications of phosvitins and other subdomains in vitellogenins. Biol Reprod 76:926-35. 210. Park JH, Attardo GM, Hansen IA, Raikhel AS. 2006. GATA factor translation is the final downstream step in the amino acid/target-of-rapamycin-mediated vitellogenin gene expression in the anautogenous mosquito Aedes aegypti. J Biol Chem 281:11167-76. 211. Price DP, Nagarajan V, Churbanov A, Houde P, Milligan B, Drake LL, Gustafson JE, Hansen IA. 2011. The fat body transcriptomes of the yellow fever mosquito Aedes aegypti, pre- and post- blood meal. PLoS One 6:e22573. 212. Randall TA, Perera L, London RE, Mueller GA. 2013. Genomic, RNAseq, and molecular modeling evidence suggests that the major allergen domain in insects evolved from a homodimeric origin. Genome Biol Evol 5:2344-58. 213. Nolan T, Petris E, Muller HM, Cronin A, Catteruccia F, Crisanti A. 2011. Analysis of two novel midgut-specific promoters driving transgene expression in Anopheles stephensi mosquitoes. PLoS One 6:e16471. 214. Javadian E, Macdonald WW. 1974. The effect of infection with Brugia pahangi and Dirofilaria repens on the egg-production of Aedes aegypti. Ann Trop Med Parasitol 68:477-81. 215. Gleave K, Cook D, Taylor MJ, Reimer LJ. 2016. Filarial infection influences mosquito behaviour and fecundity. Sci Rep 6:36319. 216. de Jong WW, Leunissen JA, Voorter CE. 1993. Evolution of the alpha- crystallin/small heat-shock protein family. Mol Biol Evol 10:103-26. 217. Huntington JA. 2011. Serpin structure, function and dysfunction. J Thromb Haemost 9 Suppl 1:26-34. 218. Fujisawa T, Filippakopoulos P. 2017. Functions of bromodomain-containing proteins and their roles in homeostasis and cancer. Nat Rev Mol Cell Biol 18:246- 262. 219. Jostes S, Nettersheim D, Fellermeyer M, Schneider S, Hafezi F, Honecker F, Schumacher V, Geyer M, Kristiansen G, Schorle H. 2017. The bromodomain inhibitor JQ1 triggers growth arrest and apoptosis in testicular germ cell tumours in vitro and in vivo. J Cell Mol Med 21:1300-1314.

214

220. Bid HK, Kerk S. 2016. BET bromodomain inhibitor (JQ1) and tumor angiogenesis. Oncoscience 3:316-317. 221. Perez-Salvia M, Esteller M. 2017. Bromodomain inhibitors and cancer therapy: From structures to applications. Epigenetics 12:323-339. 222. Da Costa D, Agathanggelou A, Perry T, Weston V, Petermann E, Zlatanou A, Oldreive C, Wei W, Stewart G, Longman J, Smith E, Kearns P, Knapp S, Stankovic T. 2013. BET inhibition as a single or combined therapeutic approach in primary paediatric B-precursor acute lymphoblastic leukaemia. Blood Cancer J 3:e126. 223. Andrieu G, Belkina AC, Denis GV. 2016. Clinical trials for BET inhibitors run ahead of the science. Drug Discov Today Technol 19:45-50. 224. Filippakopoulos P, Qi J, Picaud S, Shen Y, Smith WB, Fedorov O, Morse EM, Keates T, Hickman TT, Felletar I, Philpott M, Munro S, McKeown MR, Wang Y, Christie AL, West N, Cameron MJ, Schwartz B, Heightman TD, La Thangue N, French CA, Wiest O, Kung AL, Knapp S, Bradner JE. 2010. Selective inhibition of BET bromodomains. Nature 468:1067-73. 225. Stathis A, Bertoni F. 2018. BET Proteins as Targets for Anticancer Treatment. Cancer Discov 8:24-36. 226. Ceron J, Rual JF, Chandra A, Dupuy D, Vidal M, van den Heuvel S. 2007. Large- scale RNAi screens identify novel genes that interact with the C. elegans retinoblastoma pathway as well as splicing-related components with synMuv B activity. BMC Dev Biol 7:30. 227. Sonnichsen B, Koski LB, Walsh A, Marschall P, Neumann B, Brehm M, Alleaume AM, Artelt J, Bettencourt P, Cassin E, Hewitson M, Holz C, Khan M, Lazik S, Martin C, Nitzsche B, Ruer M, Stamford J, Winzi M, Heinkel R, Roder M, Finell J, Hantsch H, Jones SJ, Jones M, Piano F, Gunsalus KC, Oegema K, Gonczy P, Coulson A, Hyman AA, Echeverri CJ. 2005. Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature 434:462-9. 228. Rao R, Well GJ. 2002. In vitro effects of antibiotics on Brugia malayi worm survival and reproduction. J Parasitol 88:605-11. 229. Comley JC, Rees MJ, Turner CH, Jenkins DC. 1989. Colorimetric quantitation of filarial viability. Int J Parasitol 19:77-83. 230. Trabucco SE, Gerstein RM, Evens AM, Bradner JE, Shultz LD, Greiner DL, Zhang H. 2015. Inhibition of bromodomain proteins for the treatment of human diffuse large B-cell lymphoma. Clin Cancer Res 21:113-22. 231. Luck AN, Evans CC, Riggs MD, Foster JM, Moorhead AR, Slatko BE, Michalski ML. 2014. Concurrent transcriptional profiling of Dirofilaria immitis and its Wolbachia endosymbiont throughout the nematode life cycle reveals coordinated gene expression. BMC Genomics 15:1041. 232. Lehane MJ, Laurence BR. 1977. Flight muscle ultrastructure of susceptible and refractory mosquitoes parasitized by larval Brugia pahangi. Parasitology 74:87- 92. 233. Michalski ML, Griffiths KG, Williams SA, Kaplan RM, Moorhead AR. 2011. The NIH-NIAID Filariasis Research Reagent Resource Center. PLoS Negl Trop Dis 5:e1261.

215

234. Griffiths KG, Mayhew GF, Zink RL, Erickson SM, Fuchs JF, McDermott CM, Christensen BM, Michalski ML. 2009. Use of microarray hybridization to identify Brugia genes involved in mosquito infectivity. Parasitol Res 106:227-35. 235. Griffiths KG, Alworth LC, Harvey SB, Michalski ML. 2010. Using an intravenous catheter to carry out abdominal lavage in the gerbil. Lab Anim (NY) 39:143-8. 236. Chung M, Teigen L, Libro S, Bromley RE, Kumar N, Sadzewicz L, Tallon LJ, Foster JM, Michalski ML, Dunning Hotopp JC. 2018. Multispecies Transcriptomics Data Set of Brugia malayi, Its Wolbachia Endosymbiont wBm, and Aedes aegypti across the B. malayi Life Cycle. Microbiol Resour Announc 7. 237. Chung M, Adkins RS, Shetty AC, Sadzewicz L, Tallon LJ, Fraser CM, Rasko DA, Mahurkar A, Dunning Hotopp J. 2018. FADU: A Feature Counting Tool for Prokaryotic RNA-Seq Analysis. bioRxiv doi:10.1101/337600. 238. Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139-40. 239. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S. 2014. InterProScan 5: genome- scale protein function classification. Bioinformatics 30:1236-40. 240. Small ST, Reimer LJ, Tisch DJ, King CL, Christensen BM, Siba PM, Kazura JW, Serre D, Zimmerman PA. 2016. Population genomics of the filarial nematode parasite Wuchereria bancrofti from mosquitoes. Mol Ecol 25:1465-77. 241. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. 2004. Versatile and open software for comparing large genomes. Genome Biol 5:R12. 242. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455-77. 243. Rissman AI, Mau B, Biehl BS, Darling AE, Glasner JD, Perna NT. 2009. Reordering contigs of draft genomes using the Mauve aligner. Bioinformatics 25:2071-3. 244. Delcher AL, Bratke KA, Powers EC, Salzberg SL. 2007. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23:673-9. 245. Galens K, Orvis J, Daugherty S, Creasy HH, Angiuoli S, White O, Wortman J, Mahurkar A, Giglio MG. 2011. The IGS Standard Operating Procedure for Automated Prokaryotic Annotation. Stand Genomic Sci 4:244-51. 246. Angiuoli SV, Salzberg SL. 2011. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics 27:334-42. 247. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF. 2009. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75:7537-41.

216

248. Klasson L, Walker T, Sebaihia M, Sanders MJ, Quail MA, Lord A, Sanders S, Earl J, O'Neill SL, Thomson N, Sinkins SP, Parkhill J. 2008. Genome evolution of Wolbachia strain wPip from the Culex pipiens group. Mol Biol Evol 25:1877- 87. 249. Klasson L, Westberg J, Sapountzis P, Naslund K, Lutnaes Y, Darby AC, Veneti Z, Chen L, Braig HR, Garrett R, Bourtzis K, Andersson SG. 2009. The mosaic genome structure of the Wolbachia wRi strain infecting Drosophila simulans. Proc Natl Acad Sci U S A 106:5725-30. 250. Darby AC, Armstrong SD, Bah GS, Kaur G, Hughes MA, Kay SM, Koldkjaer P, Rainbow L, Radford AD, Blaxter ML, Tanya VN, Trees AJ, Cordaux R, Wastling JM, Makepeace BL. 2012. Analysis of gene expression from the Wolbachia genome of a filarial nematode supports both metabolic and defensive roles within the symbiosis. Genome Res 22:2467-77. 251. Comandatore F, Sassera D, Montagna M, Kumar S, Koutsovoulos G, Thomas G, Repton C, Babayan SA, Gray N, Cordaux R, Darby A, Makepeace B, Blaxter M. 2013. Phylogenomics and analysis of shared genes suggest a single transition to mutualism in Wolbachia of nematodes. Genome Biol Evol 5:1668-74. 252. Ellegaard KM, Klasson L, Naslund K, Bourtzis K, Andersson SG. 2013. Comparative genomics of Wolbachia and the bacterial species concept. PLoS Genet 9:e1003381. 253. Nikoh N, Hosokawa T, Moriyama M, Oshima K, Hattori M, Fukatsu T. 2014. Evolutionary origin of insect-Wolbachia nutritional mutualism. Proc Natl Acad Sci U S A 111:10257-62. 254. Sutton ER, Harris SR, Parkhill J, Sinkins SP. 2014. Comparative genome analysis of Wolbachia strain wAu. BMC Genomics 15:928. 255. Lindsey AR, Werren JH, Richards S, Stouthamer R. 2016. Comparative Genomics of a Parthenogenesis-Inducing Wolbachia Symbiont. G3 (Bethesda) 6:2113-23. 256. Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688-90. 257. Riley DR, Angiuoli SV, Crabtree J, Dunning Hotopp JC, Tettelin H. 2012. Using Sybil for interactive comparative genomics of microbes on the web. Bioinformatics 28:160-6. 258. Geniez S, Foster JM, Kumar S, Moumen B, Leproust E, Hardy O, Guadalupe M, Thomas SJ, Boone B, Hendrickson C, Bouchon D, Greve P, Slatko BE. 2012. Targeted genome enrichment for efficient purification of endosymbiont DNA from host DNA. Symbiosis 58:201-207. 259. Ioannidis P, Johnston KL, Riley DR, Kumar N, White JR, Olarte KT, Ott S, Tallon LJ, Foster JM, Taylor MJ, Dunning Hotopp JC. 2013. Extensively duplicated and transcriptionally active recent lateral gene transfer from a bacterial Wolbachia endosymbiont to its host filarial nematode Brugia malayi. BMC Genomics 14:639. 260. Dunning Hotopp JC, Slatko BE, Foster JM. 2017. Targeted Enrichment and Sequencing of Recent Endosymbiont-Host Lateral Gene Transfers. Sci Rep 7:857.

217

261. Wang K, Li M, Hakonarson H. 2010. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38:e164. 262. Delcher AL, Phillippy A, Carlton J, Salzberg SL. 2002. Fast algorithms for large- scale genome alignment and comparison. Nucleic Acids Res 30:2478-83. 263. Skovgaard M, Jensen LJ, Brunak S, Ussery D, Krogh A. 2001. On the total number of genes and their length distribution in complete microbial genomes. Trends Genet 17:425-8. 264. Gould SJ, William B. Provine Collection on Evolution and Genetics. 1984. Hen's teeth and horse's toes. Norton, New York. 265. Ramasamy D, Mishra AK, Lagier JC, Padhmanabhan R, Rossi M, Sentausa E, Raoult D, Fournier PE. 2014. A polyphasic strategy incorporating genomic data for the taxonomic description of novel bacterial species. Int J Syst Evol Microbiol 64:384-91. 266. Schleifer KH. 2009. Classification of Bacteria and Archaea: past, present and future. Syst Appl Microbiol 32:533-42. 267. Bisen PS, Debnath M, Prasad GB. 2012. Identification and Classification of Microbes. In P. S. Bisen MDaGBP (ed), Microbes doi:doi:10.1002/9781118311912.ch4. 268. Rossello-Mora R, Amann R. 2001. The species concept for prokaryotes. FEMS Microbiol Rev 25:39-67. 269. Brenner DJ, Fanning GR, Steigerwalt AG. 1972. Deoxyribonucleic acid relatedness among species of Erwinia and between Erwinia species and other enterobacteria. J Bacteriol 110:12-7. 270. Dubnau D, Smith I, Morell P, Marmur J. 1965. Gene conservation in Bacillus species. I. Conserved genetic and nucleic acid base sequence homologies. Proc Natl Acad Sci U S A 54:491-8. 271. Boone DR, Castenholz RW, SpringerLink (Online service). 2001. Bergey's Manual® of Systematic Bacteriology : Overview: A Phylogenetic Backbone and Taxonomic Framework for Procaryotic Systematics, Second Edition. ed. Springer New York : Imprint: Springer, New York, NY. 272. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glockner FO. 2013. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41:D590-6. 273. McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P. 2012. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J 6:610-8. 274. Palys T, Nakamura LK, Cohan FM. 1997. Discovery and classification of ecological diversity in the bacterial world: the role of DNA sequence data. Int J Syst Bacteriol 47:1145-56. 275. Janda JM, Abbott SL. 2007. 16S rRNA gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls. J Clin Microbiol 45:2761-4.

218

276. Rossello-Mora R, Amann R. 2015. Past and future species definitions for Bacteria and Archaea. Syst Appl Microbiol 38:209-16. 277. Yarza P, Yilmaz P, Pruesse E, Glockner FO, Ludwig W, Schleifer KH, Whitman WB, Euzeby J, Amann R, Rossello-Mora R. 2014. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol 12:635-45. 278. Chun J, Oren A, Ventosa A, Christensen H, Arahal DR, da Costa MS, Rooney AP, Yi H, Xu XW, De Meyer S, Trujillo ME. 2018. Proposed minimal standards for the use of genome data for the taxonomy of prokaryotes. Int J Syst Evol Microbiol 68:461-466. 279. Karlsson R, Gonzales-Siles L, Boulund F, Svensson-Stadler L, Skovbjerg S, Karlsson A, Davidson M, Hulth S, Kristiansson E, Moore ER. 2015. Proteotyping: Proteomic characterization, classification and identification of microorganisms--A prospectus. Syst Appl Microbiol 38:246-57. 280. De Bruyne K, Slabbinck B, Waegeman W, Vauterin P, De Baets B, Vandamme P. 2011. Bacterial species identification from MALDI-TOF mass spectra through data analysis and machine learning. Syst Appl Microbiol 34:20-9. 281. Goris J, Konstantinidis KT, Klappenbach JA, Coenye T, Vandamme P, Tiedje JM. 2007. DNA-DNA hybridization values and their relationship to whole- genome sequence similarities. Int J Syst Evol Microbiol 57:81-91. 282. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ, Stackebrandt E, Van de Peer Y, Vandamme P, Thompson FL, Swings J. 2005. Opinion: Re- evaluating prokaryotic species. Nat Rev Microbiol 3:733-9. 283. Auch AF, von Jan M, Klenk HP, Goker M. 2010. Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison. Stand Genomic Sci 2:117-34. 284. Aanensen DM, Spratt BG. 2005. The multilocus sequence typing network: mlst.net. Nucleic Acids Res 33:W728-33. 285. Larsen MV, Cosentino S, Rasmussen S, Friis C, Hasman H, Marvig RL, Jelsbak L, Sicheritz-Ponten T, Ussery DW, Aarestrup FM, Lund O. 2012. Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol 50:1355- 61. 286. Auch AF, Klenk HP, Goker M. 2010. Standard operating procedure for calculating genome-to-genome distances based on high-scoring segment pairs. Stand Genomic Sci 2:142-8. 287. Meier-Kolthoff JP, Auch AF, Klenk HP, Goker M. 2013. Genome sequence- based species delimitation with confidence intervals and improved distance functions. BMC Bioinformatics 14:60. 288. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, Khouri H, Radune D, Dimitrov G, Watkins K, O'Connor KJ, Smith S, Utterback TR, White O, Rubens CE, Grandi G, Madoff LC, Kasper DL, Telford JL, Wessels MR, Rappuoli R,

219

Fraser CM. 2005. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A 102:13950-5. 289. Darling AC, Mau B, Blattner FR, Perna NT. 2004. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394-403. 290. Lindsey ARI, Bordenstein SR, Newton ILG, Rasgon JL. 2016. Wolbachia pipientis should not be split into multiple species: A response to Ramirez-Puebla et al., "Species in Wolbachia? Proposal for the designation of 'Candidatus Wolbachia bourtzisii', 'Candidatus Wolbachia onchocercicola', 'Candidatus Wolbachia blaxteri', 'Candidatus Wolbachia brugii', 'Candidatus Wolbachia taylori', 'Candidatus Wolbachia collembolicola' and 'Candidatus Wolbachia multihospitum' for the different species within Wolbachia supergroups". Syst Appl Microbiol 39:220-222. 291. Dumler JS, Barbet AF, Bekker CP, Dasch GA, Palmer GH, Ray SC, Rikihisa Y, Rurangirwa FR. 2001. Reorganization of genera in the families Rickettsiaceae and Anaplasmataceae in the order Rickettsiales: unification of some species of Ehrlichia with Anaplasma, Cowdria with Ehrlichia and Ehrlichia with Neorickettsia, descriptions of six new species combinations and designation of Ehrlichia equi and 'HGE agent' as subjective synonyms of Ehrlichia phagocytophila. Int J Syst Evol Microbiol 51:2145-65. 292. Land M, Hauser L, Jun SR, Nookaew I, Leuze MR, Ahn TH, Karpinets T, Lund O, Kora G, Wassenaar T, Poudel S, Ussery DW. 2015. Insights from 20 years of bacterial genome sequencing. Funct Integr Genomics 15:141-61. 293. Andersson SG, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UC, Podowski RM, Naslund AK, Eriksson AS, Winkler HH, Kurland CG. 1998. The genome sequence of and the origin of mitochondria. Nature 396:133-40. 294. Ogata H, Audic S, Renesto-Audiffren P, Fournier PE, Barbe V, Samson D, Roux V, Cossart P, Weissenbach J, Claverie JM, Raoult D. 2001. Mechanisms of evolution in and R. prowazekii. Science 293:2093-8. 295. Brayton KA, Kappmeyer LS, Herndon DR, Dark MJ, Tibbals DL, Palmer GH, McGuire TC, Knowles DP, Jr. 2005. Complete genome sequencing of Anaplasma marginale reveals that the surface is skewed to two superfamilies of outer membrane proteins. Proc Natl Acad Sci U S A 102:844-9. 296. Dunning Hotopp JC, Lin M, Madupu R, Crabtree J, Angiuoli SV, Eisen JA, Seshadri R, Ren Q, Wu M, Utterback TR, Smith S, Lewis M, Khouri H, Zhang C, Niu H, Lin Q, Ohashi N, Zhi N, Nelson W, Brinkac LM, Dodson RJ, Rosovitz MJ, Sundaram J, Daugherty SC, Davidsen T, Durkin AS, Gwinn M, Haft DH, Selengut JD, Sullivan SA, Zafar N, Zhou L, Benahmed F, Forberger H, Halpin R, Mulligan S, Robinson J, White O, Rikihisa Y, Tettelin H. 2006. Comparative genomics of emerging human agents. PLoS Genet 2:e21. 297. Collins NE, Liebenberg J, de Villiers EP, Brayton KA, Louw E, Pretorius A, Faber FE, van Heerden H, Josemans A, van Kleef M, Steyn HC, van Strijp MF, Zweygarth E, Jongejan F, Maillard JC, Berthier D, Botha M, Joubert F, Corton CH, Thomson NR, Allsopp MT, Allsopp BA. 2005. The genome of the

220

heartwater agent Ehrlichia ruminantium contains multiple tandem repeats of actively variable copy number. Proc Natl Acad Sci U S A 102:838-43. 298. Segata N, Bornigen D, Morgan XC, Huttenhower C. 2013. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat Commun 4:2304. 299. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, Hugenholtz P. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36:996-1004. 300. Xia X, Xie Z, Salemi M, Chen L, Wang Y. 2003. An index of substitution saturation and its application. Mol Phylogenet Evol 26:1-7. 301. Simmons MP, Ochoterena H, Freudenstein JV. 2002. Amino acid vs. nucleotide characters: challenging preconceived notions. Mol Phylogenet Evol 24:78-90. 302. Simmons MP, Carr TG, O'Neill K. 2004. Relative character-state space, amount of potential phylogenetic information, and heterogeneity of nucleotide and amino acid characters. Mol Phylogenet Evol 32:913-26. 303. Townsend JP, Lopez-Giraldez F, Friedman R. 2008. The phylogenetic informativeness of nucleotide and amino acid sequences for reconstructing the vertebrate tree. J Mol Evol 67:437-47. 304. Treangen TJ, Ondov BD, Koren S, Phillippy AM. 2014. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol 15:524. 305. Li L, Stoeckert CJ, Jr., Roos DS. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13:2178-89. 306. Philippe H, Forterre P. 1999. The rooting of the universal tree of life is not reliable. J Mol Evol 49:509-23. 307. Walker DH, Ismail N. 2008. Emerging and re-emerging rickettsioses: endothelial cell infection and early disease events. Nat Rev Microbiol 6:375-86. 308. Georgiades K, Merhej V, El Karkouri K, Raoult D, Pontarotti P. 2011. Gene gain and loss events in Rickettsia and Orientia species. Biol Direct 6:6. 309. Liao HM, Chao CC, Lei H, Li B, Tsai S, Hung GC, Ching WM, Lo SC. 2017. Intraspecies comparative genomics of three strains of Orientia tsutsugamushi with different antibiotic sensitivity. Genom Data 12:84-88. 310. Fournier PE, Dumler JS, Greub G, Zhang J, Wu Y, Raoult D. 2003. Gene sequence-based criteria for identification of new rickettsia isolates and description of Rickettsia heilongjiangensis sp. nov. J Clin Microbiol 41:5456-65. 311. Gillespie JJ, Beier MS, Rahman MS, Ammerman NC, Shallom JM, Purkayastha A, Sobral BS, Azad AF. 2007. Plasmids and rickettsial evolution: insight from Rickettsia felis. PLoS One 2:e266. 312. Salje J. 2017. Orientia tsutsugamushi: A neglected but fascinating obligate intracellular bacterial pathogen. PLoS Pathog 13:e1006657. 313. Rar VA, Pukhovskaya NM, Ryabchikova EI, Vysochina NP, Bakhmetyeva SV, Zdanovskaia NI, Ivanov LI, Tikunova NV. 2015. Molecular-genetic and ultrastructural characteristics of 'Candidatus Ehrlichia khabarensis', a new member of the Ehrlichia genus. Ticks Tick Borne Dis 6:658-67.

221

314. Inokuma H, Brouqui P, Drancourt M, Raoult D. 2001. Citrate synthase gene sequence: a new tool for phylogenetic analysis and identification of Ehrlichia. J Clin Microbiol 39:3031-9. 315. Gofton AW, Loh SM, Barbosa AD, Paparini A, Gillett A, Macgregor J, Oskam CL, Ryan UM, Irwin PJ. 2018. A novel Ehrlichia species in blood and Ixodes ornithorhynchi ticks from platypuses (Ornithorhynchus anatinus) in Queensland and Tasmania, Australia. Ticks Tick Borne Dis 9:435-442. 316. Ybanez AP, Matsumoto K, Kishimoto T, Inokuma H. 2012. Molecular analyses of a potentially novel Anaplasma species closely related to Anaplasma phagocytophilum detected in sika deer (Cervus nippon yesoensis) in Japan. Vet Microbiol 157:232-6. 317. Li H, Zheng YC, Ma L, Jia N, Jiang BG, Jiang RR, Huo QB, Wang YW, Liu HB, Chu YL, Song YD, Yao NN, Sun T, Zeng FY, Dumler JS, Jiang JF, Cao WC. 2015. Human infection with a novel tick-borne Anaplasma species in China: a surveillance study. Lancet Infect Dis 15:663-70. 318. Yang J, Liu Z, Niu Q, Liu J, Han R, Liu G, Shi Y, Luo J, Yin H. 2016. Molecular survey and characterization of a novel Anaplasma species closely related to Anaplasma capra in ticks, northwestern China. Parasit Vectors 9:603. 319. Battilani M, De Arcangeli S, Balboni A, Dondi F. 2017. Genetic diversity and molecular epidemiology of Anaplasma. Infect Genet Evol 49:195-211. 320. Philip CB, Hadlow WJ, Hughes LE. 1954. Studies on salmon poisoning disease of canines. I. The rickettsial relationships and pathogenicity of Neorickettsia helmintheca. Exp Parasitol 3:336-50. 321. Parker CT, Tindall BJ, Garrity GM. 2015. International Code of Nomenclature of Prokaryotes. Int J Syst Evol Microbiol doi:10.1099/ijsem.0.000778. 322. Murray RG, Schleifer KH. 1994. Taxonomic notes: a proposal for recording the properties of putative taxa of procaryotes. Int J Syst Bacteriol 44:174-6. 323. Maiden MC, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q, Zhou J, Zurth K, Caugant DA, Feavers IM, Achtman M, Spratt BG. 1998. Multilocus sequence typing: a portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci U S A 95:3140-5. 324. Carver TJ, Rutherford KM, Berriman M, Rajandream MA, Barrell BG, Parkhill J. 2005. ACT: the Artemis Comparison Tool. Bioinformatics 21:3422-3. 325. Carver T, Berriman M, Tivey A, Patel C, Bohme U, Barrell BG, Parkhill J, Rajandream MA. 2008. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database. Bioinformatics 24:2672-6. 326. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2013. GenBank. Nucleic Acids Res 41:D36-42. 327. Yoon SH, Ha SM, Lim J, Kwon S, Chun J. 2017. A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Van Leeuwenhoek 110:1281-1286. 328. Edgar RC. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460-1.

222

329. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421. 330. Nguyen LT, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268-74. 331. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods 14:587-589. 332. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. 2018. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol 35:518-522. 333. Letunic I, Bork P. 2016. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res 44:W242-5. 334. Paradis E, Claude J, Strimmer K. 2004. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20:289-90. 335. Schliep KP. 2011. phangorn: phylogenetic analysis in R. Bioinformatics 27:592- 3. 336. Katoh K, Misawa K, Kuma K, Miyata T. 2002. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059-66. 337. Chernomor O, von Haeseler A, Minh BQ. 2016. Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices. Syst Biol 65:997-1008. 338. Hertig M, Wolbach SB. 1924. Studies on Rickettsia-Like Micro-Organisms in Insects. J Med Res 44:329-374 7. 339. Hertig M. 1936. The Rickettsia, Wolbachia pipientis (gen. et sp.n.) and Associated Inclusions of the Mosquito, Culex pipiens. Parasitology 28:453-486. 340. Suitor EC, Weiss E. 1961. Isolation af a Rickettsialike Microorganism (Wolbachia Persica, N. SP.) From Argas Persicus (Oken). The Journal of Infectious Diseases 108:95-106. 341. Philip CB. 1956. COMMENTS ON THE CLASSIFICATION OF THE ORDER RICKETTSIALES. Canadian Journal of Microbiology 2:261-270. 342. Fox GE, Magrum LJ, Balch WE, Wolfe RS, Woese CR. 1977. Classification of methanogenic bacteria by 16S ribosomal RNA characterization. Proc Natl Acad Sci U S A 74:4537-41. 343. Woese CR, Kandler O, Wheelis ML. 1990. Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87:4576-9. 344. Woese CR, Fox GE. 1977. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A 74:5088-90. 345. Woo PC, Lau SK, Teng JL, Tse H, Yuen KY. 2008. Then and now: use of 16S rDNA gene sequencing for bacterial identification and discovery of novel bacteria in clinical microbiology laboratories. Clin Microbiol Infect 14:908-34. 346. Rousset F, Solignac M. 1995. Evolution of single and double Wolbachia symbioses during speciation in the Drosophila simulans complex. Proc Natl Acad Sci U S A 92:6389-93.

223

347. Werren JH, Zhang W, Guo LR. 1995. Evolution and phylogeny of Wolbachia: reproductive parasites of arthropods. Proc Biol Sci 261:55-63. 348. Bourtzis K, Nirgianaki A, Onyango P, Savakis C. 1994. A prokaryotic dnaA sequence in Drosophila melanogaster: Wolbachia infection and cytoplasmic incompatibility among laboratory strains. Insect Mol Biol 3:131-42. 349. Braig HR, Zhou W, Dobson SL, O'Neill SL. 1998. Cloning and characterization of a gene encoding the major surface protein of the bacterial endosymbiont Wolbachia pipientis. J Bacteriol 180:2373-8. 350. Baldo L, Lo N, Werren JH. 2005. Mosaic nature of the wolbachia surface protein. J Bacteriol 187:5406-18. 351. Baldo L, Desjardins CA, Russell JA, Stahlhut JK, Werren JH. 2010. Accelerated microevolution in an outer membrane protein (OMP) of the intracellular bacteria Wolbachia. BMC Evol Biol 10:48. 352. Werren JH, Bartos JD. 2001. Recombination in Wolbachia. Curr Biol 11:431-5. 353. Jiggins FM, Hurst GD, Yang Z. 2002. Host-symbiont conflicts: positive selection on an outer membrane protein of parasitic but not mutualistic Rickettsiaceae. Mol Biol Evol 19:1341-9. 354. Baldo L, Bartos JD, Werren JH, Bazzocchi C, Casiraghi M, Panelli S. 2002. Different rates of nucleotide substitutions in Wolbachia endosymbionts of arthropods and nematodes: arms race or host shifts? Parassitologia 44:179-87. 355. Baldo L, Bordenstein S, Wernegreen JJ, Werren JH. 2006. Widespread recombination throughout Wolbachia genomes. Mol Biol Evol 23:437-49. 356. Paraskevopoulos C, Bordenstein SR, Wernegreen JJ, Werren JH, Bourtzis K. 2006. Toward a Wolbachia multilocus sequence typing system: discrimination of Wolbachia strains present in Drosophila species. Curr Microbiol 53:388-95. 357. Casiraghi M, Bordenstein SR, Baldo L, Lo N, Beninati T, Wernegreen JJ, Werren JH, Bandi C. 2005. Phylogeny of Wolbachia pipientis based on gltA, groEL and ftsZ gene sequences: clustering of arthropod and nematode symbionts in the F supergroup, and evidence for further diversity in the Wolbachia tree. Microbiology 151:4015-22. 358. Johnson JL. 1980. The use of DNA homology in bacterial taxonomy and identification. Clinical Microbiology Newsletter 2:1-3. 359. Konstantinidis KT, Ramette A, Tiedje JM. 2006. The bacterial species definition in the genomic era. Philos Trans R Soc Lond B Biol Sci 361:1929-40. 360. Chung M, Munro JB, Tettelin H, Dunning Hotopp JC. 2018. Using Core Genome Alignments To Assign Bacterial Species. mSystems 3. 361. Richter M, Rossello-Mora R. 2009. Shifting the genomic gold standard for the prokaryotic species definition. Proc Natl Acad Sci U S A 106:19126-31. 362. Parks DH, Chuvochina M, Waite DW, Rinke C, Skarshewski A, Chaumeil PA, Hugenholtz P. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol doi:10.1038/nbt.4229. 363. Kim M, Oh HS, Park SC, Chun J. 2014. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 64:346-51.

224

364. Deloger M, El Karoui M, Petit MA. 2009. A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol 191:91-9. 365. Gupta A, Sharma VK. 2015. Using the taxon-specific genes for the taxonomic classification of bacterial genomes. BMC Genomics 16:396. 366. Lang JM, Darling AE, Eisen JA. 2013. Phylogeny of bacterial and archaeal genomes using conserved genes: supertrees and supermatrices. PLoS One 8:e62510. 367. Han N, Qiang Y, Zhang W. 2016. ANItools web: a web tool for fast genome comparison within multiple bacterial strains. Database (Oxford) 2016. 368. Lee I, Ouk Kim Y, Park SC, Chun J. 2016. OrthoANI: An improved algorithm and software for calculating average nucleotide identity. Int J Syst Evol Microbiol 66:1100-1103. 369. Chan CX, Ragan MA. 2013. Next-generation phylogenomics. Biol Direct 8:3. 370. Darling AE, Mau B, Perna NT. 2010. progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5:e11147. 371. Parte AC. 2018. LPSN - List of Prokaryotic names with Standing in Nomenclature (bacterio.net), 20 years on. Int J Syst Evol Microbiol 68:1825- 1829. 372. Liesack W, Weyland H, Stackebrandt E. 1991. Potential risks of gene amplification by PCR as determined by 16S rDNA analysis of a mixed-culture of strict barophilic bacteria. Microb Ecol 21:191-8. 373. Amann RI, Ludwig W, Schleifer KH. 1995. Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 59:143-69. 374. Konstantinidis KT, Rossello-Mora R. 2015. Classifying the uncultivated microbial majority: A place for metagenomic data in the Candidatus proposal. Syst Appl Microbiol 38:223-30. 375. Whitman WB. 2015. Genome sequences as the type material for taxonomic descriptions of prokaryotes. Syst Appl Microbiol 38:217-22. 376. Whitman WB. 2016. Modest proposals to expand the type material for naming of prokaryotes. Int J Syst Evol Microbiol 66:2108-12. 377. Konstantinidis KT, Rossello-Mora R, Amann R. 2017. Uncultivated microbes in need of their own taxonomy. ISME J 11:2399-2406. 378. Rossello-Mora R, Whitman WB. 2018. Dialogue on the nomenclature and classification of prokaryotes. Syst Appl Microbiol doi:10.1016/j.syapm.2018.07.002. 379. O'Neill SL, Pettigrew MM, Sinkins SP, Braig HR, Andreadis TG, Tesh RB. 1997. In vitro cultivation of Wolbachia pipientis in an Aedes albopictus cell line. Insect Mol Biol 6:33-9. 380. Woolfit M, Iturbe-Ormaetxe I, Brownlie JC, Walker T, Riegler M, Seleznev A, Popovici J, Rances E, Wee BA, Pavlides J, Sullivan MJ, Beatson SA, Lane A, Sidhu M, McMeniman CJ, McGraw EA, O'Neill SL. 2013. Genomic evolution of the pathogenic Wolbachia strain, wMelPop. Genome Biol Evol 5:2189-204.

225

381. McMeniman CJ, Lane AM, Fong AW, Voronin DA, Iturbe-Ormaetxe I, Yamada R, McGraw EA, O'Neill SL. 2008. Host adaptation of a Wolbachia strain after long-term serial passage in mosquito cell lines. Appl Environ Microbiol 74:6963- 9. 382. Fenollar F, La Scola B, Inokuma H, Dumler JS, Taylor MJ, Raoult D. 2003. Culture and phenotypic characterization of a Wolbachia pipientis isolate. J Clin Microbiol 41:5434-41. 383. Dobson SL, Marsland EJ, Veneti Z, Bourtzis K, O'Neill SL. 2002. Characterization of Wolbachia host cell range via the in vitro establishment of infections. Appl Environ Microbiol 68:656-60. 384. Noda H, Miyoshi T, Koizumi Y. 2002. In vitro cultivation of Wolbachia in insect and mammalian cell lines. In Vitro Cell Dev Biol Anim 38:423-7. 385. Dunning Hotopp JC, Clark ME, Oliveira DC, Foster JM, Fischer P, Munoz Torres MC, Giebel JD, Kumar N, Ishmael N, Wang S, Ingram J, Nene RV, Shepard J, Tomkins J, Richards S, Spiro DJ, Ghedin E, Slatko BE, Tettelin H, Werren JH. 2007. Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science 317:1753-6. 386. Chafee ME, Funk DJ, Harrison RG, Bordenstein SR. 2010. Lateral phage transfer in obligate intracellular bacteria (wolbachia): verification from natural populations. Mol Biol Evol 27:501-5. 387. Brelsfoard C, Tsiamis G, Falchetto M, Gomulski LM, Telleria E, Alam U, Doudoumis V, Scolari F, Benoit JB, Swain M, Takac P, Malacrida AR, Bourtzis K, Aksoy S. 2014. Presence of extensive Wolbachia symbiont insertions discovered in the genome of its host Glossina morsitans morsitans. PLoS Negl Trop Dis 8:e2728. 388. Dunning Hotopp JC, Klasson L. 2018. The Complexities and Nuances of Analyzing the Genome of Drosophila ananassae and Its Wolbachia Endosymbiont. G3 (Bethesda) 8:373-374. 389. Glowska E, Dragun-Damian A, Dabert M, Gerth M. 2015. New Wolbachia supergroups detected in quill (Acari: Syringophilidae). Infect Genet Evol 30:140-6. 390. Chung M, Small ST, Serre D, Zimmerman PA, Dunning Hotopp JC. 2017. Draft genome sequence of the Wolbachia endosymbiont of Wuchereria bancrofti wWb. Pathog Dis 75. 391. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621-8. 392. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. 2008. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 320:1344-9. 393. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR. 2008. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133:523-36.

226