Molecular Genotyping and Whole Genome Sequencing of Canadian Cyclospora cayetanensis Specimens

by

Christine Allison Yanta

A Thesis presented to The University of Guelph

In partial fulfilment of requirements for the degree of Master of Science in Pathobiology

Guelph, Ontario, Canada

© Christine Yanta, May, 2021


ABSTRACT

MOLECULAR GENOTYPING AND WHOLE GENOME SEQUENCING OF CANADIAN CYCLOSPORA CAYETANENSIS SPECIMENS

Christine Yanta, University of Guelph, 2021

Advisors: Professor John R. Barta and Dr. Rebecca A. Guy

To improve our understanding of the genetic diversity of the human-infecting coccidium Cyclospora cayetanensis, 160 clinical fecal samples representing Canadian cases from four provinces (ON=119, QC=24, BC=7, NL=10), identified between 2010 and 2020, were genotyped using a next-generation sequencing targeted amplicon approach consisting of eight markers. Genotyping data were collected for at least one marker for 96.2% (154/160) of specimens, but only 36.9% (59/160) of specimens had genotyping data for all eight markers. We identified eighteen genetic clusters among the 79.4% (127/160) of specimens that successfully clustered. Whole genome sequence assemblies were generated for five of the genotyped cyclosporiasis cases, including a hybrid assembly that improved on the current reference assembly (GCF_002999335.1); this refined assembly was 44.2 Mbp in length (297 contigs, N50 of 654,019 bp). These first molecular data generated from Canadian cyclosporiasis cases will support and inform future epidemiological investigations aimed at mitigating cyclosporiasis outbreaks.


ACKNOWLEDGEMENTS

The success of this Master’s research project would not have been possible without the help and support of the amazing people around me. First and foremost, I would like to thank my co-advisor Dr. Rebecca Guy. Dr. Guy first welcomed me as a co-op student in her parasitology lab and has since provided me with numerous opportunities to learn new and fascinating parasite and genomics-related things. Without her full confidence and continuous support, I would never have flourished to become the researcher and person I am today. I am also very grateful for my other co-advisor, Professor John Barta. Throughout my research, he always believed in my abilities and provided support when needed. I would also like to thank my advisory committee member, Dr. Claire Jardine, for taking the time to provide feedback and advice throughout the project. I would also like to thank my family at NML Guelph. You welcomed me with open arms and since then were always willing to help a girl out. I would especially like to thank the parasitology group, Laura Martin and Marisa Rankin, for the endless support and for making the days stuck in the biological safety cabinet with clinical specimens fly by. Thank you to Amy Feddema for being one of my biggest cheerleaders and always having the best coffee break chats. Last but certainly not least, I would like to give a huge shout-out to my family for all of their unconditional love and emotional support throughout this journey. To my parents, Donald and Allice, words simply cannot express how grateful I am to have the both of you in my life. Without your encouragement to chase my dreams and discover my potential, I would not be where I am and who I am today. I thank my brother Jeff for always being there for me and helping me in any way that he could. Finally, I want to thank my partner, Tim Hill, for being the rock in my life. Thank you for always listening, supporting, feeding, and believing in me through thick and thin. You showed me how to simply take a break when things get tough and to enjoy the fresh air and nature around us. I am so fortunate to have met you and cannot wait to see what is next to come in our journey together.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF APPENDICES
DECLARATION OF WORK PERFORMED
CHAPTER 1: INTRODUCTION AND LITERATURE REVIEW
  1.0 Introduction
  2.0 Biological Characteristics
    2.1 Morphology
    2.2 Life Cycle
    2.3 Infectivity
  3.0 Epidemiology
    3.1 Geographic Distribution and Prevalence
    3.2 Transmission
    3.3 Risk Factors and Symptoms
    3.4 Treatment
  4.0 Detection
    4.1 Clinical Detection Methods
    4.2 Detection in Produce
  5.0 Molecular Characteristics
    5.1 Molecular Structure
    5.2 Gene Targets and Schemes
  6.0 Conclusion
STUDY OBJECTIVES AND RATIONALE
CHAPTER 2: EVALUATION OF THE ENSEMBLE-BASED MLST SCHEME
  2.1 Abstract
  2.2 Introduction
  2.3 Materials and Methods
    2.3.1 Clinical Fecal Specimens
    2.3.2 Ethics
    2.3.3 DNA Extraction
    2.3.4 PCR Amplification and Amplicon Clean-Up
    2.3.5 DNA Sequencing
    2.3.6 Sequence Analysis and Clustering
  2.4 Results
    2.4.1 Sequencing Success
    2.4.2 Clustering
  2.5 Discussion
  2.6 Conclusion
CHAPTER 3: WHOLE GENOME SEQUENCING OF CANADIAN CYCLOSPORA CAYETANENSIS SPECIMENS
  3.1 Abstract
  3.2 Introduction
  3.3 Materials and Methods
    3.3.1 Stool Specimens
    3.3.2 Ethics
    3.3.3 Oocyst Purification
    3.3.4 DNA Extraction
    3.3.5 Illumina MiSeq Sequencing
    3.3.6 Oxford Nanopore MinION Sequencing
    3.3.7 Assembling Short-read Assemblies with MiSeq Reads
    3.3.8 Assembling Hybrid Assemblies with MinION and MiSeq Reads
  3.4 Results
    3.4.1 Sample Purification
    3.4.2 Short-Read Assemblies
    3.4.3 Hybrid Assemblies
  3.5 Discussion
  3.6 Conclusion
CHAPTER 4: GENERAL DISCUSSIONS AND CONCLUSIONS
REFERENCES
APPENDICES


LIST OF TABLES

Table 1.1. Features of Cyclospora cayetanensis that cause challenges in epidemiological case linkage studies

Table 1.2. Commercial assays for detecting protozoan parasites

Table 1.3. Molecular methods developed to discriminate Cyclospora cayetanensis isolates

Table 2.1. Modified PCR reaction conditions used for the Cyclospora cayetanensis targeted amplicon NGS scheme

Table 2.2. Provincial and yearly distribution of Cyclospora cayetanensis specimens received

Table 2.3. Sequencing success rate and haplotypes observed for each marker in the subtyping scheme

Table 2.4. Haplotypes present in the nuclear markers HC378 and HC360i2

Table 2.5. Mitochondrial junction subtypes observed in Canadian Cyclospora cayetanensis specimens

Table 2.6. Haplotypes observed in the mitochondrial MSR marker in Canadian Cyclospora cayetanensis specimens

Table 2.7. Distribution of Cyclospora cayetanensis samples that successfully clustered

Table 2.8. Success rate of each loci of samples that successfully clustered

Table 3.1. Loss of Cyclospora cayetanensis in samples attributed to the purification process

Table 3.2. Illumina read metadata for each Cyclospora cayetanensis specimen

Table 3.3. Comparison of Canadian Cyclospora cayetanensis short-read assemblers using Quast statistics

Table 3.4. BUSCO assessment of short read assemblies of Canadian Cyclospora cayetanensis assemblies

Table 3.5. MinION sequencing metadata of a Canadian Cyclospora cayetanensis specimen

Table 3.6. Quast assembly statistics of the Cyclospora cayetanensis hybrid assemblies

Table 3.7. BUSCO statistics of the Cyclospora cayetanensis hybrid assemblies


LIST OF FIGURES

Figure 1.1. Phylogenetic relationship between Cyclospora spp. and other coccidian organisms

Figure 1.2. Unstained Cyclospora cayetanensis oocyst under light microscopy

Figure 1.3. Geographic origins of Cyclospora cayetanensis isolates from which genomes have been assembled

Figure 1.4. Mitochondrial genome organization of Cyclospora cayetanensis

Figure 1.5. genome organization of Cyclospora cayetanensis

Figure 2.1. Cumulative success rate of markers sequenced in the Cyclospora cayetanensis targeted deep amplicon genotyping scheme

Figure 2.2. Distribution of haplotypes observed for nuclear CDC1-4 markers

Figure 2.3. Cluster dendrogram of Canadian Cyclospora cayetanensis specimens


LIST OF ABBREVIATIONS

ASV: Amplicon Sequencing Variants
BCCDC: British Columbia Centre for Disease Control
CDC: Centers for Disease Control and Prevention
DNA: Deoxyribonucleic Acid
LSPQ: Laboratoire de Santé Publique du Québec
MLST: Multilocus Sequence Typing
NGS: Next Generation Sequencing
NPHL: Newfoundland Public Health Laboratory
Nu: Nuclear
Mt: Mitochondrion
PCR: Polymerase Chain Reaction
PHO: Public Health Ontario
qPCR: Quantitative Polymerase Chain Reaction
SAF: Sodium-acetate Acetic-acid Formalin
SNP: Single Nucleotide Polymorphism
SWGA: Selective Whole Genome Amplification
USA: United States of America
WGA: Whole Genome Amplification
WGS: Whole Genome Sequence


LIST OF APPENDICES

Appendix 1. Script for analyzing Cyclospora cayetanensis targeted amplicon sequences

Appendix 2. Preparation of Sheather’s solution for density gradient purification of Cyclospora cayetanensis oocysts

Appendix 3. Whole genome amplification study of Canadian Cyclospora cayetanensis specimens

Appendix 4. Script for generating whole genome assemblies for Cyclospora cayetanensis


DECLARATION OF WORK PERFORMED

This research was supported by a grant awarded to Dr. Brent Dixon, Dr. Rebecca A. Guy, and Associate Professor James Wasmuth by the Ontario Ministry of Agriculture, Food and Rural Affairs (OMAFRA), along with funding from the Public Health Agency of Canada (PHAC). Christine Yanta was the recipient of the Alexander Graham Bell Canada Graduate Scholarship from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Veterinary College Graduate Scholarship, and the Pathobiology Award for Excellence. All work was carried out under the supervision of Dr. Rebecca A. Guy at the Division of Enteric Diseases of the Public Health Agency of Canada and Dr. John R. Barta at the Department of Pathobiology at the University of Guelph.

In Chapter 2, Christine Yanta genotyped Canadian Cyclospora cayetanensis specimens received from Dr. Antoine Corbeil (Public Health Ontario), Dr. Hervé Menan (Québec Public Health Laboratory), Dr. Robert Needle (Public Health and Microbiology Laboratory Eastern Health), and Dr. Muhammad Morshed (BC Centre for Disease Control). The U.S. Centers for Disease Control and Prevention (CDC) provided the NGS targeted amplicon genotyping scheme. Christine Yanta performed the DNA extractions, PCRs, library preparations, sequencing, and data analyses with input from Dr. Joel Barratt (CDC), Katelyn Houghton (CDC), and Dr. Rebecca A. Guy. The manuscript was written by Christine Yanta and reviewed and edited by Dr. Rebecca A. Guy, Dr. John R. Barta, and Dr. Claire Jardine.

In Chapter 3, Christine Yanta generated whole genome sequence data for Canadian C. cayetanensis specimens received from Dr. Antoine Corbeil (Public Health Ontario) and Dr. Robert Needle (Public Health and Microbiology Laboratory Eastern Health). Christine Yanta semi-purified and enriched the C. cayetanensis oocysts following the method provided by Dr. Yvonne Qvarnstrom (CDC). Christine Yanta performed the DNA extractions and whole genome library preparations, sequenced the libraries on both the Illumina MiSeq and Oxford Nanopore MinION sequencers, assembled the whole genomes, and attempted whole genome amplification methods (Appendix 3) with input from Dr. Rebecca A. Guy, Dr. John R. Barta, and Stephen Pollo. The manuscript was written by Christine Yanta and reviewed and edited by Dr. Rebecca A. Guy, Dr. John R. Barta, and Dr. Claire Jardine. This work was presented as a poster at the virtual Molecular Parasitology Meeting XXXI in September 2020.

CHAPTER 1: INTRODUCTION AND LITERATURE REVIEW

1.0 Introduction

Enteric protozoan parasites are a major global contributor to diarrheal disease, accounting for nearly 357 million illnesses and 2.94 million disability-adjusted life years in 2010 (Torgerson et al., 2015). The majority of these infections affect people living in developing countries, while increased travel and globalization of the food supply have resulted in greater risks of acquiring enteric protozoan pathogens in developed countries (Ortega & Sanchez, 2010). Unfortunately, enteric parasites are often under-reported due to inadequate surveillance systems and detection methods, posing many challenges in mitigating outbreaks (Ryan et al., 2017).

Cyclospora cayetanensis is an emerging and understudied enteric parasite that is becoming recognized worldwide for its potential to cause widespread diarrheal illness. Among illnesses linked to imported food in the United States, C. cayetanensis is the second most common cause of diarrheal disease, after Salmonella spp. (Gould et al., 2017). This parasite is becoming a common causative agent of foodborne outbreaks in Canada and the United States, with outbreaks occurring nearly every summer (Kozak et al., 2013; Casillas et al., 2018; BCCDC Annual Summary, 2019; Almeria et al., 2019).

This parasite belongs to the protistan phylum Apicomplexa, under the subclass Coccidia. The coccidian parasites follow the typical apicomplexan life cycle and are capable of infecting all vertebrates (Barta, 2001). Previously, Cyclospora spp. infections were mistaken for large Cryptosporidium, blue-green algae (cyanobacterium-like bodies), or coccidia-like bodies before the organism was officially characterized in 1993 (Ortega & Sanchez, 2010). Nineteen Cyclospora species have been described, with Cyclospora cayetanensis being the only one with the ability to infect humans (Giangaspero & Gasser, 2019). Phylogenetic analysis of the small subunit ribosomal ribonucleic acid (SSU rRNA) gene indicates that C. cayetanensis is most closely related to the avian-infecting Eimeria species (Figure 1.1).


Figure 1.1. Phylogenetic relationships among Cyclospora spp. and other apicomplexan organisms, rooted using Cryptosporidium as the outgroup. The neighbour-joining tree (Kimura two parameter model) constructed using nuclear small subunit (18S) rDNA sequences suggests C. cayetanensis is most closely related to Eimeria spp. that infect birds. Bootstrap values at the nodes were calculated based on 1000 replicates.
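As an illustration of the distance model named in the caption of Figure 1.1, the Kimura two-parameter distance between two aligned sequences can be sketched as follows. This is a minimal Python sketch and not the analysis used to build the tree: the function name and the example sequences are hypothetical, and the choice to skip any site containing a gap or ambiguous base is an assumption made for simplicity.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences.

    P is the proportion of transitions, Q the proportion of transversions:
    d = -0.5 * ln((1 - 2P - Q) * sqrt(1 - 2Q)).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: ignore gaps and ambiguous bases entirely.
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q)  # fails if 1 - 2Q < 0
    if inner <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(inner)


# Hypothetical toy alignment: one transition (A -> G) in ten sites.
example_distance = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

A neighbour-joining tree would then be built from the matrix of such pairwise distances; that step is not shown here.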

Since C. cayetanensis was characterized in the early 1990s, research has focused on defining the oocyst life cycle stage, morphology, and diagnostic methods (Ortega & Sanchez, 2010). However, important characteristics including its complete life cycle, transmission dynamics, and epidemiology have yet to be fully understood. Consequently, a standardized molecular epidemiological tool has not been developed, limiting the ability to respond to outbreaks efficiently and effectively (Cinar et al., 2020). The lack of animal models and culturing techniques in particular has hindered the ability to propagate and fully investigate this parasite (Ortega & Sanchez, 2010).

This literature review will first outline the current biological and epidemiological knowledge of C. cayetanensis, particularly its life cycle, prevalence, vehicles of transmission, signs and symptoms of the disease, and treatment options. Subsequent sections will review the genomics of the parasite and current diagnostic methods, assess how effectively current molecular schemes distinguish outbreak isolates, and identify enhanced methods that will further aid epidemiological case-linkage investigations.

2.0 Biological Characteristics

2.1 Morphology

Under light microscopy, Cyclospora cayetanensis appears as a refractile spheroidal oocyst (Figure 1.2) that measures 8-10 μm in diameter (Ortega et al., 1993). Structurally, these oocysts have a 50 nm cell wall that is surrounded by a 63 nm outer fibrillar coat and contains both a polar body and oocyst residuum (Ortega et al., 1993; Ortega et al., 1994). Each C. cayetanensis oocyst contains two ovoidal sporocysts (4.0 x 6.3 μm) with 62 nm thick walls, and when sporulated, each sporocyst contains two crescent-shaped sporozoites (i.e., four sporozoites total in a sporulated oocyst), measuring 1.2 x 9.0 μm in size (Ortega et al., 1994). Genetically, each sporocyst contains two haploid (1N) sporozoites, which are thought to be genetically identical based on studies with Eimeria (see Shirley & Harvey, 1996). Sporozoites from different sporocysts in a single oocyst can be genetically different (Mzilahowa et al., 2007). Similar to other apicomplexans, each sporozoite has a membrane-bound nucleus along with (Ortega et al., 1993).

Figure 1.2. Unstained Cyclospora cayetanensis oocysts under light microscopy. The unstained, unsporulated oocyst of C. cayetanensis, viewed at 1000X magnification, appears as a spheroidal, refractile body that measures 8-10 μm in diameter.


2.2 Life Cycle

The life cycle of Cyclospora cayetanensis has been definitively described only for the oocyst stage; the endogenous stages have been described solely from biopsy specimens of individual patients with gastrointestinal disease (Sun et al., 1996; Ortega et al., 1997). Recently, Dubey et al. (2020) described the developmental cycle of this parasite in a patient’s gallbladder. Cyclospora cayetanensis normally infects the small intestine, so it is unclear how the parasite reached the gallbladder in this patient, though the sporozoites may have travelled through the bile duct (Almeria et al., 2019). The monoxenous life cycle begins when a susceptible human host ingests food or water contaminated with infectious oocysts. When ingested, the oocyst travels down the , where the sporozoites are released into the intestinal lumen of the duodenum and jejunum through excystation (Sun et al., 1996; Ortega et al., 1997). When the sporozoites enter the enterocytes, they transform into uninucleated zoites and undergo schizogony (Dubey et al., 2020). Dubey et al. (2020) suggested that schizogony in C. cayetanensis resembles more closely than Eimeria spp. as immature schizonts and merozoites were reported in the same host cell, though further confirmation studies are required. Schizogony results in the formation of type I and type II meronts. Type I meronts contain 8-12 small merozoites (3-4 μm long) that remain in the asexual cycle to infect neighbouring epithelial cells, whereas the 4 large merozoites (12-15 μm long) in type II meronts enter other host cells to undergo sexual development and form gametocytes (Ortega et al., 1997; Smith et al., 1997). However, type II meronts have not been reported in other studies (see Dubey et al., 2020). Once gametogony occurs, the flagellated microgametes (male) fertilize the macrogametes (female) to produce a zygote, which then develops into an environmentally resistant, unsporulated oocyst (Smith et al., 1997; Dubey et al., 2020). This oocyst is shed into the environment in fecal matter, where it subsequently sporulates (Shields & Olson, 2003; Ortega & Sanchez, 2010; Dubey et al., 2020).

2.2.1 Sporulation

Unlike most other human-infecting enteric parasites, C. cayetanensis oocysts require prolonged time in the environment to become infectious (Smith et al., 1997; Mansfield & Gajadhar, 2004). Based on the seasonality of globally reported infections, it is believed that successful sporulation depends on several climatic factors, including humidity, temperature, photoperiod, and exposure to air (Strausbaugh & Herwaldt, 2000; Almeria et al., 2019). However, the exact environmental conditions are still poorly understood, as numerous studies have not been able to effectively sporulate isolated oocysts (Ortega et al., 1993; Smith et al., 1997; Sathyanarayanan & Ortega, 2006). Laboratory experiments have shown successful sporulation at 22°C and 30°C when the oocysts were stored in 2.5% potassium dichromate or deionized water for 7 to 14 days, although sporulation rates were low, ranging from 10% to 24% (Ortega et al., 1993; Smith et al., 1997). Sporulation was hindered when oocysts were stored at 4°C for one or two months and then incubated at 30°C for one week, as only 9% and 12% of oocysts successfully sporulated, respectively (Smith et al., 1997). When the effects of temperature on sporulation were tested on dairy and basil products, oocysts sporulated effectively after being stored at optimal produce storage conditions of 4°C and 23°C for up to one week (Sathyanarayanan & Ortega, 2006). However, oocysts stored in water or on basil at -20°C for more than two days or at 37°C for more than four days were unable to sporulate (Sathyanarayanan & Ortega, 2006; Ortega & Liao, 2006). Any exposure to extreme temperatures (-70°C, 70°C, 100°C) for more than 15 minutes also inactivates the oocysts (Sathyanarayanan & Ortega, 2006; Ortega & Liao, 2006). Furthermore, treating water with chlorine, or food with fungicides or insecticides, at the concentrations recommended for other pathogens is not effective, as oocysts remain viable for sporulation (Sathyanarayanan & Ortega, 2004). Despite the sporulation studies described, it is still unclear which conditions are required for C. cayetanensis to sporulate in the environment.

2.3 Infectivity

When a human ingests a sporulated oocyst, they may develop the gastrointestinal disease cyclosporiasis. The time from infection to disease is thought to be about one week (Ortega et al., 1997); however, the mechanisms of intracellular replication and development are still under investigation due to the difficulties of obtaining viable C. cayetanensis oocysts and the lack of animal models. Unsuccessful attempts to infect animals and cells suggest that there is a specific, unknown trigger for infection (Eberhard et al., 2000). Although the exact infectious dose is unknown, it is estimated that as few as 10 to 100 oocysts can cause infection (Dixon et al., 2005; Chacín-Bonilla, 2010).

3.0 Epidemiology

3.1 Geographic Distribution and Prevalence

Human cyclosporiasis is distributed worldwide, with a prevalence of 3.55% and infections documented in 56 countries (Li et al., 2019). This global burden, however, is greatly underestimated due to the current low-sensitivity methods of detection, lack of routine testing, and intermittent shedding (Chacín-Bonilla, 2010; Giangaspero & Gasser, 2019). This enteric disease is mostly endemic in developing countries that have tropical or subtropical climates, including Nepal, Venezuela, China, Mexico, and Peru (Chacín-Bonilla, 2010). In these endemic regions, the young and the old are more likely to develop clinical symptoms after ingesting sporulated oocysts, whereas infections in healthy adults are typically asymptomatic (Tsang et al., 2013). This pattern is indicative of partial immunity (Tsang et al., 2013). In contrast, people of all ages are susceptible to cyclosporiasis in non-endemic regions, including Canada, the United States, the United Kingdom, and Italy (Chacín-Bonilla, 2010).

3.2 Transmission

A vehicle is required for C. cayetanensis to be transmitted, as direct human-to-human transmission is unlikely due to the long incubation period for sporulation (Shields & Olson, 2003). In disease-endemic regions, infections are most commonly traced back to close contact with sporulated oocysts in food, water, or soil, whereas in non-endemic regions cyclosporiasis is often initiated through foodborne outbreaks (Shields & Olson, 2003). However, the modes of transmission are not completely documented, and their relative rates are unknown (Shields & Olson, 2003; Chacín-Bonilla, 2008).

3.2.1 Food Outbreaks

There have been increasing reports of fruits and vegetables contaminated with C. cayetanensis oocysts being imported into non-endemic countries and causing outbreaks (Casillas et al., 2018; Morton et al., 2019; Hadjilouka & Tsaltas, 2020). Compared to other parasites, C. cayetanensis oocysts contain more surface adhesins (Ortega & Shields, 2015) and therefore adhere more efficiently to produce. Foodborne outbreaks are typically associated with produce that is difficult to clean thoroughly and is consumed raw, including herbs and leafy greens (such as basil, cilantro, and lettuce) and berries (Ortega & Sanchez, 2010; Dixon et al., 2016; Whitfield et al., 2017; Hadjilouka & Tsaltas, 2020). Foodborne outbreaks in non-endemic, developed regions are typically left unresolved, or resolved with significant uncertainty about the source of the outbreak (Kozak et al., 2013; Morton et al., 2019; Hadjilouka & Tsaltas, 2020), due to the lack of tools available for linking clinical disease with source attribution. Food sources that have been heated or pasteurized have not been associated with illnesses (Dawson, 2005).

3.2.2 Water Outbreaks

Water contaminated with fecal matter is another significant source of C. cayetanensis oocysts. Oocysts have been detected in both chlorinated and untreated water, and in both endemic (Sherchand et al., 1999; Bhandari et al., 2015) and non-endemic (Kitajima et al., 2014; Giangaspero et al., 2015) areas, suggesting that transmission could occur through recreational water use and drinking. In addition to being resistant to water decontaminants, oocysts can pass through physical barriers in a treatment process (Mansfield & Gajadhar, 2004). Water contaminated with C. cayetanensis oocysts can also be a source of foodborne transmission if it is used for crop irrigation, fertilizer application, or washing and processing foods (Almeria et al., 2019).

3.2.3 Epidemiological Investigations

Currently in Canada, if a locally acquired foodborne outbreak of C. cayetanensis is declared, an epidemiological case-linkage investigation is conducted. Due to the lack of standardized and validated genotyping methods, case linkages are based on the epidemiological data alone. Briefly, the principal step in these investigations involves hypothesis generation by comparing food exposures between reported cases and the general population, and calculating McNemar’s odds ratios for matched pairs and variable analyses (Morton et al., 2019). Recently, a control bank made up of participants who have consented to take part in future outbreak studies has been used as the control group in these investigations to improve response rates (Morton et al., 2019). This analysis is subject to vague and sparse data because of key challenges associated with C. cayetanensis investigations (Table 1.1), ultimately hindering the ability to find the initial source of infection with any certainty. These challenges include difficulties in collecting complete data on food exposures from up to 14 days prior to symptom onset, coinciding with the organism’s incubation period, and difficulties with product traceability, as the shelf life of most produce is shorter than the incubation period (Morton et al., 2019). Furthermore, common vehicles for transmission of this organism are often found in several different types and brands of pre-packaged items (e.g. vegetable trays, fruit trays, salads) (see Hadjilouka & Tsaltas, 2020), in addition to being sold separately by the same supplier or distributor, creating further difficulties in identifying the root source. As a result, epidemiological studies alone are inadequate to fully resolve foodborne outbreaks of C. cayetanensis, and routine laboratory methods (yet to be validated) are required to aid case-linkage investigations.
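The matched-pair odds ratio mentioned above can be illustrated with a short sketch. The Python below is a hypothetical example and not the procedure of Morton et al. (2019): the function name and the pair data are invented for illustration, and the chi-square statistic is the uncorrected McNemar form. Only discordant pairs contribute to the estimate: b counts pairs in which the case was exposed and the matched control was not, and c counts the reverse.

```python
def mcnemar_matched_pairs(pairs):
    """Summarize matched case-control pairs for one food exposure.

    pairs: iterable of (case_exposed, control_exposed) booleans.
    Returns (b, c, odds_ratio, chi_square); odds_ratio is None when c == 0.
    """
    pairs = list(pairs)
    # Discordant pairs: exactly one member of the pair was exposed.
    b = sum(1 for case, control in pairs if case and not control)
    c = sum(1 for case, control in pairs if control and not case)

    odds_ratio = b / c if c else None
    # Uncorrected McNemar statistic (1 degree of freedom).
    chi_square = (b - c) ** 2 / (b + c) if (b + c) else 0.0
    return b, c, odds_ratio, chi_square


# Invented exposure data for a single food item (illustration only).
example_pairs = (
    [(True, False)] * 12     # case ate the item, matched control did not
    + [(False, True)] * 3    # control ate the item, case did not
    + [(True, True)] * 5     # concordant pairs do not affect the estimate
    + [(False, False)] * 10
)
b, c, odds_ratio, chi_square = mcnemar_matched_pairs(example_pairs)
```

Because the estimate rests entirely on the discordant pairs, incomplete food histories of the kind described above reduce the usable data quickly.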

Table 1.1. Features of Cyclospora cayetanensis that cause challenges in epidemiological case-linkage studies

Cyclospora cayetanensis Characteristic | Challenge for Epidemiological Tracking
Low infective dose | Easily acquired; difficulties in detection on contaminated vehicle
Marked seasonality | Oocysts survive in environment for long periods of time
Long incubation period (avg. 7 days) | Difficulty identifying potential source, especially foodborne outbreaks (produce has short shelf life); difficulty for patient to recall specific foods that were ingested
Environmental transmission vehicles: food, water, soil | Geographically dispersed infections; single imported produce may lack one source of exposure
Disease symptoms similar to other enteric pathogens | Prevalence often underdiagnosed


3.3 Risk Factors and Symptoms

When infected with C. cayetanensis, patients exhibit symptoms similar to those of other enteric pathogens, including watery diarrhea, loss of appetite, abdominal cramping, bloating, fatigue, and low-grade fever (Shields & Olson, 2003). For immunocompromised patients, this disease can be severe because prolonged or chronic diarrhea causes dehydration (Shields & Olson, 2003). For healthy older children and adults in endemic regions, infections following ingestion of sporulated oocysts are often asymptomatic, or the symptoms are mild and self-limiting (Chacín-Bonilla, 2010). Young children and the elderly often experience more severe symptoms (Chacín-Bonilla, 2010). Repeated infections are often milder, indicating there is an adaptive immunity component to this disease (Ortega & Sanchez, 2010; Tsang et al., 2013).

3.4 Treatment

Cyclosporiasis is most effectively treated with the antibiotic trimethoprim-sulfamethoxazole (TMP-SMX) (Pape et al., 1994; Li et al., 2020). This drug works by ultimately blocking the synthesis of folic acid (Smilack, 1999). If the patient is allergic to sulfonamides, ciprofloxacin or can be administered instead to treat the diarrheal disease, however this treatment is less effective (Verdier et al., 2000; Li et al., 2020). Typically these treatments are administered if the illness lasts for more than a few days or if the patient is suffering from dehydration or is immunocompromised (Li et al., 2020).

4.0 Detection

4.1 Clinical Detection Methods

To detect >95% of C. cayetanensis infections, three stool samples should be collected for examination within a 10-day period because intestinal parasites typically do not appear in stool in consistent numbers on consecutive days (Garcia et al., 2017). For samples with low numbers of oocysts, concentration methods (such as sucrose flotation or formalin-ether sedimentation) are recommended. Clinical samples are examined using one of two approaches: microscopy or molecular techniques.
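One simple way to see why repeated sampling helps is to treat each stool examination as an independent test with the same per-sample detection probability p. That independence assumption, the function names, and the example values below are illustrative only and do not come from Garcia et al. (2017). Under this model, the chance of at least one positive result in n samples is 1 - (1 - p)^n.

```python
def cumulative_detection(p_single, n_samples):
    """Probability of at least one positive among n independent samples."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return 1.0 - (1.0 - p_single) ** n_samples


def required_single_sensitivity(target, n_samples):
    """Per-sample detection probability needed to reach a cumulative target."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    return 1.0 - (1.0 - target) ** (1.0 / n_samples)


# Hypothetical per-sample detection probability of 50%.
three_sample_detection = cumulative_detection(0.5, 3)

# Per-sample probability needed for three samples to reach 95% overall.
needed_per_sample = required_single_sensitivity(0.95, 3)
```

Under these assumptions, three samples reach 95% cumulative detection only if each examination detects the infection roughly 63% of the time; intermittent shedding means real samples are unlikely to be strictly independent.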


4.1.1 Microscopy

Cyclospora cayetanensis oocysts can be observed directly in wet smears under bright-field microscopy with differential interference contrast (DIC), where they appear as small, spheroidal, refractile bodies (Garcia et al., 2017). In addition, there are two staining techniques for C. cayetanensis detection: the modified acid-fast stain and the safranin stain. In a modified acid-fast stain, the light pink to dark purple oocysts stand out against the blue-green background. However, this method requires special training and a keen eye, as oocysts may not retain the stain and may not appear perfectly circular, creating confusion with other artifacts present (Visvesvera, 1997). Safranin stains the oocysts more uniformly, with a 98% success rate, but requires the smear and stain to be heated through microwave treatment (Visvesvera, 1997). Alternatively, oocysts can be detected through fluorescence. Cyclospora cayetanensis oocysts autofluoresce blue when exposed to 330-365 nm light or green at 450-490 nm excitation (Berlin et al., 1998), allowing them to be readily identified. Although more sensitive and specific than staining methods, this method of detection requires expensive equipment, including special filters in a fluorescence microscope, which is unavailable to some diagnostic labs.

4.1.2 Molecular

Molecular detection is a significantly more sensitive method for detecting the presence of C. cayetanensis than the traditional microscopy methods (Mundaca et al., 2008; Shields et al., 2012; Murphy et al., 2017; Qvarnstrom et al., 2018). Clinically, there exist only a few approved, commercially available diagnostic panels with the ability to detect C. cayetanensis (Table 1.2). In an evaluation of the BioFire® FilmArray® Gastrointestinal (GI) Panel (BioFire, 2020), a panel that can detect C. cayetanensis, Buss et al. (2015) confirmed 100% specificity and sensitivity for C. cayetanensis detection, and Hitchcock et al. (2019) found 97% reproducibility in clinical settings. It is important to note that not all diagnostic laboratories have access to a panel that can detect C. cayetanensis; testing at some labs may therefore be conducted only through microscopy.


4.2 Detection in Produce

In fresh contaminated produce, it is not uncommon to have low numbers of oocysts, therefore requiring an effective collection method. Several different washing solutions have been used to obtain the oocysts from the food matrices (Shields et al., 2012; Chandra et al., 2014), with 0.1% Alconox, a commercially available detergent (Alconox, White Plains, NY), being the most effective for washing the oocysts off the produce (Shields et al., 2012; Murphy et al., 2018). Another method to isolate C. cayetanensis oocysts from fruits and vegetables is through lectin-coated paramagnetic beads (Robertson et al., 2000). Antibodies are a promising technique for isolating oocysts from produce, similar to (Bukhari et al., 1998). However, commercial antibodies are not yet available for C. cayetanensis (Almeria et al., 2019). Both quantitative PCR (qPCR) and multiplex qPCR techniques have been developed to estimate the quantity of oocysts within the collected solution (Steele et al., 2003; Lalonde & Gajadhar, 2011, 2016; Murphy et al., 2017; Temesgen et al., 2019). The singleplex qPCR assays target the small subunit (SSU) rRNA gene fragment (Murphy et al., 2017) or the nuclear rDNA internal transcribed spacer 1 region (Temesgen et al., 2019). These molecular methods are sensitive, with a limit of detection of five C. cayetanensis oocysts (Murphy et al., 2017) or one oocyst (Temesgen et al., 2019), respectively, despite the food matrices containing PCR inhibitors, such as polysaccharides. The multiplex qPCR reactions allow for four protozoan parasites (C. cayetanensis, Cryptosporidium, T. gondii, and Giardia) to be detected simultaneously in a rapid manner (Shapiro et al., 2019). These methods are only able to estimate the quantity of C. cayetanensis present, with no detail of the population structure for source linkage.


Table 1.2. Commercial assays for detecting protozoan parasites

                                                   Parasites Detected
Multiplex qPCR Panel                               CC  BH  DF  EH  GD  C    Reference
Allplex™ GI-Parasite Assay                         ✓   ✓   ✓   ✓   ✓   ✓    Seegene, Seoul, South Korea
BD MAX™ Enteric Parasite Panel                     -   -   -   ✓   ✓   ✓    BD, NJ, USA
BioFire® FilmArray® Gastrointestinal (GI) Panel    ✓   -   -   ✓   ✓   ✓    BioFire Diagnostics, UT, USA
Easyscreen enteric protozoa                        -   ✓   ✓   ✓   ✓   ✓    Genetic Signatures, New South Wales
FTD Stool parasites                                -   -   -   ✓   ✓   ✓    Fast Track Diagnostics, Luxembourg, Europe
G-DiaPara™                                         -   -   -   ✓   ✓   ✓    Diagenode Diagnostics, Seraing, Belgium
QIAstat-Dx                                         ✓   -   -   ✓   ✓   ✓    Qiagen, Hilden, Germany
RIDA®GENE Parasitic Stool Panel                    -   -   ✓   ✓   ✓   ✓    R-Biopharm, Darmstadt, Germany
xTAG® Gastrointestinal Pathogen Panel              -   -   -   ✓   ✓   ✓    Luminex, TX, USA

Parasites detected are: CC=Cyclospora cayetanensis, BH=Blastocystis hominis, DF=Dientamoeba fragilis, EH=Entamoeba histolytica, GD=Giardia duodenalis, C=Cryptosporidium. ✓ = detected; - = not detected.


5.0 Molecular Characteristics

Although diagnostic molecular methods are able to detect the presence of C. cayetanensis oocysts, it remains difficult to perform case cluster investigations. Three primary difficulties impede the advancement of molecular epidemiological tools: (1) few clinical samples are available from cyclosporiasis cases because cases are under-diagnosed, with symptoms identical to those of other more common and well-known enteric diseases; (2) limited parasite material is typically available from each case because a standard fecal sample could contain fewer than 105 parasites due to low and intermittent shedding (Nascimento et al., 2016); and (3) fecal samples are often stored in a SAF (sodium acetate/acetic acid/formalin) solution, a fixative that adheres to DNA and negatively affects downstream molecular analysis. Recent genetic studies of this organism have attempted to develop a standardized molecular subtyping tool (Guo et al., 2016; Nascimento et al., 2019, 2020; Barratt et al., 2019; Cinar et al., 2020) to supplement the epidemiological case-linkage investigations previously described. This section focuses on the molecular structure of C. cayetanensis, based on recent whole genome sequencing developments, and reviews these recent subtyping attempts.

5.1 Molecular Structure

The genome of C. cayetanensis has been successfully sequenced using short-read sequencing technologies (Liu et al., 2016; Qvarnstrom et al., 2018). To date, there are 37 whole genome sequences available for Cyclospora cayetanensis on NCBI (National Center for Biotechnology Information), representing seven different countries, with approximately half belonging to isolates from the United States (see Figure 1.3). Cyclospora cayetanensis contains the three genomes (i.e. nuclear, mitochondrial, and apicoplast) typically possessed by members of the phylum Apicomplexa.


Figure 1.3. Geographic origins of Cyclospora cayetanensis isolates from which genomes have been assembled. Of the 37 currently available C. cayetanensis genomes on NCBI GenBank, only seven different countries are represented, with approximately half originating from the United States.

5.1.1 Nuclear Genome

The size of nuclear genomes across the Apicomplexa has been reduced compared to free-living , partially due to lineage-specific gene loss (Woo et al., 2015). The coccidian lineage contains the largest nuclear genomes within the phylum Apicomplexa, with most species measuring over 50 Mbp in size, and the highest reported number of genes (Kissinger et al., 2003; Cai et al., 2003). The nuclear genome of C. cayetanensis is estimated to be 44 Mbp long with a GC content of 52%. The current reference genome, CcayRef3, consists of 738 contigs with an N50 of 192,560 (NCBI Genome). When the completeness of this reference genome was examined using BUSCO (v4.0.2) with the Apicomplexa lineage dataset (Seppey et al., 2019), 432 (96.9%) of the single-copy orthologous genes were complete, 10 (2.2%) were fragmented, and 4 (0.9%) were missing. There are ~7500 genes encoded in this genome, with unique characteristics compared to other apicomplexan parasites (Liu et al., 2016). For instance, C. cayetanensis has unique surface antigens and different metabolism genes and post-translational modifications to proteins (Liu et al., 2016). Studying the nuclear genome of this organism is difficult due to its large size and the high degree of variation attributable to genetic recombination during the sexual phase, including high genome heterozygosity when analyzing nuclear gene targets of sub-populations (Guo et al., 2016; Cinar et al., 2020). Furthermore, no two nuclear genomes will be completely identical due to the sexual re-assortment of the genome (Barratt et al., 2019).
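The assembly statistics quoted above (N50 and BUSCO completeness) can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the tooling used to generate the reported values: the function names and the toy contig lengths are hypothetical, while the BUSCO counts (432 complete, 10 fragmented, 4 missing) are those reported for the reference genome.

```python
def n50(contig_lengths):
    """Return the contig length at which the running total of lengths,
    taken from longest to shortest, first reaches half the assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


def busco_percentages(complete, fragmented, missing):
    """Express BUSCO category counts as percentages of all genes searched."""
    total = complete + fragmented + missing
    return {
        "complete": round(100 * complete / total, 1),
        "fragmented": round(100 * fragmented / total, 1),
        "missing": round(100 * missing / total, 1),
    }


# Toy contig lengths (hypothetical, not taken from any real assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# BUSCO counts reported in the text for the reference genome.
reference_busco = busco_percentages(complete=432, fragmented=10, missing=4)

print(toy_n50)
print(reference_busco)
```

In the toy example the total length is 300, so the two longest contigs (80 + 70 = 150) are enough to reach half of the assembly, and the N50 is the length of the second contig.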

5.1.2 Mitochondrial Genome

The mitochondrial genome is useful for phylogenetic analysis because it is inherited maternally with limited genetic recombination. This genome ranges in size across the Apicomplexa from 6 to 11 kbp and, in many groups, displays well-conserved genetic content. Almost all apicomplexan mitochondrial genomes encode three highly conserved protein-coding genes: cytochrome c oxidase subunits I and III (COI and COIII, respectively) and cytochrome b (CytB), alongside highly fragmented small subunit (SSU) and large subunit (LSU) rDNA (Kairo et al., 1993; Brayton et al., 2007; Hikosaka et al., 2011; Lin et al., 2011; Ogedengbe et al., 2014; Schreeg et al., 2016). Similar to other coccidia, the mitochondrial genome of C. cayetanensis is a concatemeric linear molecule, arranged in head-tail configuration, measuring 6.3 kbp in length (Figure 1.4) (Cinar et al., 2015; Ogedengbe et al., 2015; Cinar et al., 2020). Based on the relative proportion of sequence reads mapped to each genome, Tang et al. (2015) estimated a higher than normal mitochondrial copy number compared to other apicomplexan parasites, with an approximation of 513 copies (Nascimento et al., 2019). This genome has a GC content of 33% and contains 35 genes: three protein-coding genes (COI, COIII, CytB) and 18 large subunit and 14 small subunit fragmented rRNA genes (Cinar et al., 2015; Ogedengbe et al., 2015). Comparative analysis shows that this genome contains both single nucleotide polymorphisms (SNPs) and insertions and deletions (InDels) throughout, along with a variable region in the junction area (Cinar et al., 2015; Gopinath et al., 2018; Cinar et al., 2020).


Figure 1.4. Mitochondrial genome organization of Cyclospora cayetanensis (KP658101.1). The three protein-coding genes, 18 large subunits, 14 small subunits, and 3 unassigned fragmented rRNAs are represented in yellow, blue, red, and purple, respectively.

5.1.3 Apicoplast Genome

The single-copy, circular apicoplast genome has been under evolutionary selective pressure and is highly reduced in genome size and gene content compared to that of the red algae from which it was derived (Janouškovec et al., 2010). For comparison, a red algal plastid measures at least 150 kbp in size with about 250 genes (de Vries & Archibald, 2018), whereas the plastid in apicomplexans measures about 35 kbp and contains 54 to 70 genes (Wang et al., 2019). This reductive nature is not entirely related to their loss of photosynthetic capabilities; the closely related chromerid protist, , also displays a reductive nature in the apicoplast genome (Salomaki & Kolisko, 2019). The gene content is highly conserved among the available apicoplast genomes (Wilson et al., 1996; Cai et al., 2003; Gardner et al., 2005; Brayton et al., 2007; Seeber et al., 2014), with the majority of the encoded genes responsible for genome maintenance, including ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs). The apicoplast structure of coccidia is more similar to that of haemosporinids than to that of piroplasms (Wilson et al., 1996; Cai et al., 2003; Seeber et al., 2014). The apicoplast genome of C. cayetanensis measures ~34 kbp in length with a GC content of 22% (Tang et al., 2015; Cinar et al., 2016). This genome encodes 66 genes, including 29 protein-coding genes, four rRNA genes, and 33 tRNA genes that cover all 20 amino acids (Cinar et al., 2016). The apicoplast genome of C. cayetanensis shares complete gene synteny with the closely related coccidium, Eimeria tenella (Tang et al., 2015). Structurally, the apicoplast genome contains an inverted repeat region that contains duplicated copies of select tRNA genes, in addition to the large (LSU) and small (SSU) subunit ribosomal DNA genes (Figure 1.5) (Cinar et al., 2016). Genes are encoded bidirectionally. This highly conserved genome exhibits 25 SNPs amongst 11 different genomes spanning several geographical locations, along with a unique repeat insertion sequence in one of the Nepalese samples (Cinar et al., 2016).

5.2 Gene Targets and Schemes

An ideal genotyping method is able to type each specimen and discriminate it from others, has high reproducibility, and agrees with epidemiological data (van Belkum et al., 2007). There have been many attempts to develop a subtyping scheme that can discriminate the subpopulations of C. cayetanensis isolates during an outbreak (Table 1.3), with the ultimate aim of identifying the causative source and resolving the outbreak in a timely manner to reduce the number of infections. Guo et al. (2016) developed the first multilocus subtyping tool (MLST) consisting of five nuclear microsatellite regions. When evaluated on clinical specimens, only 53-59% of samples were successfully subtyped at all five loci (Guo et al., 2016; Li et al., 2017), indicating low typeability. The major pitfall of this scheme was unreadable Sanger sequences due to gene heterozygosity or the presence of mixed populations (Guo et al., 2016; Li et al., 2017; Hofstetter et al., 2019). Houghton et al. (2020) identified four candidate nuclear markers for genotyping C. cayetanensis: CDC-1 (ATP synthase), CDC-2 (U3 small nucleolar RNA-associated protein 11), CDC-3 (uncharacterized), and CDC-4 (ATP-dependent RNA helicase rrp3). These markers were subjected to conventional PCR and Sanger sequencing for evaluation. Of the 93 specimens collected in this study, 57 (61.3%) had complete typing information for all four markers, covering 13 SNPs and resolving 19 different genotypes (Houghton et al., 2020). In this study, a presence-absence table was generated to analyze the relationship between specimens with complete typing information (Houghton et al., 2020). These markers were selected as genomic evidence indicated they were single-copy; however, mixed peaks in the chromatograms were observed


Figure 1.5. Apicoplast genome organization of Cyclospora cayetanensis (CHN_HEN01, NC_028632.1). The ~34 Kbp plastid encodes for 66 genes and contains an inverted repeat at the bottom of the circle depicted by the red and dark blue boxes. The schematic was created with OGDRAWv1.3.1 (Greiner et al., 2019).


Table 1.3. Molecular methods developed to discriminate Cyclospora cayetanensis isolates

Method: MLST¹ (nested PCRs)
References: Guo et al., 2016; Li et al., 2017
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  CYC3 (nuclear)                  4 bp repeat (x2)               83% (53/64); 64% (49/76)
  CYC13 (nuclear)                 3 bp repeat                    88% (56/64); 78% (59/76)
  CYC15 (nuclear)                 3 bp repeat                    97% (62/64); 75% (57/76)
  CYC21 (nuclear)                 2 bp repeat                    91% (58/64); 68% (52/76)
  CYC22 (nuclear)                 2 bp repeat                    83% (53/64); 74% (56/76)
  Complete Typing                                                53% (34/64); 59% (45/75)
Comments:
  ● Low typeability
  ● Adequate discriminatory power
  ● Epidemiological concordance not addressed

Method: MLST¹ (conventional PCR)
References: Hofstetter et al., 2019
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  CYC21 (nuclear)                 2 bp repeat                    96% (24/25)
  CYC22 (nuclear)                 2 bp repeat                    92% (23/25)
  Complete Typing                                                84% (21/25)
Comments:
  ● Full concordance with 2/9 epidemiological clusters

Method: Conventional PCR
References: Nascimento et al., 2019
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  Mitochondrial Linking Region    15 bp repeat, end motif SNPs   99% (133/134)
Comments:
  ● 7/10 outbreaks/temporospatial clusters had 1 type identified
  ● 3/10 outbreaks/temporospatial clusters had ≥2 types identified

Method: Quantitative PCR with Melt Curve
References: Guo et al., 2019
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  Mitochondrial Linking Region    7 bp MNVs, SNVs, 15 bp repeat  100% (36/36)
Comments:
  ● Significant geographical segregation
  ● Typing resolution and source tracking not assessed

Method: MLST¹ (conventional PCR)
References: Barratt et al., 2019
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  MSR (mito)                      4 SNPs                         92% (81/88)
  HC360i2 (nuclear)               20 SNPs                        99% (87/88)
  HC378 (nuclear)                 15 SNPs                        91% (80/88)
  Complete Typing                                                84% (74/88)
Comments:
  ● Comparing genetic links to 8 epidemiologically-linked clusters:
    ○ 4/8 had full concordance
    ○ 2/8 had partial concordance
    ○ 2/8 had no concordance

Method: Whole Genome Sequence
References: Cinar et al., 2020
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  Mitochondrial Genome            12 SNPs, hypervariable repeat  100% (24/24)
Comments:
  ● Significant geographical segregation
  ● Concordance with epidemiological clusters not fully assessed

Method: Conventional PCR (cPCR)
References: Houghton et al., 2020
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  CDC-1 (nuclear)                 7 SNPs                         61% (57/93)
  CDC-2 (nuclear)                 1 SNP                          77% (71/93)
  CDC-3 (nuclear)                 2 SNPs                         75% (70/93)
  CDC-4 (nuclear)                 3 SNPs                         74% (69/93)
  Complete Typing                                                61% (57/93)
Comments:
  ● Concordance with epidemiological clusters not fully assessed

Method: Targeted Amplicon NGS² Assay
References: Nascimento et al., 2020
  Gene Target (genome)            Genetic Variation              Subtyping Success Rate
  CDC-1 (nuclear)                 7 SNPs                         56.8% (378/666)³
  CDC-2 (nuclear)                 1 SNP                          83.5% (556/666)³
  CDC-3 (nuclear)                 2 SNPs                         83.0% (553/666)³
  CDC-4 (nuclear)                 3 SNPs                         53.2% (354/666)³
  HC360i2 (nuclear)               20 SNPs                        96.1% (640/666)³
  HC378 (nuclear)                 15 SNPs                        93.8% (625/666)³
  MSR (mitochondrial)             4 SNPs                         97.9% (652/666)³
  MTJ (mitochondrial)             15 bp repeat, end motif SNPs   86.2% (574/666)³
  Successful Genotyping                                          70% (648/927)
Comments:
  ● 94% concordance with epidemiological clusters
  ● Analytical sensitivity and specificity of 93.8% and 99.7%, respectively

¹MultiLocus Sequencing Tool (MLST)
²Next Generation Sequencing (NGS)
³Individual marker performance given for 648 US 2018 outbreak + 18 validation samples that successfully clustered


indicating mixed genotypes are common in C. cayetanensis infections (Nascimento et al., 2020). The Houghton et al. (2020) study did not compare clusters generated using this method with clusters defined by epidemiological data, so the effectiveness of these markers has not been fully assessed. Furthermore, Barratt et al. (2019) suggested that, because C. cayetanensis infections are genetically heterogeneous, the analysis of the complex haplotype compositions must not be limited to phylogenetic analysis alone, as previously performed with the original MLST scheme. Instead, Barratt et al. (2019) developed an ensemble that can link C. cayetanensis isolates to epidemiological case clusters using similarity-based algorithms when analyzing a set of haplotypes. To assess its efficacy, three haplotype markers, each containing SNPs for typing, were identified and sequenced. These markers include the mitochondrial rRNA (MSR), a nuclear Sec14 family protein (378) and an unidentified nuclear marker (360i2) (Barratt et al., 2019). With >90% amplification success, four of the eight epidemiological outbreak clusters were in agreement with the ensemble-assigned links (Barratt et al., 2019). However, the other four were not fully in agreement as the three markers did not represent the complete picture of the C. cayetanensis genome (Barratt et al., 2019). As a result, this method could be useful for case-linkage investigations with epidemiological data, but more informative haplotype markers are required. In addition to studies focusing on gene targets in the nuclear genome, a promising single gene marker has been developed for discriminating C. cayetanensis subpopulations. Nascimento et al. (2019) developed a scheme targeting the short mitochondrial junction region, a variable region that links mitochondrial genome copies.
This marker had a 99% amplification success rate, distinguishing 14 subtypes in 132 samples across the United States collected from 2013-2016 (Nascimento et al., 2019). Of those results, seven of ten outbreak clusters had identical subtypes, while three displayed two or more subtypes in the cluster (Nascimento et al., 2019). Guo et al. (2019) modified this idea by creating a quantitative PCR (qPCR) with a melt curve analysis of the same gene target to rapidly assess genetic heterogeneity across isolates. This region shows promise for discriminating isolates, whether through conventional PCR or qPCR, although future studies are required. Cinar et al. (2020) expanded on these studies by creating a protocol to sequence the entire mitochondrial genome and analyze the SNP and insertion and deletion (InDel) profiles, in addition to the linking region, to improve the discriminatory power. Markers within the mitochondrial genome are promising candidates for molecular subtyping schemes due to the genome's high copy number and lack of genetic recombination (Cinar et al., 2020). To utilize the various schemes developed for C. cayetanensis, Nascimento et al. (2020) designed a targeted amplicon next-generation sequencing assay that combines the eight markers collectively described by Houghton et al. (2020), Barratt et al. (2019), and Nascimento et al. (2019) and clusters the data using the ensemble-based distance statistic as described by Barratt et al. (2019). This MLST was tested on 2018 US outbreak samples collected by the CDC, and clustering results showed 94% concordance with epidemiological information, along with 93.8% and 99.7% analytical sensitivity and specificity, respectively (Nascimento et al., 2020). The sequencing success rate for a sample to be included in the ensemble analysis was 70% (Nascimento et al., 2020). This targeted amplicon approach shows promise for supporting epidemiological investigations, though improvements are required, as the approach has difficulty clustering small numbers of specimens from the same outbreak and achieving high sequencing success rates. Further research is required to identify additional haplotype markers in the nuclear genome with high sequencing success rates and discriminatory power to gain further insight into the subpopulation structure of C. cayetanensis infections.
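The per-marker success rates listed in Table 1.3 for the Nascimento et al. (2020) assay follow directly from the reported counts. The sketch below simply recomputes them to show which markers limit complete typing; the dictionary and function names are illustrative assumptions, and only the counts come from the table.

```python
# Specimens successfully sequenced at each marker, as listed in Table 1.3
# for the Nascimento et al. (2020) assay (out of 666 specimens).
MARKER_SUCCESSES = {
    "CDC-1": 378,
    "CDC-2": 556,
    "CDC-3": 553,
    "CDC-4": 354,
    "HC360i2": 640,
    "HC378": 625,
    "MSR": 652,
    "MTJ": 574,
}
SPECIMENS_PER_MARKER = 666


def success_rate(successes, total):
    """Percentage of specimens typed, rounded to one decimal place."""
    return round(100 * successes / total, 1)


rates = {
    marker: success_rate(count, SPECIMENS_PER_MARKER)
    for marker, count in MARKER_SUCCESSES.items()
}

# Overall genotyping success reported in the table: 648 of 927 specimens.
overall = success_rate(648, 927)

# List markers from weakest to strongest performer.
for marker, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{marker}: {rate}%")
print(f"Overall: {overall}%")
```

Sorting the rates this way makes it apparent that two of the nuclear markers (CDC-1 and CDC-4) succeed far less often than the mitochondrial MSR marker, which is consistent with the call above for additional nuclear markers with higher sequencing success rates.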

6.0 Conclusion

Existing epidemiological outbreak investigations often fail to definitively identify outbreak sources. Molecular methods show promise in addressing this shortcoming. However, further research is required to supplement currently available molecular information. Both whole genome sequences and targeted genomic sequences, from various C. cayetanensis isolates, are required to better focus on which areas of the genome have high discriminatory power to cluster isolates from identical sources of infection. Addressing this gap will allow for a genotyping method to be validated to complement epidemiological investigations to resolve cyclosporiasis outbreaks.


STUDY OBJECTIVES AND RATIONALE

The goal of this study was to enhance the genetic knowledge of Cyclospora cayetanensis by examining Canadian specimens through whole genome sequencing and targeted amplicon approaches. This is the first study to sequence clinical specimens that tested positive for C. cayetanensis where the infections were acquired in Canada.

Objective #1: Evaluate the current ensemble-based subtyping tool developed by Nascimento et al. (2020) on both Canadian outbreak and travel-related cases.

Rationale: Currently, when a national C. cayetanensis outbreak is declared in Canada, investigations to determine the source of infection are conducted using epidemiological data alone. However, due to inherent challenges associated with epidemiological investigations on this organism, these investigations are often unsuccessful at identifying the outbreak source. Therefore, routine molecular laboratory tools are required to complement epidemiological data (Morton et al., 2019). The recently developed ensemble-based subtyping scheme described by Nascimento et al. (2020) is the first subtyping tool for C. cayetanensis to have sufficient agreement (94%) with epidemiological data. Thus, this scheme will be evaluated on Canadian C. cayetanensis positive specimens collected from 2010-2020 to retrospectively determine whether the markers included are able to discriminate isolates to resolve outbreaks. This evaluation will determine which markers perform better than others and whether the ensemble-based approach for clustering molecular data coincides with the data obtained from epidemiological investigations.

Objective #2: Obtain whole genome sequences of Canadian Cyclospora cayetanensis isolates using short-read and long-read sequencing data.

Rationale: To identify potential markers that discriminate C. cayetanensis isolates, a thorough comparative genomics analysis is required. However, due to the difficulty of obtaining sufficient high-quality genetic material from specimens, amongst other inherent challenges with this organism, only 37 draft genome assemblies are available to date (NCBI Genome), none of which are from Canadian isolates. Of those genomes, all were assembled using short-read sequencing technologies. To improve the geographical diversity available for comparative genomics studies, Canadian stool specimens containing sufficient C. cayetanensis oocysts will be collected, purified, and sequenced on the Illumina MiSeq (Illumina, CA, USA) to generate whole genome sequences. Furthermore, a subset of samples will be subjected to sequencing on the Oxford Nanopore MinION (Oxford Nanopore Technologies, Oxford, UK). These short- and long-read sequences will be combined to complete a hybrid assembly of the genome. This hybrid assembly will significantly improve the quality of currently available assemblies.


CHAPTER 2: EVALUATION OF THE ENSEMBLE-BASED MLST SCHEME

2.1 Abstract

Cyclospora cayetanensis is an emerging food and waterborne parasite that causes the enteric disease cyclosporiasis in humans. There has recently been an increase in the frequency of multi-jurisdictional Canadian cyclosporiasis outbreaks, with an outbreak reported every spring or summer since 2013. Traditionally, C. cayetanensis outbreak investigations have relied solely on epidemiological data; however, gene-level information could greatly enhance case clustering and source identification. Recent studies from the U.S. Centers for Disease Control and Prevention (CDC) have described a new tool to genetically cluster cyclosporiasis cases by combining a next-generation, targeted amplicon sequencing scheme that consists of eight markers (CDC1, CDC2, CDC3, CDC4, HC378, HC360i2, Mt-Junction, MSR) with an ensemble-based distance statistic that accounts for the genetic complexities associated with C. cayetanensis. This study was undertaken to evaluate the performance of each marker included in this genotyping scheme on 160 unpreserved, clinical fecal specimens from Canadian cyclosporiasis cases collected from 2010 to 2020. Genotyping data were generated for at least one marker for 96.2% of specimens, and 36.9% of specimens had genotyping data for all eight markers. Despite the majority of specimens not having complete genotyping data, 79.4% of specimens clustered with one another using the Eukaryotyping tool developed by the CDC, and 18 genetic clusters were identified. This is the first study to genotype Canadian cyclosporiasis cases, and the performance of the scheme suggests it will be a valuable tool for supplementing epidemiological investigations to prevent further infections.

2.2 Introduction

Cyclospora cayetanensis is the causative agent of the food and waterborne illness cyclosporiasis. As with diseases caused by other enteric pathogens, this gastrointestinal disease is characterized by watery diarrhea, abdominal cramping, and weight loss (Ortega et al., 1993). Beginning in 2000, cyclosporiasis became a nationally notifiable disease in Canada and, since 2013, national outbreaks of locally acquired cases have been declared yearly during the spring or summer months (Morton et al., 2019). This recent increase in outbreaks is partly attributed to the development of molecular diagnostic tools with increased sensitivity, in addition to increased global food distribution and fresh produce consumption within the general public (Hadjilouka & Tsaltas, 2020). Currently, there is no validated laboratory subtyping method for outbreak investigations, leaving epidemiological case-linkage data as the only method to identify the initial source of infection and resolve the outbreak. Traceability analysis remains a challenge as the long incubation period of C. cayetanensis (2-14 days) limits the ability to collect complete food exposure data, the short shelf-life of produce hinders the ability to test for initial sources of infection, and low diagnostic rates cause sparse data to be obtained. Consequently, outbreak investigations are either resolved with significant uncertainty or the source of infection remains unidentified altogether (Hadjilouka & Tsaltas, 2020). There have been several attempts to develop markers that successfully genotype C. cayetanensis for outbreak resolution (Guo et al., 2016; Li et al., 2017; Hofstetter et al., 2019; Nascimento et al., 2019; Barratt et al., 2019; Houghton et al., 2020; Nascimento et al., 2020). The first multilocus subtyping tool, developed by Guo et al. (2016), involved nested PCRs for five nuclear markers that contained microsatellite regions. After evaluation, this tool was deemed unsuccessful for genotyping C. cayetanensis infections due to low sequencing success rates and uninterpretable data caused by heterozygous peaks within the Sanger sequencing data (Guo et al., 2016; Hofstetter et al., 2019). With the aid of recently available draft genomes for C. cayetanensis (Qvarnstrom et al., 2015; Liu et al., 2016), alternative markers were identified in either the mitochondrial or nuclear genomes (Nascimento et al., 2019; Barratt et al., 2019; Houghton et al., 2020; Nascimento et al., 2020).
Of these, Nascimento et al. (2020) described a targeted, short amplicon deep sequencing multilocus sequencing tool (MLST) for genotyping C. cayetanensis infections. This scheme combines the eight most recently described markers (Nascimento et al., 2019; Barratt et al., 2019; Houghton et al., 2020), consisting of six nuclear (CDC1, CDC2, CDC3, CDC4, HC378, HC360i2) and two mitochondrial markers (Mt-Junction, MSR). To analyze the haplotype data collected from the eight-marker MLST described, Nascimento et al. (2020) integrated an ensemble-based method that uses both Bayesian and heuristic similarity-based components, described by Barratt et al. (2019), to calculate the relatedness between C. cayetanensis specimens. This method outperforms simpler statistical models, such as Bray-Curtis dissimilarity or Jaccard distances, as it accounts for partial data arising from the difficulty of successfully amplifying all markers of interest, the presence of heterogeneous subtypes in a particular marker, differences in loci entropy, and the mode of inheritance for each marker (Barratt et al., 2019). Overall, Nascimento et al. (2020) reported encouraging results for this genotyping scheme to supplement epidemiological case-linkage investigations, as genetic clusters were 94% concordant with epidemiologic clusters using 2018 cyclosporiasis specimens collected in the USA. This study evaluates the targeted short-read amplicon deep sequencing assay described by Nascimento et al. (2020) on Canadian cyclosporiasis cases that were collected between 2010 and 2020. To date, molecular genotyping methods have yet to be applied to Canadian C. cayetanensis specimens. By performing this genotyping method, we will gain insights into the genotypes that are found within Canada and determine whether the scheme will provide the necessary tools to complement epidemiological analyses to aid in outbreak mitigation.
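For orientation, the two simpler measures named above can be written in a few lines. This is a minimal sketch of the Jaccard distance and Bray-Curtis dissimilarity only, not of the ensemble method of Barratt et al. (2019); the haplotype labels and read counts are hypothetical.

```python
def jaccard_distance(haplotypes_a, haplotypes_b):
    """1 - |A intersect B| / |A union B| for two sets of observed haplotypes."""
    a, b = set(haplotypes_a), set(haplotypes_b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def bray_curtis(counts_a, counts_b):
    """Sum of absolute count differences divided by the sum of all counts."""
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0
    haplotypes = set(counts_a) | set(counts_b)
    difference = sum(
        abs(counts_a.get(h, 0) - counts_b.get(h, 0)) for h in haplotypes
    )
    return difference / total


# Hypothetical read counts per haplotype for two specimens.
specimen_1 = {"hap_A": 10, "hap_B": 30, "hap_C": 60}
specimen_2 = {"hap_B": 30, "hap_C": 50, "hap_D": 20}

jaccard = jaccard_distance(specimen_1, specimen_2)  # shares 2 of 4 haplotypes
dissimilarity = bray_curtis(specimen_1, specimen_2)

print(jaccard, dissimilarity)
```

Both measures treat a marker that failed to amplify the same way as a haplotype that is truly absent, which illustrates why they handle the partial data described above poorly.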

2.3 Materials and Methods

2.3.1 Clinical Fecal Specimens

This study received 160 Cyclospora cayetanensis positive specimens at the National Microbiology Laboratory (NML) at Guelph. These samples were received from four provincial health partners: Public Health Ontario (PHO) Laboratory, Laboratoire de Santé Publique du Québec (LSPQ), British Columbia Centre for Disease Control (BCCDC) Public Health Laboratory, and Newfoundland Public Health Laboratory (NPHL). All clinical fecal specimens were verified as C. cayetanensis positive at their respective provincial health laboratory between 2010 and 2020 prior to being shipped to NML. The samples were received either frozen without preservatives or suspended in Cary-Blair transport medium, and were stored at -20℃ and 4℃, respectively, prior to processing.


2.3.2 Ethics

Ethics approval was granted by the Research Ethics Board of Health Canada/Public Health Agency of Canada (Ethics certificate REB 2016-010P).

2.3.3 DNA Extraction

The stool samples were mixed and aliquoted into 0.2 g portions prior to being processed with the QIAamp Fast DNA Stool Mini Kit (Qiagen, Hilden, Germany). The manufacturer’s instructions were followed with modifications. Prior to adding Buffer AL to the supernatant, oocysts were pelleted by centrifuging the sample at 5,000×g for five minutes. Half of the supernatant was removed into a new microcentrifuge tube, where the oocysts were subjected to one minute of bead beating with lysing matrix Y (MP Biomedicals, OH, USA) at 6 m/s on the Omni Bead Ruptor 24 (Omni International, GA, USA). The supernatant was collected into a new microcentrifuge tube and the beads were washed using the supernatant collected prior to bead beating. To ensure no beads were carried over, the tube was briefly centrifuged and the supernatant was collected. Finally, at the end of the manufacturer’s protocol, DNA was eluted in two rounds of 50 μl of ATE Buffer and stored at -20 ℃ until use.

2.3.4 PCR Amplification and Amplicon Clean-Up

All samples were analyzed at eight genotyping targets: six nuclear (CDC1, CDC2, CDC3, CDC4, HC378, HC360i2) and two mitochondrial (Mt-Junction, MSR), using the PCR protocol developed by the U.S. Centers for Disease Control and Prevention (CDC) (Nascimento et al., 2020), with minor modifications. To reduce inhibition, bovine serum albumin (BSA) heat shock fraction (Sigma-Aldrich, St. Louis, MO, USA) was added to each reaction at a final concentration of 300 ng/μl. The KAPA HiFi HotStart ReadyMix (Roche, Basel, Switzerland) and Platinum Taq (Invitrogen by Life Technologies, Carlsbad, CA, USA) were used as the polymerases for the nuclear and mitochondrial gene targets, respectively. The modified PCR reaction conditions to account for the different polymerases are outlined in Table 2.1. gBlock™ gene fragments were designed for all eight markers to act as the positive control. Primers and gBlock™ gene fragments were synthesized by Integrated DNA Technologies (IDT, Coralville, IA, USA).

Amplicons were visualized on the FlashGel™ DNA Cassette (Lonza Group, Basel, Switzerland). Samples with positive bands were purified and normalized using the SequalPrep™ Normalization Plate (96) Kit (Applied Biosystems by Thermo Fisher Scientific, Foster City, CA, USA) following the manufacturer’s protocol. Part way through the study, amplicons were purified using AMPure XP Beads (Beckman Coulter, Brea, CA, USA) and quantified using the NanoDrop™ 8000 Spectrophotometer for a more consistent normalization process. Finally, all clean and normalized amplicons were pooled for each sample for DNA sequencing preparation.

2.3.5 DNA Sequencing

The sequencing library, containing the combined product for each sample, was prepared using the Nextera XT DNA Library Preparation Kit (Illumina, San Diego, CA, USA). The manufacturer’s protocol was followed with the following exceptions: (1) 90 μl of the AMPure XP beads were used to clean up the libraries (instead of 30 μl), and (2) normalization of libraries was omitted. The library fragment size was measured using the Agilent 2100 Bioanalyzer system (Agilent Technologies, Santa Clara, CA, USA) and the concentration was measured with the Qubit™ Fluorometer using the Qubit™ dsDNA Broad Range Assay Kit (Invitrogen by Life Technologies, Carlsbad, CA, USA). The library was diluted to a concentration of 4 nM, prior to being further diluted to 10-15 pM (dependent on fragment size) for loading onto the cartridge. The library was sequenced on the Illumina MiSeq using the MiSeq Reagent Nano Kit v2 (500 cycles, 2 x 250 bp) (Illumina, San Diego, CA, USA).
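The dilution to 4 nM depends on converting the measured mass concentration and the mean fragment size into a molar concentration. A minimal sketch of that arithmetic is given below, assuming the conventional average mass of 660 g/mol per base pair of double-stranded DNA. The function names and the input values (12.5 ng/μl, 600 bp, 20 μl) are hypothetical and are not measurements from this study.

```python
# Sketch: convert a dsDNA library concentration to nM and plan a dilution to 4 nM.
# Assumption: an average mass of 660 g/mol per base pair of dsDNA.

AVG_BP_MASS = 660.0  # g/mol per bp (conventional approximation)


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Molar concentration (nM) of a dsDNA library."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_volumes(stock_nm, target_nm, final_ul):
    """Return (library volume, diluent volume) in µl using C1*V1 = C2*V2."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    library_ul = target_nm * final_ul / stock_nm
    return library_ul, final_ul - library_ul


# Hypothetical example values, not taken from this study.
stock = library_molarity_nm(conc_ng_per_ul=12.5, mean_fragment_bp=600)
lib_ul, diluent_ul = dilution_volumes(stock, target_nm=4.0, final_ul=20.0)
print(f"stock = {stock:.1f} nM; mix {lib_ul:.2f} µl library with {diluent_ul:.2f} µl diluent")
```

The same two functions can be reused for the subsequent dilution to the pM loading concentration by expressing the target in nM (for example, 12 pM = 0.012 nM).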

2.3.6 Sequence Analysis and Clustering

For each sample, the sequencing data were sorted by marker using a custom-made Bash script that performs a homology search using BLASTN (Altschul et al., 1990) on the demultiplexed Illumina sequencing data against a reference database containing all published sequences for each of the eight markers. To simplify the analysis, this reference database included the fragmented portions of the Nu HC378 (A, B, C) locus, Nu HC360i2 (A, B, C, D) locus, and Mt MSR (Left, Right) locus (see Nascimento et al., 2020). Subsequently, the paired-end sequencing data were trimmed and quality filtered using BBDuk (v37.62) from the BBTools suite to a minimum Phred quality score of 20 and a minimum read length of 50 base pairs.

Table 2.1. Modified PCR reaction conditions used for the Cyclospora cayetanensis targeted amplicon NGS scheme¹

| Gene Target | Primers (concentration) | Primer Sequence (5’ - 3’) | Length | PCR Conditions |
|---|---|---|---|---|
| Nu CDC1 (ATP synthase) | GT1-F (600 nM); GT1-R (600 nM) | CTCCTTGCTGCTCAGAACGA; CAAGAGAGGAGCAGTGGCAA | 175 bp | Initial Denaturation: 98 ℃ for 2 min; 35 cycles: 98 ℃ for 15 sec, 67 ℃ for 30 sec, 65 ℃ for 30 sec; Final Extension: 65 ℃ for 5 min; Hold: 4 ℃ |
| Nu CDC2 (U3 small nucleolar RNA-associated protein 11) | GT2-F (200 nM); GT2-R (200 nM) | TGCAAACTACTAAGGGCGCA; CGCCTTCTCTTGAGCCTTGA | 246 bp | Initial Denaturation: 98 ℃ for 2 min; 35 cycles: 98 ℃ for 15 sec, 67 ℃ for 15 sec, 65 ℃ for 15 sec; Final Extension: 65 ℃ for 5 min; Hold: 4 ℃ |
| Nu CDC3 (uncharacterised) | GT3-F (400 nM); GT3-R (400 nM) | AATCGAATCGGTGCAGTGCTTA; GACTGAACGTGTGAGAGGGG | 220 bp | Same as Nu CDC2 |
| Nu CDC4 (ATP-dependent RNA helicase rrp3) | GT4-F (400 nM); GT4-R (400 nM) | GTAGATGGGTCCTTGAAGGCT; CAGACGCCTAAGGAACCGAA | 179 bp | Same as Nu CDC2 |
| Nu HC378 (sec14 family protein) | HC378F (400 nM); HC378R (400 nM) | CCCCTGCCTTGTTCTTGGTGAA; CCGGCGACACAGAGGTACC | 469 bp | Same as Nu CDC2 |
| Nu HC360i2 (uncharacterised) | HC360i2F (400 nM); HC360i2R (400 nM) | CCCATTACGCCGCATAGAGT; GCATTGCAAAGCCAGTCAGC | 650 bp | Initial Denaturation: 98 ℃ for 2 min; 35 cycles: 98 ℃ for 15 sec, 67 ℃ for 30 sec, 72 ℃ for 30 sec; Final Extension: 72 ℃ for 5 min; Hold: 4 ℃ |
| Mt MT-Junction (mitochondrial junction repeat) | MT5732F (200 nM); MT6266R (200 nM) | GTCGTTACACCATTCATGCAG; CCCAAGCAATCGGATCGTGTT | ±534 bp | Initial Denaturation: 94 ℃ for 2 min; 35 cycles: 94 ℃ for 15 sec, 55 ℃ for 30 sec, 68 ℃ for 40 sec; Final Extension: 68 ℃ for 5 min; Hold: 4 ℃ |
| Mt MSR (mitochondrial rRNA) | 15F (200 nM); 688R (200 nM) | GGACATGCAGTAACCTTTCCG; AGGAAAGGTTAACCGCTGTCA | 674 bp | Same as Mt MT-Junction |

¹ Developed by Nascimento et al. (2020)

Amplicon sequencing variants (ASVs) for each marker were then determined de novo using the DADA2 (v1.16) package (Callahan et al., 2016) in R (v4.0), which uses a divisive amplicon denoising algorithm. Briefly, the error rates for each amplicon dataset were determined using a parametric error model prior to applying the core sample inference algorithm that determines the sequence variants (see Callahan et al., 2016). Subsequently, reads were mapped using bwa (v0.7.17) (Li & Durbin, 2009) against a database composed of the longest unique ASVs for each marker. The breadth and depth of coverage were assessed using SAMtools (Li et al., 2009), and a sample was assigned a specific haplotype for a marker if the breadth of coverage was 100% and the sequencing depth was at least ten across the entire length of the marker. The haplotype(s) observed for all eight markers for each sample were compiled into a tab-delimited text file for clustering. The scripts used to perform the genotyping analysis can be found in Appendix 1.

The specimens were clustered using the Eukaryotyping tool (Barratt et al., 2019; Nascimento et al., 2020). This tool combines a heuristic and a Bayesian component to calculate the distances between specimens, accounting for features of complex datasets such as missing data, heterozygosity from sexually reproducing organisms, and other genetic features that reflect the genetic complexities of C. cayetanensis (Barratt et al., 2019; Nascimento et al., 2020). For a specimen to be clustered, genotyping data had to be present for (1) at least five markers or (2) at least four markers if three from the following list were successfully sequenced: Nu HC378, Nu HC360i2, Mt MTJ, Mt MSR. The epsilon value was calculated by dividing the total number of loci that were missing sequence data by the total number of loci for the given number of samples studied.
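The haplotype assignment rule described above (100% breadth of coverage and a depth of at least ten across the whole marker) can be sketched as follows. This is an illustration only and is not the script from Appendix 1: it assumes per-position depths have already been extracted (for example, from the output of `samtools depth -a`), and the haplotype labels, depth values, and function names are hypothetical.

```python
# Sketch of the haplotype-calling rule: accept a haplotype for a marker only if
# every reference position is covered (100% breadth) at a depth of at least 10.
# Input layout and names are illustrative, not the thesis pipeline.

MIN_DEPTH = 10


def coverage_stats(depths):
    """Return (breadth of coverage, minimum depth) for per-position depths."""
    if not depths:
        return 0.0, 0
    covered = sum(1 for d in depths if d > 0)
    return covered / len(depths), min(depths)


def call_haplotypes(depths_by_haplotype, min_depth=MIN_DEPTH):
    """Return the sorted haplotypes passing both the breadth and depth criteria."""
    calls = []
    for haplotype, depths in depths_by_haplotype.items():
        breadth, lowest = coverage_stats(depths)
        if breadth == 1.0 and lowest >= min_depth:
            calls.append(haplotype)
    return sorted(calls)


# Toy depth profiles for one marker in one sample (hypothetical values).
example = {
    "hap1": [35, 40, 42, 38, 36],  # fully covered and deep enough: called
    "hap2": [12, 9, 15, 14, 11],   # one position below 10x: rejected
    "hap3": [20, 0, 25, 22, 21],   # one uncovered position: rejected
}
print(call_haplotypes(example))
```

Under this rule, a sample in which two haplotypes of the same marker both pass would be recorded as having mixed haplotypes at that marker.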
The pairwise distance matrix generated by the Eukaryotyping tool was clustered and visualized following Nascimento et al. (2020) in R using the ‘cluster’ (v2.1.0) and ‘ggtree’ (v2.0.0) packages.
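The inclusion criteria for clustering and the epsilon calculation described in this section can be sketched as below. Two assumptions are made and should be read as such: "three from the following list" is interpreted as "at least three", and epsilon is computed at the level of the eight markers rather than the fragmented parts of HC378, HC360i2, and MSR. The per-marker counts in the example are those reported in Table 2.8 for the 127 clustered specimens; the resulting value is illustrative and is not a value reported by the study.

```python
# Sketch of the clustering inclusion rule and the epsilon calculation.
# Assumptions: "three from the list" means at least three, and loci are
# counted per marker (eight markers), not per fragment.

DISCRIMINATORY = {"HC378", "HC360i2", "MTJ", "MSR"}


def can_cluster(genotyped_markers):
    """True if a specimen has enough genotyped markers to be clustered."""
    genotyped = set(genotyped_markers)
    if len(genotyped) >= 5:
        return True
    return len(genotyped) >= 4 and len(genotyped & DISCRIMINATORY) >= 3


def epsilon(successes_per_marker, n_samples):
    """Number of missing loci divided by the total number of loci."""
    total = n_samples * len(successes_per_marker)
    missing = total - sum(successes_per_marker.values())
    return missing / total


# Per-marker sequencing successes among clustered specimens (Table 2.8).
table_2_8 = {
    "CDC1": 63, "CDC2": 104, "CDC3": 114, "CDC4": 101,
    "HC378": 120, "HC360i2": 114, "MTJ": 118, "MSR": 120,
}

print(can_cluster(["CDC1", "HC378", "HC360i2", "MSR"]))  # four markers, three from the list
print(can_cluster(["CDC1", "CDC2", "CDC3", "HC378"]))    # four markers, one from the list
print(round(epsilon(table_2_8, n_samples=127), 3))
```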

2.4 Results

2.4.1 Sequencing Success

Of the 160 C. cayetanensis specimens received for this study, 142 represented unique cyclosporiasis cases diagnosed between 2010 and 2020. The distribution of specimens received in each year and province is outlined in Table 2.2.

Table 2.2. Provincial and yearly distribution of Cyclospora cayetanensis specimens received

Values are No. Specimens Received (% Total).

| Year | ON | QC | BC | NL | Total Specimens |
|---|---|---|---|---|---|
| 2020 | 29a (52.7%) | 24 (43.6%) | 0 (0%) | 2 (3.6%) | 55 (34.4%) |
| 2019 | 48b (76.2%) | 0 (0%) | 7 (11.1%) | 8 (12.7%) | 63 (39.4%) |
| 2018 | 4 (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 4 (2.5%) |
| 2017 | 19c (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 19 (11.9%) |
| 2016 | 3 (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 3 (1.9%) |
| 2015 | 10 (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 10 (6.3%) |
| 2014 | 4d (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 4 (2.5%) |
| 2013 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| 2012 | 1 (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.6%) |
| 2011 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) |
| 2010 | 1 (100%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.6%) |
| Total | 119 (74.4%) | 24 (15.0%) | 7 (4.4%) | 10 (6.3%) | 160 (100%) |

a Two sets of duplicates were received in ON 2020 (27 unique cases total)
b Two sets of duplicates and three sets of triplicates were received in ON 2019 (40 unique cases total)
c Two sets of duplicates and two sets of triplicates were received in ON 2017 (13 unique cases total)
d A triplicate was received in ON 2014 (2 unique cases total)
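The footnotes to Table 2.2 can be reconciled with the 142 unique cases stated in the text by subtracting the extra specimens contributed by each replicate set. The short sketch below works through that arithmetic using only the counts given in the table and its footnotes; the function and variable names are illustrative.

```python
# Sketch: derive unique case counts from specimen counts and replicate sets,
# using the values in Table 2.2 and its footnotes.


def unique_cases(specimens, duplicate_sets=0, triplicate_sets=0):
    """Each duplicate set adds 1 extra specimen; each triplicate set adds 2."""
    return specimens - duplicate_sets * 1 - triplicate_sets * 2


# year: (ON specimens, duplicate sets, triplicate sets)
ontario_replicates = {
    2020: (29, 2, 0),
    2019: (48, 2, 3),
    2017: (19, 2, 2),
    2014: (4, 0, 1),
}

total_specimens = 160
extra_specimens = 0
for year, (specimens, dups, trips) in ontario_replicates.items():
    cases = unique_cases(specimens, dups, trips)
    extra_specimens += specimens - cases
    print(f"ON {year}: {specimens} specimens, {cases} unique cases")

print(f"Unique cases overall: {total_specimens - extra_specimens}")
```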

In total, 96.2% (154/160) had at least one marker sequenced successfully and 36.9% (59/160) had all eight markers sequenced successfully (Figure 2.1). The average number of markers that successfully amplified for a given sample was six.

Figure 2.1. Cumulative success rate of markers sequenced in the Cyclospora cayetanensis targeted deep amplicon genotyping scheme. The proportion of the 160 Canadian clinical C. cayetanensis positive specimens that were successfully sequenced and genotyped decreased as the cumulative number of markers increased.

Of the markers in the targeted amplicon scheme, the mitochondrial markers performed better in terms of sequencing success rates than the nuclear markers (Table 2.3). The nuclear marker CDC1 was the least sensitive, with a sequencing success rate of 40.6% (65/160), while the mitochondrial marker MSR had the highest sequencing success rate of 92.5% (148/160). The more discriminatory markers (Nu HC378, Nu HC360i2, Mt Mt-Junction, Mt MSR) each had a sequencing success rate ≥74.4%. The less discriminatory markers (Nu CDC1-4) each had a sequencing success rate <70%, with the exception of Nu CDC3, which had a sequencing success rate of 76.9% (Table 2.3).

Table 2.3. Sequencing success rate and haplotypes observed for each marker in the subtyping scheme

| Marker | Sequencing Success Rate (n=160) | Number of Haplotypes Observed |
|---|---|---|
| CDC1 | 65 (40.6%) | 2 |
| CDC2 | 109 (68.1%) | 3 |
| CDC3 | 123 (76.9%) | 2 |
| CDC4 | 106 (66.3%) | 2 |
| HC378 | 135 (84.4%) | 6 (Part A); 2 (Part B); 4 (Part C) |
| HC360i2 | 119 (74.4%) | 2 (Part A); 7 (Part B); 4 (Part C); 3 (Part D) |
| Mt-Junction | 144 (90.0%) | 11 |
| MSR | 148 (92.5%) | 3 (Left); 2 (Right) |
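The percentages in Table 2.3 and the statement that an average of six markers were sequenced per sample both follow directly from the per-marker counts. The sketch below reproduces that arithmetic from the counts in the table; it uses half-up decimal rounding as an assumption about how the percentages were rounded, and the names are illustrative.

```python
# Sketch: recompute per-marker sequencing success rates and the mean number of
# markers sequenced per specimen from the counts in Table 2.3.
from decimal import Decimal, ROUND_HALF_UP

N_SPECIMENS = 160
successes = {
    "CDC1": 65, "CDC2": 109, "CDC3": 123, "CDC4": 106,
    "HC378": 135, "HC360i2": 119, "Mt-Junction": 144, "MSR": 148,
}


def percent(count, total):
    """Percentage rounded half-up to one decimal place."""
    value = Decimal(100 * count) / Decimal(total)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for marker, count in successes.items():
    print(f"{marker}: {count}/{N_SPECIMENS} = {percent(count, N_SPECIMENS)}%")

# Every successful marker-specimen pair counts once, so the sum of the
# per-marker counts divided by the number of specimens is the mean per specimen.
mean_markers = sum(successes.values()) / N_SPECIMENS
print(f"Mean markers sequenced per specimen: {mean_markers:.2f}")
```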

2.4.1.1 Nuclear Markers Performance

To assess the distribution of haplotypes found within Canadian cyclosporiasis cases for both nuclear and mitochondrial markers, a representative specimen for each unique case was chosen based on the highest sequencing read depth. Of the six nuclear markers, fewer haplotypes were observed for CDC1-4 than for HC378 and HC360i2. Two haplotypes were recorded for each of the Nu CDC1-4 markers, except at the CDC2 locus. A new CDC2 haplotype (haplotype 3) was observed in a 2010 specimen with 99.5% identity to haplotype 2, with a single nucleotide substitution at base 102 of the GT2_Hap_2 CDC2 marker (MN367322.1). Among unique specimens with successful sequencing at a specific marker, a single haplotype was observed in at least 95% of samples for each of the CDC markers, with mixed haplotypes recorded for 1.6% (1/61), 3.1% (3/98), 4.5% (5/110), and 2.0% (2/99) of samples for Nu CDC1 through CDC4, respectively. Furthermore, most samples contained haplotype 2 at the CDC1 and CDC4 markers, representing 85.2% (52/61) and 82.8% (82/99) of samples, with 13.1% (8/61) and 15.2% (15/99) of unique specimens containing haplotype 1 at the Nu CDC1 and CDC4 loci, respectively. In contrast, a more even distribution of haplotypes 1 and 2 was observed for the CDC2 and CDC3 markers. Haplotype 1 was observed in 54.1% (53/98) and 60.0% (66/110) of samples at the Nu CDC2 and CDC3 markers, respectively. Haplotype 2 was observed in 41.8% (41/98) of samples at the Nu CDC2 marker and 35.5% (39/110) of samples at the Nu CDC3 marker (Figure 2.2).

[Figure 2.2: bar chart. Y-axis: Proportion of Haplotypes Present (0 to 1). X-axis: Marker (CDC1, CDC2, CDC3, CDC4), with bars for Hap1, Hap2, Hap3, and Hap1/2.]

Figure 2.2. Distribution of haplotypes observed for nuclear CDC1-4 markers. The majority of Canadian C. cayetanensis specimens were homozygous at the Nu CDC1-4 loci, containing either haplotype 1 or 2, and few were heterozygous, containing a mix of haplotypes 1 and 2 (1/2). A new haplotype (haplotype 3) was recorded at the Nu CDC2 locus in a 2010 specimen.

The nuclear markers HC378 and HC360i2 were more discriminatory than the nuclear markers CDC1-4. For the Nu HC378 marker, six haplotypes were observed in Part A, two in Part B, and four in Part C (Table 2.4). For the Nu HC360i2 marker, two, six, four, and three haplotypes were observed in Parts A-D, respectively (Table 2.4). Haplotype 7 in Part B and haplotype 3 in Part D of the Nu HC360i2 marker were not previously described; these new variants were discovered in a 2016 specimen. The full sequence (haplotype 8) shares 99.4% sequence identity with haplotype 2 (MH185778.1), with three nucleotide substitutions observed. The majority of samples sequenced at these two nuclear loci were heterozygous. For both loci, approximately 93% of samples contained mixed haplotypes (Table 2.4).

Table 2.4. Haplotypes present in the nuclear markers HC378 and HC360i2

| Marker | Homozygous | Heterozygous | Haplotypes Present: Part A | Part B | Part C | Part D |
|---|---|---|---|---|---|---|
| Nu HC378 (n=121) | 8 (6.6%) | 113 (93.4%) | Hap2, Hap3, Hap4, Hap5, Hap6, Hap7 | Hap1, Hap2 | Hap1, Hap2, Hap3, Hap4 | - |
| Nu HC360i2 (n=108) | 8 (7.4%) | 100 (92.6%) | Hap1, Hap2 | Hap2, Hap3, Hap4, Hap5, Hap6, Hap7 | Hap1, Hap2, Hap3, Hap4 | Hap1, Hap2, Hap3 |

n = number of cases amplified at this marker

2.4.1.2 Mitochondrial Markers Performance

Eleven subtypes of the mitochondrial junction region were observed (Table 2.5). The most abundant subtype was Cmt154.A, representing 45.0% (58/129) of specimens, followed by Cmt154.B in 19.4% (25/129) of specimens. Interestingly, these two subtypes have been found in specimens every year since 2014. Three new haplotypes of the MTJ marker were discovered and named following the standard nomenclature outlined in Nascimento et al. (2019): Cmt154.F from a 2012 specimen, and Cmt199.D and Cmt214.B from two 2019 specimens. Notably, Cmt154.F also contains two consecutive SNPs (AT to TA) at bases 61-62 of MH430076.1, which are not observed in any previously published MTJ haplotypes. Mixed haplotypes were observed in 3.9% (5/129) of specimens. Three and two haplotypes were observed on the left and right side of the Mt MSR locus, respectively (Table 2.6). Haplotype 1 on the left side and haplotype 2 on the right side of the MSR locus were observed in all years for which samples were received. Mixed haplotypes were observed in 5.3% (7/131) and 3.8% (5/131) of specimens for the left and right portions of the MSR marker, respectively.

Table 2.5. Mitochondrial junction subtypes observed in Canadian Cyclospora cayetanensis specimens

| Mitochondrial Junction Subtype | Number of Samples (n=129) | Years Observed |
|---|---|---|
| Cmt154.A | 58 (45.0%) | 2010, 2014-2020 |
| Cmt154.B | 25 (19.4%) | 2014-2020 |
| Cmt154.F* | 1 (0.8%) | 2012 |
| Cmt169.A | 16 (12.4%) | 2015, 2017, 2019, 2020 |
| Cmt169.B | 9 (7.0%) | 2017-2019 |
| Cmt184.B | 10 (7.8%) | 2017, 2019, 2020 |
| Cmt184.C | 1 (0.8%) | 2019 |
| Cmt199.A | 1 (0.8%) | 2019 |
| Cmt199.C | 1 (0.8%) | 2020 |
| Cmt199.D* | 1 (0.8%) | 2019 |
| Cmt214.B* | 1 (0.8%) | 2019 |
| Mixed | 5 (3.9%) | 2015, 2019 |

*New subtypes observed in this study

Table 2.6. Haplotypes observed in the mitochondrial MSR marker in Canadian Cyclospora cayetanensis specimens

| Mt MSR | Haplotype | Number of Samples (n=131) | Years Observed |
|---|---|---|---|
| Left | 1 | 87 (66.4%) | 2010, 2012, 2014-2020 |
| Left | 2 | 16 (12.2%) | 2015, 2019, 2020 |
| Left | 3 | 21 (16.0%) | 2017-2020 |
| Left | Mix | 7 (5.3%) | 2015, 2019 |
| Right | 1 | 30 (22.9%) | 2015, 2019, 2020 |
| Right | 2 | 96 (73.3%) | 2010, 2012, 2014-2020 |
| Right | Mix | 5 (3.8%) | 2019 |

2.4.2 Clustering

Following the clustering criteria of the Eukaryotyping tool (Barratt et al., 2019; Nascimento et al., 2020), 79.4% (127/160) of specimens were successfully clustered with the ensemble-based method (Table 2.7). Among the samples that were successfully clustered, each locus had a sequencing success rate ≥79.5%, except CDC1. The nuclear CDC1 locus had the most missing data, with a sequencing success rate of 49.6% (Table 2.8).

Table 2.7. Distribution of Cyclospora cayetanensis samples that successfully clustered

| Year | Successfully Clustered |
|---|---|
| 2020 (n=55) | 41 (74.5%) |
| 2019 (n=63) | 53 (84.1%) |
| 2018 (n=4) | 4 (100%) |
| 2017 (n=19) | 10 (52.6%) |
| 2016 (n=3) | 3 (100%) |
| 2015 (n=10) | 10 (100%) |
| 2014 (n=4) | 4 (100%) |
| 2012 (n=1) | 1 (100%) |
| 2010 (n=1) | 1 (100%) |
| Total (n=160) | 127 (79.4%) |

Table 2.8. Sequencing success rate at each locus for samples that successfully clustered

| Marker | Sequencing Success Rate of Clustered Samples (n=127) |
|---|---|
| CDC1 | 63 (49.6%) |
| CDC2 | 104 (81.9%) |
| CDC3 | 114 (89.8%) |
| CDC4 | 101 (79.5%) |
| HC378 | 120 (94.5%) |
| HC360i2 | 114 (89.8%) |
| MTJ | 118 (92.9%) |
| MSR | 120 (94.5%) |

A cluster dendrogram (Figure 2.3) was created based on a pairwise distance matrix that was calculated using the ensemble-based method described previously (see Barratt et al., 2019; Nascimento et al., 2020). The 127 specimens that had adequate sequencing information to be successfully clustered represented 118 cyclosporiasis cases, as eight sets of replicate samples were present (seven duplicates and one triplicate). As expected, the replicates of each case clustered together within the same clade. Using the CDC targeted amplicon deep sequencing database of C. cayetanensis specimens, eighteen genetic clusters were identified (Figure 2.3). However, because no epidemiological data were available for the Canadian specimens, we cannot fully evaluate these genetic clusters to determine whether specimens were clustered correctly.

Figure 2.3. Cluster dendrogram of Canadian Cyclospora cayetanensis specimens. This dendrogram includes 127 Canadian C. cayetanensis specimens, representing 118 cyclosporiasis cases that occurred between 2010 and 2020. Eighteen genetic clusters were identified using the CDC C. cayetanensis targeted amplicon deep sequencing reference database and Ward’s clustering method.

2.5 Discussion

We successfully applied the targeted amplicon NGS approach for genotyping C. cayetanensis described by Nascimento et al. (2020) to Canadian cyclosporiasis cases. This scheme genotyped at least one marker for 96.2% (154/160) of C. cayetanensis specimens collected between 2010 and 2020, and 79.4% (127/160) of specimens met the minimum requirements for clustering using the ensemble-based analysis (Nascimento et al., 2020; Barratt et al., 2019). These proportions are similar to the 93.7% success rate for genotyping at least one marker and the 70% success rate for clustering reported by Nascimento et al. (2020) for their USA specimens collected in 2018. The percentage of specimens with successful sequencing data decreased as the cumulative number of markers increased. Interestingly, no trend was observed between the Ct value of a given sample using the qPCR assay described by Murphy et al. (2017) and the number of markers that were successfully sequenced for that sample (data not shown). Possible explanations include varying degrees of DNA quality between specimens or the presence of varying inhibitors found in stool.

Of the six nuclear markers in the scheme, the four Nu CDC loci had the fewest haplotypes reported along with some of the lowest sequencing success rates. However, they had a low heterozygosity rate, with each marker reporting ≤5% of sequenced specimens with mixed haplotypes, simplifying analyses. The Nu CDC1 marker performed the least well, with a sequencing success rate of 40.6% (65/160). This is lower than the 61% sequencing success rate reported by Houghton et al. (2020) using Sanger sequencing. Of the specimens that successfully clustered, Nascimento et al. (2020) reported that 56.8% had sequencing data for the Nu CDC1 locus, whereas our study found 49.6%. In addition, only two different haplotypes were reported for this marker, similar to what was reported in the USA (Houghton et al., 2020; Nascimento et al., 2020), with the majority of samples having haplotype 2. Overall, the Nu CDC1 locus had the worst sequencing success rate and was one of the least discriminatory markers. The Nu CDC4 locus displayed a similar trend, with one of two haplotypes dominating, but with a higher sequencing success rate of 66.3% (106/160). As a result, Nu CDC1 and CDC4 were the lowest performing markers in the scheme and therefore should be the first to be replaced by more sensitive and discriminatory markers.

The other two Nu CDC markers, CDC2 and CDC3, were the next best performing markers. Similar to what was reported by Houghton et al. (2020), two haplotypes dominated at both the Nu CDC2 and CDC3 loci. We did not observe any haplotype 3 at the CDC3 locus, but a new CDC2 haplotype (haplotype 3) was discovered in a 2010 specimen. The nuclear markers CDC2 and CDC3 had sequencing success rates of 68.1% (109/160) and 76.9% (123/160), respectively, and haplotype data were present for 81.9% (104/127) of clustered specimens at the Nu CDC2 locus and 89.8% (114/127) of specimens at the Nu CDC3 locus. This compares with the 83.5% and 83.0% of clustered US specimens with haplotype data for the Nu CDC2 and CDC3 loci reported by Nascimento et al. (2020).
The Nu HC378 and HC360i2 loci both had a heterozygosity rate of approximately 93%, with only ~7% of specimens with data at these loci having a single haplotype present. Barratt et al. (2019) reported heterozygous subtypes for 80% and 92% of specimens at the Nu HC378 and Nu HC360i2 loci, respectively. The data presented by Barratt et al. (2019) supported the notion that these loci exist as a single copy in each haploid genome. Our data further suggest that it is common to have different haplotypes for these loci in the haploid genome of each sporocyst. However, this high heterozygosity rate would make it more challenging to examine these markers using Sanger sequencing due to the presence of mixed peaks. All subtypes currently described for these two markers in the USA (Barratt et al., 2019; Nascimento et al., 2020) were found in our Canadian dataset, except for haplotype 1 in the Part A region of the Nu HC378 marker. In addition, a new HC360i2 haplotype (haplotype 8) was reported in a 2016 specimen, sharing 99.4% sequence identity with haplotype 2 (MH185778.1). With moderate sequencing success rates of 84% (135/160) and 74% (119/160) for the Nu HC378 and HC360i2 loci, respectively, and a relatively high number of haplotypes identified, these two nuclear markers provide good discriminatory results for clustering cyclosporiasis cases.

Unsurprisingly, the mitochondrial markers had greater sequencing success rates than the nuclear markers due to the relatively high number of mitochondrial genome copies present (estimated 500 copies per cell; see Tang et al., 2015). For the Mt MTJ locus, the distribution of haplotypes present in Canadian cyclosporiasis cases compared well with those reported in the USA from 2013-2016 (Nascimento et al., 2019), with eight identical subtypes found in the two countries with similar distributions. Three new subtypes were discovered in this study that were not described in the US: Cmt154.F in a 2012 specimen, and Cmt199.D and Cmt214.B in two different 2019 specimens. These new subtypes were named based on the nomenclature described by Nascimento et al. (2019). Similarly, all haplotypes described at the Mt MSR locus (Barratt et al., 2019; Nascimento et al., 2020) were found in our Canadian specimens. Due to their high sequencing success rates and discriminatory power, these two mitochondrial markers can aid in resolving genetic clusters.

In order for an MLST scheme to resolve genetic clusters accurately, the sequenced loci must be sensitive, reproducible, and have high discriminatory power.
To improve the targeted amplicon scheme evaluated in this study, additional nuclear markers with high entropy should replace the Nu CDC1 and CDC4 markers. The original MLST scheme developed by Guo et al. (2016) contained five microsatellite markers located in the nuclear genome. However, this scheme had poor resolution because heterozygous sequences produced uninterpretable Sanger chromatograms (see Guo et al., 2016; Li et al., 2017; Hofstetter et al., 2019). Fortunately, NGS amplicon sequencing can resolve heterozygous sequences within a sample, and therefore revisiting these markers and assessing their discriminatory power may improve the ability of the current scheme to resolve genetic clusters.

Because no Canadian epidemiological data were available for the specimens in this study, we could not evaluate how the genetic clusters produced by the NGS targeted amplicon scheme compare to epidemiological case-linkage clusters. All eight replicates within the clustered dataset were located under the same clade as expected, despite six of the replicates having different numbers of successfully sequenced markers. Further research is required to determine the concordance between the eighteen genetic clusters identified in this study and epidemiologically defined clusters.

2.6 Conclusion

This is the first study to successfully genotype Canadian clinical C. cayetanensis specimens and illustrate how the most recent targeted amplicon NGS scheme performs on these specimens. In total, we were able to genotype at least one marker for 96.2% of specimens collected and provide clustering information for 79.4% of specimens. The clustering performance of this MLST scheme could not be evaluated in this study due to the lack of epidemiological data available. Of the eight markers represented in this scheme, the two mitochondrial markers had the highest sequencing success rates with good discriminatory power. Conversely, the two nuclear markers CDC1 and CDC4 had the lowest sequencing success rates with little discriminatory power. To improve the genetic resolution of this MLST scheme, research should focus on identifying new markers through WGS comparative analyses, particularly novel nuclear markers with greater sequencing success rates and higher entropy that can replace the two current nuclear markers that have little discriminatory power.

CHAPTER 3: WHOLE GENOME SEQUENCING OF CANADIAN CYCLOSPORA CAYETANENSIS SPECIMENS

3.1 Abstract

Cyclospora cayetanensis is an emerging human pathogen worldwide and is becoming more prevalent in foodborne outbreaks across Canada. Recent research has focused on developing a molecular method to characterize C. cayetanensis isolates in order to complement epidemiological investigations, which are often hampered by the long incubation period and various environmental transmission vehicles associated with this coccidian parasite. Few whole genome assemblies have been published for C. cayetanensis, with the majority of assemblies originating from isolates in the United States and assembled using short-read sequencing techniques. To improve the geographical diversity of genomes available, this study presents the first five whole genome assemblies from Canadian C. cayetanensis isolates. This study also presents the first genome assembly that uses both long and short read sequencing data for this parasite. This hybrid assembly was created by combining long reads obtained from the Oxford Nanopore MinION and short reads obtained from the Illumina MiSeq. The final hybrid assembly measures 44.2 Mbp in length and consists of 297 contigs with an N50 value of 654019. The data presented in this study will support subsequent comparative genomic analyses for identifying novel subtyping markers for C. cayetanensis. Moreover, the improved quality of the hybrid genome assembly will provide further insight into the parasite’s genome organization. This information will ultimately aid in understanding the biology of C. cayetanensis and in mitigating outbreaks.

3.2 Introduction

Cyclospora cayetanensis is an emerging apicomplexan parasite that causes the gastrointestinal disease cyclosporiasis. Humans can acquire this infection by ingesting food or water contaminated with sporulated oocysts either during travel to an endemic region or through foods imported from an endemic region (Shields & Olson, 2003). Despite the growing frequency at which cyclosporiasis is diagnosed as a cause of foodborne outbreaks in North America (Casillas et al., 2018; Morton et al., 2019; Hadjilouka & Tsaltas, 2020), there is a lack of validated molecular subtyping tools available to complement epidemiological case-linkage investigations due to the limited genetic information available for this parasite.

There are many challenges associated with obtaining high-quality whole genomic sequence information for C. cayetanensis. Cyclosporiasis is often underdiagnosed, and when diagnosed in Canada the specimen is commonly fixed with SAF (Sodium acetate-Acetic acid-Formalin) transport medium to preserve the parasite for microscopic examination. The formalin cross-links and denatures the DNA (Srinivasan et al., 2002), inhibiting downstream molecular analysis. Of the few unpreserved samples collected, specimens with high parasite load are required, as it is estimated that a standard diagnostic fecal sample contains only picogram amounts of C. cayetanensis DNA (Nascimento et al., 2016). Without a mechanism to culture this parasite, oocyst enrichment and purification from human stool is required prior to any sequencing efforts. This purification effort consists of discontinuous density gradients with the option of including flow cytometry (Qvarnstrom et al., 2018).

The first whole genome assembly for C. cayetanensis was released in 2014 (NCBI Bioproject PRJNA256987) and both Tang et al. (2015) and Liu et al. (2016) have shown the genome to be very similar to that of the poultry-infecting coccidian parasite, Eimeria tenella. There are now 37 assemblies available (NCBI Genome), all assembled using short-read sequencing data and representing isolates from seven different countries. The current reference assembly, CcayRef3 (GCF_002999335.1), consists of 738 contigs that measure 44.4 Mbp in length and has a GC content of 51.9%.
Other assembly statistics include an N50 of 192560, an L50 of 73, and 407 (91.3%) complete single-copy orthologs when evaluated against BUSCO’s (version 5.0.0) apicomplexa_odb10 database (Seppey et al., 2019). Analysis of the C. cayetanensis genome reveals repeat-rich and repeat-poor regions alternating throughout the genome (Liu et al., 2016), creating challenges in assembling highly continuous and complete genomes with short-read sequences alone. Long-read sequencing technologies are required to resolve these repetitive elements, join contigs, and create a more complete and continuous genome assembly. Complete genomes provide the information necessary to identify structural variations and to understand the biology of the organism.
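The N50 and L50 statistics quoted above summarize assembly contiguity: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and L50 is the number of contigs in that set. A minimal sketch of the calculation is shown below. The contig lengths are a toy example and are not drawn from any C. cayetanensis assembly, and the function name is illustrative.

```python
# Sketch: compute N50 and L50 from a list of contig lengths.


def n50_l50(contig_lengths):
    """Return (N50, L50) for the given contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # `length` is the shortest contig needed to reach half the assembly.
            return length, count
    raise ValueError("no contigs supplied")


# Toy example: total length 300, so half is 150; the two longest contigs
# (80 + 70) reach it, giving N50 = 70 and L50 = 2.
toy_contigs = [10, 80, 30, 70, 50, 20, 40]
print(n50_l50(toy_contigs))
```

A higher N50 together with a lower L50 for a similar total length, as reported for the hybrid assembly relative to the reference, indicates that the assembly is composed of fewer, longer contigs.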

To improve the geographical diversity of C. cayetanensis genomes available for comparative analysis studies, we sequenced five locally-acquired Canadian C. cayetanensis isolates using the Illumina MiSeq platform (Illumina, San Diego, CA, USA).

3.3 Materials and Methods

3.3.1 Stool Specimens

Five clinical stool specimens were collected from two Canadian provinces: one from the Newfoundland Public Health Laboratory (NPHL) in 2020 and four from Public Health Ontario (PHO), including three from 2019 and one from 2020. These samples were chosen for this study due to their relatively high parasite load, estimated with quantitative PCR (Murphy et al., 2017), and large sample volume. Samples received in 2019 (n=3) had been stored at 4 °C and were preserved in 2.5% (w/v) aqueous potassium dichromate upon receipt at the National Microbiology Laboratory. The 2020 samples (n=2) were received frozen and stored at -20 °C until use.

3.3.2 Ethics

Ethics approval was granted by the Research Ethics Board of Health Canada/Public Health Agency of Canada (Ethics certificate REB 2016-010P).

3.3.3 Oocyst Purification

Oocysts were collected, enriched and semi-purified by passing the sample through a sieve, processing it through discontinuous density gradients, and subjecting it to a bleach wash. This is a modified version of the purification process described by Qvarnstrom et al. (2018). To elaborate, 5 g of each stool sample was weighed into a 50 ml polypropylene tube and diluted with an equal part of 0.85% saline solution. The stool suspension was passed through a 125 µm flat sieve (SATA Rapid Preparation System 0.3 L Cup, SATA USA, Spring Valley, MN) by gravity filtration. Once filtered, the solid remnant on the filter was washed with 1.5% (w/v) Alconox detergent solution (Alconox Inc., White Plains, NY, USA) and re-filtered using the same apparatus. The filtrate was then centrifuged at 1500×g for 10 minutes. The supernatant was removed and the resulting pellet was resuspended in 8 ml or 20 ml, depending on whether the pellet was <1 ml or ≥1 ml, respectively.

After filtration, each sample was processed through a differential sucrose gradient. Sheather’s solutions with densities of 1.103 g/ml and 1.064 g/ml were prepared following the protocol outlined in Appendix 2. Once the density gradient solutions were prepared, 16 ml of the lower density solution was carefully pipetted over 16 ml of the higher density solution in a 50 ml polypropylene conical centrifuge tube. The sample was diluted with an equal volume of 1.5% Alconox solution and vortexed to ensure homogeneity. Then, 8 ml of the diluted sample was carefully overlaid onto the Sheather’s gradient and centrifuged at 1000×g for 25 minutes at 4 ℃, with slow acceleration and deceleration set. Immediately following centrifugation, 25 ml was collected from the interface of the density gradients. These oocyst-enriched fractions were diluted with 15 ml of 0.85% saline solution and centrifuged at 3000×g for 10 minutes at 4 ℃. Supernatants were then removed and the resulting pellets were combined for each sample in a 50 ml tube by resuspending them in 5 ml of 0.85% saline solution. Each tube that had contained a pellet was rinsed twice with 0.85% saline solution and the rinse was combined with the pooled resuspended pellets. The tube was centrifuged at 3000×g for 10 minutes at 4 ℃ and the supernatant was removed. The resulting pellet underwent a secondary sucrose gradient, to further remove contaminating organisms, following the same protocol described for the primary gradient. The pellet following the secondary sucrose gradient was then resuspended in 1 ml of 0.85% saline solution in a microfuge tube. Following the sucrose gradients, purified oocysts were subjected to a cesium chloride gradient. The cesium chloride solution was prepared by dissolving 21.75 g of cesium chloride in 103.25 ml of deionized water to give a density of 1.15 g/ml.
The purified oocysts obtained from the secondary sucrose gradient were centrifuged at 16,000×g for 3 minutes at 4 ℃ and the pelleted oocysts were resuspended in 1.2 ml of 1.5% Alconox solution. The resuspended solution was carefully mixed and 0.5 ml aliquots were overlaid on top of 800 µl of the cesium chloride solution in 1.5 ml microcentrifuge tubes. The tubes were centrifuged at 16,000×g for 3 minutes at 4 ℃ and 1 ml of the interface (half from the sample layer and half from the cesium chloride layer) was collected and placed into a new microcentrifuge tube. For each tube, the interface solution was thoroughly mixed by inversion prior to transferring half of the solution to a new microcentrifuge tube and diluting each tube with an equal volume of 0.85% saline solution. The tubes were then centrifuged at 16,000×g for 3 minutes at 4 ℃ and the pellets were pooled together into one tube by resuspending the pellets and rinsing the tubes with 0.85% saline solution until the final tube measured 1.2 ml in volume. The resulting tube was centrifuged at 16,000×g for 3 minutes at 4 ℃ and the supernatant was discarded. Finally, the cesium chloride gradient purified oocysts underwent an additional step that was not described in Qvarnstrom et al. (2018). A bleach wash was performed to further remove contaminating organisms by adding 1 ml of 3% sodium hypochlorite solution to the pelleted oocysts. The resuspended pellet was incubated on ice for 10 minutes and then centrifuged at 16,000×g for 3 minutes at 4 ℃. The supernatant was removed and the pellet was washed by resuspending it in 1 ml of 0.85% saline solution and centrifuging at 16,000×g for 3 minutes at 4 ℃. The final pellet of enriched and purified oocysts was resuspended in 0.85% saline solution and stored at 4 ℃ for up to two days until DNA extraction was performed.

3.3.4 DNA Extraction

The purified and enriched C. cayetanensis oocysts were pelleted by centrifugation at 16,000×g for 3 minutes at 4 ℃, prior to being processed with the QIAamp Fast DNA Stool Mini Kit (Qiagen, Hilden, Germany). The manufacturer’s instructions were followed with the following modifications. Prior to adding Buffer AL to the supernatant, oocysts were pelleted by centrifugation at 5,000×g for five minutes. Half of the supernatant was removed into a new microcentrifuge tube, after which Lysing Matrix Y beads in a 2 ml tube (MP Biomedicals, OH, US) were added to the oocysts, which were subjected to one minute of bead beating at 6 m/s on the Omni Bead Ruptor. The supernatant was collected into a new microcentrifuge tube and the beads were washed using the supernatant that had been set aside before bead beating. To ensure no beads were carried over, the supernatant was briefly centrifuged, carefully transferred to a new microcentrifuge tube, and Buffer AL was added. Finally, at the end of the manufacturer’s protocol, DNA was eluted in two rounds of 50 μl of Buffer ATE and stored at -20 ℃ until use.


3.3.5 Illumina MiSeq Sequencing

The DNA for all five Canadian C. cayetanensis isolates was sequenced on the Illumina MiSeq. The sequencing library was prepared with the Nextera DNA Flex Library Preparation Kit (Illumina, San Diego, CA, USA) following the manufacturer’s instructions with no deviations. A starting DNA input of 1 ng was used for all samples, determined using the Qubit™ Fluorometer with the Qubit dsDNA Broad Range Assay Kit (Invitrogen by Life Technologies, Carlsbad, CA, USA). The library fragment size was measured using the Agilent 2100 Bioanalyzer system (Agilent Technologies, Santa Clara, CA, USA) and the concentration was measured with the Qubit™ Fluorometer using the Qubit dsDNA High Sensitivity Assay Kit (Invitrogen by Life Technologies, Carlsbad, CA, USA). The barcoded genomic DNA library was loaded onto the MiSeq Reagent Kit v3 (600 cycles, 2 x 300 bp) at a final concentration of 15 pM and was paired-end sequenced.

3.3.6 Oxford Nanopore MinION Sequencing

Sample D was sequenced on the Oxford Nanopore MinION using the Rapid PCR Barcoding Kit (SQK-RPB004). The remaining four samples had insufficient DNA because of prior attempts to amplify the whole genome (see Appendix 3). A starting DNA input of 1 ng was used, determined using the Qubit dsDNA Broad Range Assay Kit, and the library was prepared following the manufacturer’s instructions with no deviations. The library was sequenced on an R9.4.1 flow cell for 24 hours (until <5% of sequencing pores were active).

3.3.7 Assembling Short-read Assemblies with MiSeq Reads

The initial quality of the demultiplexed Illumina sequencing reads was evaluated using FastQC (version 0.11.9) before the reads were quality filtered with BBDuk (version 37.62) from the BBTools suite. Reads were trimmed on both ends to remove bases with Phred scores less than 20, and reads shorter than 50 bp were removed. The remaining reads were de novo assembled using four methods: ABySS (version 4.3) (Simpson et al., 2009), IDBA-UD (version 1.1.3) (Peng et al., 2012), MaSuRCA (version 3.4.2) (Zimin et al., 2013), and SPAdes (version 3.11.1) (Bankevich et al., 2012). These assemblers were chosen as they are optimized for slightly different applications. For instance, ABySS uses a de Bruijn graph and Bloom filter for assembling large genomes from short paired-end reads (Simpson et al., 2009). IDBA-UD iteratively builds de Bruijn graphs across a range of k-mers to optimize the assembly and takes into account uneven sequencing depths (Peng et al., 2012). MaSuRCA uses both de Bruijn graph and overlap-layout-consensus approaches to build assemblies (Zimin et al., 2013), while SPAdes uses a similar iterative approach to IDBA-UD in addition to contig error-correction and assembly merging algorithms (Bankevich et al., 2012). To remove contigs belonging to contaminating species, all de novo assembled contigs were aligned against a local database containing all currently available C. cayetanensis genomes using BLASTN (version 2.9.0+) (Altschul et al., 1990). The filtered reads were then mapped back to the de novo assembly using Minimap2 (version 2.17-r941) (Li, 2018) and mapped reads were collected using SAMtools (version 1.11) (Li et al., 2009). A second round of assembly was performed for each assembler using the mapped reads. The resulting genome assemblies from all assemblers were evaluated using QUAST (version 4.6.3) (Gurevich et al., 2013) and their completeness was assessed using BUSCO (version 5.0.0) with the apicomplexan_odb10 dataset (Seppey et al., 2019). The script used for generating the above short-read assemblies can be found in Appendix 4.
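The read-filtering rule described above (trim both ends at Phred 20, discard reads shorter than 50 bp) can be illustrated with a minimal Python sketch. This is not the BBDuk implementation, whose trimming algorithm is more involved; it is a simplified end-clipping version, and the function names, the Phred+33 encoding, and the toy read are assumptions made for illustration only.

```python
# Simplified illustration of end-trimming by quality; NOT BBDuk's algorithm.
# Assumes Phred+33 encoded quality strings (hypothetical helper names).

def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(char) - offset for char in quality_line]


def trim_read(sequence, quality_line, min_quality=20, min_length=50):
    """Clip low-quality bases from both ends; return None if the read is too short."""
    scores = phred_scores(quality_line)
    start, end = 0, len(scores)
    # Walk in from the 5' end while bases are below the quality threshold.
    while start < end and scores[start] < min_quality:
        start += 1
    # Walk in from the 3' end in the same way.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None  # read is discarded
    return sequence[start:end], quality_line[start:end]


# Toy read: 5 poor bases ('#' = Q2), 52 good bases ('I' = Q40), 3 poor bases.
toy_seq = "A" * 60
toy_qual = "#" * 5 + "I" * 52 + "#" * 3
trimmed = trim_read(toy_seq, toy_qual)
```

In this toy case the 60 bp read is clipped to its 52 high-quality bases and kept, whereas a read with fewer than 50 bases remaining after clipping would be dropped.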

3.3.8 Assembling Hybrid Assemblies with MinION and MiSeq Reads

The long reads were basecalled using Guppy (version 4.0.11) on the MinION output files, with the library kit and flow cell parameters set to SQK-RPB004 and FLO-MIN106, respectively. Hybrid genome assemblies were generated using three methods: Pilon (version 1.23) (Walker et al., 2014), hybridSPAdes (version 3.11.1) (Antipov et al., 2016), and Unicycler (version 0.4.7) (Wick et al., 2017). Briefly, Pilon polishes an existing draft genome assembly using paired-end sequencing data. A long-read de novo assembly was therefore first generated using Canu (version 2.1.1-1), a long-read assembler designed for noisy single-molecule sequences (Koren et al., 2017). As in the short-read assembly pipeline, all contigs generated were aligned to a local database containing all C. cayetanensis genomes available to date using BLASTN (version 2.9.0+) (Altschul et al., 1990) and contigs derived from contaminants were removed. All de novo assemblies were polished using Nanopolish (version 0.13.2), which was provided with a sorted and indexed alignment of the basecalled reads generated with Minimap2 and SAMtools, prior to polishing the genome with paired-end reads using Pilon. Alternatively, hybridSPAdes (part of the SPAdes genome assembler) assembles long-read and short-read sequencing data together to generate accurate assemblies (Antipov et al., 2016). The --careful option was used during the assembly process to minimize mismatches and short indels. Finally, Unicycler (designed for prokaryote genomes) was run with the --linear-seqs option. Unicycler acts as a SPAdes optimizer in which an initial assembly graph is built before bridges are built to resolve repetitive regions using long reads (Wick et al., 2017). Contigs representing contaminants were removed following the same process as described above. When using Pilon, short reads were iteratively mapped to the newest assembly using Minimap2 and re-assembled until no changes were detected. Similarly, short reads and long reads were mapped iteratively to the most recent assembly generated by hybridSPAdes, with the previous assembly used as the --trusted-contigs parameter, until no changes were detected. To further close the gaps of both assemblies, IMAGE (version 2.4.1) (Tsai et al., 2010) was used with default parameters. Each of the assemblies was evaluated using QUAST and BUSCO as described in the short-read assembly section. Genomic features, including predicted genes, tRNAs, and rRNAs, for the best hybrid assembly were identified using GeneMark-ES (version 4.38) (Borodovsky & Lomsadze, 2012), tRNAscan-SE (version 2.0) (Chan & Lowe, 2019), and RNAmmer (version 1.2) (Lagesen et al., 2007), respectively. The script used for generating the above hybrid assemblies can be found in Appendix 4.
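The "repeat until no changes were detected" strategy used for the Pilon and hybridSPAdes rounds amounts to iterating a polishing step to a fixed point. The sketch below shows only that control flow. It is not the script from Appendix 4: `polish_once` is a hypothetical callable standing in for one map-and-polish round, and the `max_rounds` safety cap is an assumption added here, not something stated in the methods.

```python
# Control-flow sketch only; `polish_once` is a hypothetical stand-in for one
# round of read mapping plus polishing, and `max_rounds` is an assumed cap.

def polish_until_stable(assembly, polish_once, max_rounds=10):
    """Apply polish_once repeatedly until the assembly stops changing."""
    rounds = 0
    while rounds < max_rounds:
        updated = polish_once(assembly)
        if updated == assembly:
            break  # fixed point reached: no changes detected
        assembly = updated
        rounds += 1
    return assembly, rounds


def toy_polish(sequence):
    """Toy polishing step: resolve one ambiguous base (N) per round."""
    return sequence.replace("N", "A", 1)


final_assembly, rounds_needed = polish_until_stable("ANNA", toy_polish)
```

With the toy step, two rounds change the sequence and the third detects no change, so the loop stops. In a real pipeline the comparison would be made between assembly files or the polisher's change report rather than between in-memory strings.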

3.4 Results

3.4.1 Sample Purification

The Murphy et al. (2018) qPCR assay was used to estimate the concentration of C. cayetanensis DNA within the samples before and after the semi-purification and enrichment process, measured as the 18S rDNA copy number per µl after DNA extraction. There was an average 78.9% decrease in copy number per µl post purification (Table 3.1).


Table 3.1. Loss of Cyclospora cayetanensis in samples attributed to the purification process

18S rDNA concentration (copy # per µl)   Sample A   Sample B   Sample C   Sample D   Sample E
Before purification                      194837     186395     43969      21321      22984
After purification                       24582      25429      15376      3994       5842
% loss                                   87.4       86.4       65.0       81.3       74.6
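As a check on Table 3.1, the per-sample loss and the 78.9% average can be recomputed from the before and after copy numbers. The short Python sketch below does this; the variable and function names are illustrative, and the percent loss is assumed to be (before − after) / before × 100.

```python
# Recompute the % loss row of Table 3.1 from the reported copy numbers.
# Assumes % loss = (before - after) / before * 100.

before = {"A": 194837, "B": 186395, "C": 43969, "D": 21321, "E": 22984}
after = {"A": 24582, "B": 25429, "C": 15376, "D": 3994, "E": 5842}


def percent_loss(copies_before, copies_after):
    """Percentage of 18S rDNA copies per µl lost during purification."""
    return 100 * (copies_before - copies_after) / copies_before


losses = {
    sample: round(percent_loss(before[sample], after[sample]), 1)
    for sample in before
}
mean_loss = round(sum(losses.values()) / len(losses), 1)

for sample, loss in losses.items():
    print(f"Sample {sample}: {loss}% loss")
print(f"Mean loss: {mean_loss}%")
```

The recomputed values agree with the table, and the mean reproduces the 78.9% average decrease reported in the text.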

3.4.2 Short-Read Assemblies

The Illumina MiSeq sequencing runs produced an average of approximately 9.9 million and 29 million paired-end reads per sample for the first and second sequencing run, respectively (Table 3.2). The increase in sequencing depth between the first and second sequencing run can be explained by the additional sample in the first run (3 samples in WGSeqRun1; 2 samples in WGSeqRun2). The purification process successfully removed the majority of contaminants, as approximately 99% of the filtered reads mapped to the C. cayetanensis genome.

The SPAdes assembler outperformed the other three short-read assemblers assessed in this study. SPAdes produced genome assemblies containing approximately 1100 contigs with the greatest N50 (Table 3.3) and the most complete BUSCO statistics (Table 3.4). MaSuRCA had the worst overall performance, as its final genome assemblies had the highest number of contigs (>5000), the shortest largest contig, generally low N50 values, and the fewest complete orthologs.
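For readers less familiar with the contiguity metrics reported by QUAST, N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length, and L50 is the number of contigs needed to reach that point. A minimal Python sketch under that standard definition follows; the function name and the toy contig lengths are illustrative and are not taken from the assemblies in this study.

```python
# N50/L50 under the standard definition; toy data, not study data.

def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for contig_count, length in enumerate(lengths, start=1):
        running_total += length
        if running_total >= half_total:
            return length, contig_count
    raise ValueError("no contigs supplied")


# Toy assembly of 300 bp in total: 80 + 70 = 150 bp reaches half the length.
toy_n50, toy_l50 = n50_l50([80, 70, 50, 40, 30, 20, 10])
```

A larger N50 together with a smaller L50 indicates a more contiguous assembly, which is the basis for ranking the assemblers in Table 3.3.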


Table 3.2. Illumina read metadata for each Cyclospora cayetanensis specimen

Metadata                    Sample A            Sample B            Sample C            Sample D            Sample E
Sequencing run              WGSeqRun1           WGSeqRun1           WGSeqRun1           WGSeqRun2           WGSeqRun2
Total bases                 1,996,229,415       2,137,096,319       1,154,487,106       2,584,622,206       3,571,461,156
Total reads                 8,640,744           10,436,264          10,702,804          22,444,692          35,369,150
Filtered reads              8,061,164 (93.3%)   9,135,920 (87.5%)   9,617,666 (89.9%)   13,459,710 (60.0%)  27,204,318 (76.9%)
Mapped reads                8,029,537 (99.6%)   9,050,986 (99.1%)   9,500,617 (98.8%)   13,416,895 (99.7%)  27,017,859 (99.3%)
Average read length         202 bp              175 bp              193 bp              159 bp              156 bp
Estimated genome coverage   37X                 36X                 41X                 48X                 95X
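The table does not state how genome coverage was estimated. One common approach, assumed here, is mapped reads × average read length ÷ genome size. The sketch below applies it with an assumed genome size of 44.2 Mbp (the length of the hybrid assembly); under these assumptions the rounded values match the coverage row of Table 3.2.

```python
# Coverage estimate under two assumptions: coverage = mapped reads * average
# read length / genome size, and a genome size of 44.2 Mbp.

GENOME_SIZE_BP = 44_200_000  # assumed genome size

samples = {
    # sample: (mapped reads, average read length in bp)
    "A": (8_029_537, 202),
    "B": (9_050_986, 175),
    "C": (9_500_617, 193),
    "D": (13_416_895, 159),
    "E": (27_017_859, 156),
}


def estimated_coverage(mapped_reads, mean_read_length, genome_size=GENOME_SIZE_BP):
    """Average number of times each base of the genome is covered by a read."""
    return mapped_reads * mean_read_length / genome_size


coverage = {
    sample: round(estimated_coverage(reads, length))
    for sample, (reads, length) in samples.items()
}
print(coverage)
```

Because the estimate scales inversely with the assumed genome size, a different reference length would shift every value proportionally.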


Table 3.3. Comparison of Canadian Cyclospora cayetanensis short-read assemblers using QUAST statistics

Assembler  QUAST statistic        Sample A     Sample B     Sample C     Sample D     Sample E
ABySS      Assembly length        42.6 Mbp     42.8 Mbp     42.7 Mbp     43.2 Mbp     43.8 Mbp
ABySS      No. contigs (>500 bp)  3504         3441         3308         3345         2391
ABySS      Largest contig         196050 bp    237279 bp    239462 bp    176380 bp    259789 bp
ABySS      GC content             51.99%       51.97%       51.97%       51.99%       51.96%
ABySS      N50                    15080        15308        37266        34908        50802
ABySS      L50                    346          344          324          343          245

IDBA-UD    Assembly length        44.1 Mbp     44.0 Mbp     44.0 Mbp     43.9 Mbp     44.1 Mbp
IDBA-UD    No. contigs (>500 bp)  2091         2309         2047         2039         1875
IDBA-UD    Largest contig         345525 bp    289255 bp    229291 bp    377150 bp    476263 bp
IDBA-UD    GC content             51.92%       51.92%       51.92%       51.92%       51.92%
IDBA-UD    N50                    34752        26518        69015        62966        85141
IDBA-UD    L50                    189          227          193          201          152

MaSuRCA    Assembly length        44.0 Mbp     43.4 Mbp     43.5 Mbp     43.6 Mbp     42.9 Mbp
MaSuRCA    No. contigs (>500 bp)  5500         5625         5847         5094         9982
MaSuRCA    Largest contig         113718 bp    130664 bp    115027 bp    158684 bp    73016 bp
MaSuRCA    GC content             51.94%       51.99%       51.98%       51.97%       52.03%
MaSuRCA    N50                    18816        20198        18165        21965        8437
MaSuRCA    L50                    633          561          635          539          1417

SPAdes     Assembly length        44.6 Mbp     43.9 Mbp     44.0 Mbp     44.3 Mbp     43.9 Mbp
SPAdes     No. contigs (>500 bp)  1094         1072         1108         1116         1014
SPAdes     Largest contig         622018 bp    965223 bp    917999 bp    698789 bp    1300829 bp
SPAdes     GC content             51.88%       51.92%       51.92%       51.89%       51.92%
SPAdes     N50                    167508       173539       181724       109149       197337
SPAdes     L50                    82           81           75           121          67


Table 3.4. BUSCO assessment of short-read assemblies of Canadian Cyclospora cayetanensis specimens

Assembler  BUSCO statistic (n=446)  Sample A      Sample B      Sample C      Sample D      Sample E
ABySS      Single-Copy              379 (85.0%)   381 (85.4%)   380 (85.2%)   382 (85.7%)   395 (88.6%)
ABySS      Duplicated               1 (0.2%)      3 (0.7%)      2 (0.4%)      1 (0.2%)      1 (0.2%)
ABySS      Fragmented               43 (9.6%)     42 (9.4%)     41 (9.2%)     40 (9.0%)     30 (6.7%)
ABySS      Missing                  23 (5.2%)     20 (4.5%)     23 (5.2%)     23 (5.1%)     20 (4.5%)

IDBA-UD    Single-Copy              376 (84.3%)   394 (88.3%)   391 (87.7%)   395 (88.6%)   398 (89.2%)
IDBA-UD    Duplicated               1 (0.2%)      1 (0.2%)      1 (0.2%)      1 (0.2%)      1 (0.2%)
IDBA-UD    Fragmented               43 (9.6%)     31 (7.0%)     36 (8.1%)     29 (6.5%)     30 (6.7%)
IDBA-UD    Missing                  26 (5.9%)     20 (4.5%)     18 (4.0%)     21 (4.7%)     17 (3.9%)

MaSuRCA    Single-Copy              368 (82.5%)   350 (78.7%)   348 (78.0%)   356 (79.8%)   318 (71.3%)
MaSuRCA    Duplicated               1 (0.2%)      1 (0.2%)      1 (0.2%)      1 (0.2%)      1 (0.2%)
MaSuRCA    Fragmented               31 (7.0%)     63 (14.1%)    62 (13.9%)    57 (12.8%)    76 (17.0%)
MaSuRCA    Missing                  46 (10.3%)    32 (7.2%)     35 (7.9%)     32 (7.2%)     51 (11.5%)

SPAdes     Single-Copy              409 (91.7%)   413 (92.6%)   409 (91.7%)   405 (90.8%)   410 (91.9%)
SPAdes     Duplicated               1 (0.2%)      1 (0.2%)      1 (0.2%)      1 (0.2%)      1 (0.2%)
SPAdes     Fragmented               22 (4.9%)     20 (4.5%)     23 (5.2%)     26 (5.8%)     21 (4.7%)
SPAdes     Missing                  14 (3.2%)     12 (2.7%)     13 (2.9%)     14 (3.2%)     14 (3.2%)

n = total number of orthologous genes in the apicomplexan_odb10 database (BUSCO version 5.0.0)

3.4.3 Hybrid Assemblies

Approximately 1.1 million raw reads were obtained from the MinION sequencing run. Of these, 77.7% passed the quality thresholds, and 99.4% of the passed reads mapped to the genome of C. cayetanensis (Table 3.5). This represents a total of 2,815,154,769 bases, with an average read length of 3323 bp and an estimated genome coverage of 63X.


Table 3.5. MinION sequencing metadata of a Canadian Cyclospora cayetanensis specimen

Metadata                                          Sample D
Number of reads                                   1,096,014
Number of reads that passed                       851,973 (77.7%)
Number of reads that failed                       244,041 (22.3%)
Number of reads that mapped to C. cayetanensis    847,132 (99.4%)
Average length of mapped reads                    3323
Longest mapped read length                        15860
Estimated genome coverage                         63X

Unicycler produced the best hybrid assembly (Table 3.6). The final hybrid assembly consists of 297 contigs totalling 44.2 Mbp in length, with a GC content of 51.91%. The largest contig measures 2.16 Mbp, about double the largest contig of the current reference assembly (CcayRef3), and 22 contigs represent half of the genome. In terms of completeness, 91.9% (410/446) of orthologs were complete, which is two more than in the current reference assembly.

Table 3.6. QUAST assembly statistics of the Cyclospora cayetanensis hybrid assemblies

Assembler      Assembly length  No. contigs (>500 bp)  Largest contig  GC content  N50     L50
hybridSPAdes   44.6 Mbp         310                    1973179 bp      51.87%      498605  28
Pilon          44.6 Mbp         343                    1973187 bp      51.86%      561776  25
Unicycler      44.2 Mbp         297                    2155147 bp      51.91%      654019  22


Table 3.7. BUSCO statistics (n=446) of the Cyclospora cayetanensis hybrid assemblies

Assembler      Single-Copy    Duplicated   Fragmented   Missing
hybridSPAdes   410 (91.9%)    1 (0.2%)     21 (4.7%)    14 (3.2%)
Pilon          410 (91.9%)    1 (0.2%)     16 (3.6%)    19 (4.3%)
Unicycler      409 (91.7%)    1 (0.2%)     21 (4.7%)    15 (3.4%)
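BUSCO reports complete orthologs as the sum of the single-copy and duplicated categories, so the completeness of each hybrid assembly can be derived directly from Table 3.7. The sketch below is illustrative only (the function and variable names are not from the thesis scripts); it also checks that the four categories sum to the 446 orthologs in apicomplexan_odb10.

```python
# Derive BUSCO completeness from the category counts in Table 3.7.
# Complete = single-copy + duplicated, as a share of all 446 orthologs.

BUSCO_TOTAL = 446

hybrid_busco = {
    # assembler: (single-copy, duplicated, fragmented, missing)
    "hybridSPAdes": (410, 1, 21, 14),
    "Pilon": (410, 1, 16, 19),
    "Unicycler": (409, 1, 21, 15),
}


def busco_completeness(single, duplicated, fragmented, missing, total=BUSCO_TOTAL):
    """Return (number of complete orthologs, percent complete)."""
    if single + duplicated + fragmented + missing != total:
        raise ValueError("BUSCO categories do not sum to the dataset size")
    complete = single + duplicated
    return complete, round(100 * complete / total, 1)


completeness = {
    name: busco_completeness(*counts) for name, counts in hybrid_busco.items()
}
unicycler_complete, unicycler_percent = completeness["Unicycler"]
```

For the Unicycler assembly this gives 410 complete orthologs (91.9%), which is the completeness figure quoted in the text.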

The Unicycler hybrid assembly encodes 128 tRNA genes, 9 rRNA genes, and 6247 predicted genes. By comparison, the current representative genome contains 118 tRNA genes, 9 rRNA genes, and 6043 predicted genes.

3.5 Discussion

We collected five Canadian C. cayetanensis positive specimens with high parasite load and a relatively large amount of sample for whole genome sequencing. As there are currently no culturing methods for this coccidian parasite, the oocysts were concentrated and semi-purified from the stool matrix using a series of discontinuous density gradients as described by Qvarnstrom et al. (2018) followed by a bleach wash. Through this process, an average of 78.9% of the C. cayetanensis DNA (measured by the 18S rDNA concentration using the Murphy et al. (2018) qPCR assay) was lost. However, an average of ~99% of reads were mapped to the C. cayetanensis genome, which shows that despite the loss of oocysts through the purification process, the oocysts were highly purified using inexpensive equipment. A limited amount of DNA was available for whole genome sequencing after the purification procedure, leaving two routes for sequencing the genome: (1) performing a whole genome amplification method to increase the amount of DNA available, or (2) using library preparation kits with low starting material. The advantage to performing a whole genome amplification method prior to DNA sequencing is having sufficient DNA for multiple sequencing runs. However, the extra amplification introduces sequencing errors. On the other hand, not amplifying the DNA limits the possible number of sequencing runs, but reduces the sequencing errors introduced.


Another important factor is the high quantity of DNA required for MinION library preparation kits. To generate long reads whose length depends on the input fragment length, MinION library preparation kits require at least 400 ng of DNA; therefore, a whole genome amplification step must be implemented. We attempted to amplify the genome using two approaches, the Qiagen Repli-G kit and custom-made selective whole genome amplification methods; however, these methods failed due to low genome coverage and poor C. cayetanensis-specific amplification (see Appendix 3). Due to the difficulties associated with whole genome amplification methods, we proceeded with two library preparation kits that require low starting input material. The Nextera DNA Flex kit uses bead-linked transposome chemistry, which simultaneously fragments the DNA into even insert sizes (~350 bp), normalizes the input DNA, and adds sequencing adaptors to the ends of the fragments. We chose this bead-based library preparation method because it allows fragmentation to occur in a non-biased fashion, providing even coverage across the genome despite the limited amount of input DNA. We chose the Rapid PCR Barcoding Kit as the library preparation method for the MinION because it uses transposase-based chemistry to add adapters to the ends of DNA fragments; the adapters then serve as primer sites to amplify the DNA and provide enough material for sequencing. However, this chemistry limits the fragment length to approximately 2-3 kbp and introduces a PCR bias to the sequenced material. In this study, we observed an average read length of 3.3 kbp. We assessed multiple whole genome assemblers in this study as each program utilizes different algorithms to assemble sequencing datasets and ultimately resolve repeat regions and errors within the sequencing data.
It has been shown that the performance of different assemblers varies depending on the dataset, as the problems encountered with de novo genome assembly (mainly repetitive regions) have yet to be fully resolved (Earl et al., 2011; Salzberg et al., 2012). For our short-read only dataset, SPAdes produced the most contiguous and complete genomes. On average, the five short-read assemblies consisted of 1081 contigs measuring ~44.1 Mbp in length with an N50 value of 165851. These assemblies are similar to the 37 whole genome assemblies currently available in GenBank, each of which was assembled using SPAdes; those assemblies had an average of 1675 contigs and an N50 value of 121520 bp.


Unicycler produced the most contiguous and complete hybrid assembly. This hybrid reference genome is an improvement over the currently available reference assembly; however, further sequencing efforts are required to close the genome. Since the genome of C. cayetanensis contains many repeat-rich regions (Liu et al., 2016), reads longer than those produced in this study are required to resolve these regions. Approaches to accomplish this goal include determining a method to amplify the whole genome of C. cayetanensis in order to sequence longer fragments on the MinION, or obtaining a complete whole genome sequence of the closely related organism Eimeria tenella (see Liu et al., 2016) and using that genome as the reference genome. By resolving the genome to the chromosomal level, further genomic studies can be conducted to gain insights into the biology and genetics of C. cayetanensis. With a well-curated reference genome that provides a scaffold for the repeat-rich regions, whole genome assemblies of other isolates can be generated more easily and accurately. The increasing number of high-quality whole genome assemblies will allow in-depth comparative genomic analyses to be completed to find discriminatory markers that will aid in strain-level characterization.

3.6 Conclusion

This study presents the first five C. cayetanensis whole genome assemblies generated from Canadian isolates, with one of these assemblies representing the first whole genome assembly for this organism that uses both long reads generated by the MinION and short reads generated by the Illumina MiSeq. This hybrid assembly is an improvement over the currently available reference assembly, CcayRef3 (GCF_002999335.1): it measures 44.2 Mbp in length with 297 contigs and an N50 of 654019, and it encodes 128 tRNA genes, 9 rRNA genes, and 6247 predicted genes. This study demonstrates a method to obtain long-read sequences for C. cayetanensis isolates to generate higher quality genomes compared to short-read sequences alone. It is important to keep improving the current reference assemblies available for this coccidian parasite, as complete genomes provide the opportunity to understand both the biology and genetic features of the organism.


CHAPTER 4: GENERAL DISCUSSIONS AND CONCLUSIONS

The research conducted for this thesis highlights the importance of continuing genomic research for C. cayetanensis. In Canada, cyclosporiasis outbreaks were first reported in the mid-1990s (Herwaldt & Beach, 1999) and there have been an increasing number of outbreaks since (Morton et al., 2019). However, the source of infection remains unknown or only suspected for over half of the major outbreaks that occur (Hadjilouka & Tsaltas, 2020), as no genotyping approaches had been used to complement epidemiological investigations until recently (Nascimento et al., 2020). The genetic data collected for this study provide insight into the C. cayetanensis genotypes found within Canadian cyclosporiasis infections, along with improved genome assemblies that add to the genetic information available for this parasite. The results generated in Chapter 2 of this thesis are the first to describe Canadian cyclosporiasis cases genotypically. We generated genotyping data for these isolates using the most recent genotyping tool developed for C. cayetanensis (Nascimento et al., 2020). Of the haplotypes discovered across the eight markers examined, the majority resembled those described from the United States (Nascimento et al., 2020). This is not surprising as Canada and the United States share similar food import suppliers for fresh produce, including Mexico, an endemic region for C. cayetanensis. However, we did identify five novel haplotypes across three markers within our Canadian samples. It is important to note that the specimens studied in that chapter provide a limited genetic view of cyclosporiasis cases in Canada. We collected 160 unpreserved, C. cayetanensis positive specimens from 142 case-patients within a ten-year period (2010-2020), which is far fewer than the 370 laboratory confirmed cases reported in 2020 alone (Public Health Agency of Canada, 2020).
One of the main reasons for the relatively small number of unpreserved samples collected in Canada is that the majority of patients are diagnosed microscopically through an ova and parasite (O&P) examination. As a result, specimens are suspended in SAF fixative for microscopic diagnosis, which interferes with downstream molecular analysis. By switching to a strictly molecular diagnostic method, a larger proportion of specimens could undergo genotyping to provide a more complete picture of which genotypes are dominant during infections. Minimally, a preservative/fixative amenable to both standard O&P microscopic diagnoses and later downstream molecular methods would greatly enhance our ability to type and track outbreaks. Nonetheless, the data we present in Chapter 2 provide valuable insight into the performance of the genotyping tool on Canadian cyclosporiasis cases. Overall, 79.4% of C. cayetanensis specimens successfully clustered using the targeted amplicon NGS assay combined with an ensemble-based clustering approach (Barratt et al., 2019; Nascimento et al., 2020). We found that the two nuclear markers, CDC1 and CDC4, displayed the lowest sequencing success rates of 40.6% and 66.3%, respectively. To enhance this genotyping scheme, additional novel markers should be considered to decrease the amount of missing data for a particular dataset. To do so, whole genome comparisons of diverse geographical isolates should be conducted to identify nuclear regions that have high numbers of SNPs, utilizing the workflow described by Houghton et al. (2020). Ideally, the nuclear markers should be single-copy to minimize heterozygosity for ease of analysis and display high discriminatory power (Houghton et al., 2020). To support whole genome comparative analyses for this parasite, we describe whole genome sequencing data for Canadian C. cayetanensis isolates as discussed in Chapter 3 of this thesis. In that chapter, we described a new method to generate long reads in order to improve on the assembly quality of the current reference genome. Using the Rapid PCR Barcoding Kit allows for the generation of long reads using the Oxford Nanopore MinION sequencer from a limited amount of DNA. Unfortunately, this kit limits the length of DNA sequenced to approximately 2-3 kbp. Nonetheless, we show that the addition of long reads improves the genome assembly for C. cayetanensis.
Compared to the current reference assembly, CcayRef3 (GCF_002999335.1), the number of contigs in the hybrid assembly was reduced to 297 (CcayRef3 had 738 contigs) with an N50 value of 654019 (CcayRef3 N50 of 192560). The completeness of the assemblies was relatively similar, with the hybrid assembly having 91.7% of single-copy apicomplexan orthologs present compared to 91.3% present in the CcayRef3 assembly. Further research is required to generate longer read lengths to further improve the genome, as the hybrid assembly contains contigs that start and/or end in repeat sequences. Another limitation of this study was the small number of unfixed samples that had sufficient numbers of oocysts to work with. One of the main goals in this study was to produce long reads to improve the genome quality. Since the MinION sequencing technology requires >400 ng of DNA to produce sequencing reads with lengths equal to the DNA fragment size, whole genome amplification attempts were made to amplify the 1-2 ng that we routinely obtained following purification of the samples (see Appendix 3). However, our attempts were unsuccessful as the whole genome did not amplify uniformly and a high abundance of chimeric reads was present. Consequently, there was only sufficient DNA remaining in one sample for whole genome sequencing with the Rapid PCR Barcoding Kit. The failure of our whole genome amplification efforts may be partially attributed to the presence of inhibitors within the sample, among other factors. To meet the high DNA input demands of long-read sequencing technologies, further work needs to focus on identifying a method that will amplify the whole genome of C. cayetanensis or identifying methods to reduce the loss of oocysts during the semi-purification and enrichment protocol. The data presented in this study will provide the opportunity to both develop novel markers for subtyping C. cayetanensis and improve whole genome assemblies for this organism. We describe the first five whole genome assemblies of Canadian cyclosporiasis cases. These data will allow comparative genomic analyses to be conducted to identify discriminatory regions. Our data suggest that the currently available NGS targeted amplicon genotyping scheme for C. cayetanensis can supplement epidemiological investigations in identifying sources of infection and linking cyclosporiasis outbreak cases together. Despite this promise, there is significant room for improvement, as we showed that approximately 20% of specimens did not cluster successfully. To improve this genotyping scheme, further nuclear markers must be discovered.
The whole genome assemblies from Canadian cyclosporiasis cases that we generated in this study will provide additional insight into which markers are more discriminatory. Finally, we described a method to generate hybrid genome assemblies using both long-read and short-read sequencing data for this organism, which results in more complete and contiguous assemblies. An improved assembly for C. cayetanensis will provide a more detailed understanding of the genetic structure of this parasite and will facilitate the assembly of whole genomes sequenced in the future. Overall, this research will both complement epidemiological investigations aimed at mitigating future cyclosporiasis outbreaks and allow for further understanding of the genetics and biology of C. cayetanensis.

REFERENCES

Almeria, S., Cinar, H. N., & Dubey, J. P. (2019). Cyclospora cayetanensis and cyclosporiasis: An update. Microorganisms, 7(9). https://doi.org/10.3390/microorganisms7090317

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403-410. https://doi.org/10.1016/S0022-2836(05)80360-2

Antipov, D., Korobeynikov, A., McLean, J. S., & Pevzner, P. A. (2016). hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics (Oxford, England), 32(7), 1009–1015. https://doi.org/10.1093/bioinformatics/btv688

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A. A., Dvorkin, M., Kulikov, A. S., Lesin, V. M., Nikolenko, S. I., Pham, S., Prjibelski, A. D., Pyshkin, A. V., Sirotkin, A. V., Vyahhi, N., Tesler, G., Alekseyev, M. A., & Pevzner, P. A. (2012). SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology : a journal of computational molecular cell biology, 19(5), 455–477. https://doi.org/10.1089/cmb.2012.0021

Barratt, J. L. N., Park, S., Nascimento, F. S., Hofstetter, J., Plucinski, M., Casillas, S., Bradbury, R. S., Arrowood, M. J., Qvarnstrom, Y., & Talundzic, E. (2019). Genotyping genetically heterogeneous Cyclospora cayetanensis infections to complement epidemiological case linkage. Parasitology, 146(10), 1275–1283. https://doi.org/10.1017/S0031182019000581

Barta, J.R. (2001) Coccidiosis. Encyclopedia of Life Sciences, Wiley Online Library, John Wiley & Sons, Ltd., Hoboken, NJ., 8pp. https://doi.org/10.1038/npg.els.0001947.

BCCDC Annual Summary (2019) British Columbia Annual Summary of Reportable Diseases, 2017. British Columbia Center for Disease Control, Vancouver, BC.

Berlin, O. G., Peter, J. B., Gagne, C., Conteas, C. N., and Ash, L. R. (1998). Autofluorescence and the detection of cyclospora oocysts. Emerg. Infect. Dis., 4, 127–128. https://doi.org/10.3201/eid0401.980121

Bhandari, D., Tandukar, S., Parajuli, H., Thapa, P., Chaudhary, P., Shrestha, D., et al. (2015). Cyclospora infection among school children in Kathmandu, Nepal: prevalence and associated risk factors. Trop. Med. Health, 43, 211–216. https://doi.org/10.2149/tmh.2015-25

BioFire (2020). The BioFire® FilmArray® Gastrointestinal (GI) Panel. Retrieved April 1, 2020, from https://www.biofiredx.com/products/the-filmarray-panels/filmarraygi/

Borodovsky, M., & Lomsadze, A. (2011). Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Current protocols in bioinformatics, Chapter 4, Unit–4.6.10. https://doi.org/10.1002/0471250953.bi0406s35

Brayton, K.A., Lau, A.O.T., Herndon, D.R., Hannick, L., Kappmeyer, L.S., Berens, S.J., et al. (2007) Genome Sequence of Babesia bovis and Comparative Analysis of Apicomplexan Hemoprotozoa. PLoS Path, 3(10):e148. https://doi.org/10.1371/journal.ppat.0030148

Bukhari, Z., McCuin, R. M., Fricker, C. R., & Clancy, J. L. (1998). Immunomagnetic separation of Cryptosporidium parvum from source water samples of various turbidities. Applied and environmental microbiology, 64(11), 4495–4499. https://doi.org/10.1128/AEM.64.11.4495-4499.1998

Buss, S. N., Leber, A., Chapin, K., Fey, P. D., Bankowski, M. J., Jones, M. K., Rogatcheva, M., Kanack, K. J., Bourzac, K. M. (2015). Multicenter evaluation of the BioFire FilmArray gastrointestinal panel for etiologic diagnosis of infectious gastroenteritis. Journal of Clinical Microbiology, 53, 915–925. https://doi.org/10.1128/JCM.02674-14

Cai, X., Fuller, A.L., McDougald, L.R., Zhu, G. (2003) Apicoplast genome of the coccidian Eimeria tenella. Gene, 321(4):39-46. https://doi.org/10.1016/j.gene.2003.08.008

Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J., & Holmes, S. P. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature methods, 13(7), 581–583. https://doi.org/10.1038/nmeth.3869

Casillas, S. M., Bennett, C., Straily, A. (2018). Notes from the Field: Multiple Cyclosporiasis Outbreaks - United States, 2018. Morbidity and Mortality Weekly Report (MMWR), 67(39), 1101-1102. http://dx.doi.org/10.15585/mmwr.mm6739a6

Chacín-Bonilla, L. (2017) Cyclospora cayetanensis. Global Water Pathogen Project, Part 3, Michigan State University, E. Lansing, MI, UNESCO. https://doi.org/10.14321/waterpathogens.32

Chan, P. P., & Lowe, T. M. (2019). tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences. Methods in molecular biology (Clifton, N.J.), 1962, 1–14. https://doi.org/10.1007/978-1-4939-9173-0_1

Chandra, V., Torres, M., and Ortega, Y. R. (2014). Efficacy of wash solutions in recovering Cyclospora cayetanensis, Cryptosporidium parvum, and Toxoplasma gondii from basil. J. Food Prot. 77, 1348–1354. doi: 10.4315/0362-028X.JFP-13-381

Cinar, H. N., Gopinath, G., Jarvis, K., & Murphy, H. R. (2015). The complete mitochondrial genome of the foodborne parasitic pathogen Cyclospora cayetanensis. PLoS ONE, 10(6). https://doi.org/10.1371/journal.pone.0128645

Cinar, H. N., Gopinath, G., Murphy, H. R., Almeria, S., Durigan, M., Choi, D., Jang, A., Kim, E., Kim, R., Choi, S., Lee, J., Shin, Y., Lee, J., Qvarnstrom, Y., Benedict, T. K., Bishop, H. S., & da Silva, A. (2020). Molecular typing of Cyclospora cayetanensis in produce and clinical samples using targeted enrichment of complete mitochondrial genomes and next-generation sequencing. Parasites and Vectors, 13(1). https://doi.org/10.1186/s13071-020-3997-3

Dawson, D. (2005). Foodborne protozoan parasites. International Journal of Food Microbiology, 103, 207-227. https://doi.org/10.1016/j.ijfoodmicro.2004.12.032

de Vries, J., Archibald, J.M. (2018) Plastid genomes. Curr Biol, 28:R336–R337. doi: 10.1016/j.cub.2018.01.027

Dixon, B. R., Bussey, J. M., Parrington, L. J., & Parenteau, M. (2005). Detection of Cyclospora cayetanensis oocysts in human fecal specimens by flow cytometry. Journal of clinical microbiology, 43(5), 2375–2379. https://doi.org/10.1128/JCM.43.5.2375-2379.2005

Dixon, B. R., Mihajilovic, B., Couture, H., Farber, J. M. (2016) Qualitative Risk Assessment: Cyclospora cayetanensis on Fresh Raspberries and Blackberries Imported into Canada. Food Protection Trends, 36(1), 18-32.

Dubey, J. P., Almeria, S., Mowery, J., Fortes, J. (2020) Endogenous Development Cycle of the Human Coccidian Cyclospora cayetanensis. J Parasitology, 106(2), 295-307. https://doi.org/10.1645/20-21

Earl, D., Bradnam, K., St John, J., Darling, A., Lin, D., Fass, J., Yu, H. O., Buffalo, V., Zerbino, D. R., Diekhans, M., Nguyen, N., Ariyaratne, P. N., Sung, W. K., Ning, Z., Haimel, M., Simpson, J. T., Fonseca, N. A., Birol, İ., Docking, T. R., Ho, I. Y., … Paten, B. (2011). Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome research, 21(12), 2224–2241. https://doi.org/10.1101/gr.126599.111

Eberhard, M. L., Ortega, Y. R., Hanes, D. E., Nace, E. K., Quy Do, R., Robl, M. G., Won, K. Y., Gavidia, C., Sass, N. L., Mansfield, K., Gozalo, A., Griffiths, J., Gilman, R., Sterling, C. R., & Arrowood, M. J. (2000). Attempts to establish experimental Cyclospora cayetanensis infection in laboratory animals. Journal of Parasitology, 86(3), 577–582. https://doi.org/10.1645/0022-3395(2000)086[0577:ateecc]2.0.co;2

Garcia, L. S., Arrowood, M., Kokoskin, E., Paltridge, G. P., Pillai, D. R., Procop, G. W., et al. (2017). Laboratory diagnosis of parasites from the gastrointestinal tract. Clin. Microbiol. Rev. 31:e00025–17. https://doi.org/10.1128/CMR.00025-17

Gardner, M.J., Bishop, R., Shah, T., de Villiers, E.P., Carlton, J.M., Hall, N., et al. (2005) Genome sequence of Theileria parva, a bovine pathogen that transforms lymphocytes. Science, 309:134-137. https://doi.org/10.1126/science.1110439

Giangaspero, A., Marangi, M., Koehler, A. V., Papini, R., Normanno, G., Lacasella, V., Lonigro, A., Gasser, R. B. (2015) Molecular detection of Cyclospora in water, soil, vegetables and humans in southern Italy signals a need for improved monitoring by health authorities. International Journal of Food Microbiology, 211, 95-100. https://doi.org/10.1016/j.ijfoodmicro.2015.07.002

Gopinath, G. R., Cinar, H. N., Murphy, H. R., Durigan, M., Almeria, M., Tall, B. D., & Dasilva, A. J. (2018). A hybrid reference-guided de novo assembly approach for generating Cyclospora mitochondrion genomes. Gut Pathogens, 10(1), 15. https://doi.org/10.1186/s13099-018-0242-0

Gould, L. H., Kline, J., Monahan, C., & Vierk, K. (2017). Outbreaks of Disease Associated with Food Imported into the United States, 1996-2014. Emerging infectious diseases, 23(3), 525–528. https://doi.org/10.3201/eid2303.161462

Greiner, S., Lehwark, P., Bock, R. (2019) OrganellarGenomeDRAW (OGDRAW) version 1.3.1: expanded toolkit for the graphical visualization of organellar genomes. Nucleic Acids Research, 47(W1), W59–W64. https://doi.org/10.1093/nar/gkz238

Guo, Y., Roellig, D. M., Li, N., Tang, K., Frace, M., Ortega, Y., Arrowood, M. J., Feng, Y., Qvarnstrom, Y., Wang, L., Moss, D. M., Zhang, L., & Xiao, L. (2016). Multilocus sequence typing tool for Cyclospora cayetanensis. Emerging Infectious Diseases, 22(8), 1464–1467. https://doi.org/10.3201/eid2208.150696

Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics (Oxford, England), 29(8), 1072–1075. https://doi.org/10.1093/bioinformatics/btt086

Hadjilouka, A., & Tsaltas, D. (2020). Cyclospora Cayetanensis—Major Outbreaks from Ready to Eat Fresh Fruits and Vegetables. Foods, 9(11), 1703. http://dx.doi.org/10.3390/foods9111703

Herwaldt, B. L., & Beach, M. J. (1999) The return of Cyclospora in 1997: another outbreak of cyclosporiasis in North America associated with imported raspberries. Cyclospora Working Group. Annals of Internal Medicine, 130(3): 210-20. https://doi.org/10.7326/0003-4819-130-3-199902020-00006

Hikosaka, K., Watanabe, Y., Kobayashi, F., Waki, S., Kita, K., Tanabe, K. (2011) Highly conserved gene arrangement of the mitochondrial genomes of 23 Plasmodium species. Parasitology International, 60(2):175-180. https://doi.org/10.1016/j.parint.2011.02.001

Hitchcock, M. M., Hogan, C. A., Budvytiene, I., and Banaei, N. (2019). Reproducibility of positive results for rare pathogens on the FilmArray GI Panel. Diagn. Microbiol. Infect. Dis. 95, 10–14. https://doi.org/10.1016/j.diagmicrobio.2019.03.013

Hofstetter, J. N., Nascimento, F. S., Park, S., Casillas, S., Herwaldt, B. L., Arrowood, M. J., & Qvarnstrom, Y. (2019). Evaluation of Multilocus Sequence Typing of Cyclospora cayetanensis based on microsatellite markers. Parasite, 26, 3. https://doi.org/10.1051/parasite/2019004

Janouškovec, J., Horak, A., Obornik, M., Lukes, J., Keeling, P.J. (2010) A common red algal origin of the apicomplexan, dinoflagellate, and heterokont plastids. PNAS, 107(24):10949-54. https://doi.org/10.1073/pnas.1003335107

Kairo, A., Fairlamb, A.H., Gobright, E., Nene, V. (1994). A 7.1 kb linear DNA molecule of Theileria parva has scrambled rDNA sequences and open reading frames for mitochondrially encoded proteins. The EMBO journal, 13(4):898–905.

Kissinger, J.C., Gajria, B., Li, L., Paulsen, I.T., Roos, D.S. (2003) ToxoDB: accessing the Toxoplasma gondii genome. Nucleic Acids Research, 31(1):234-236. https://doi.org/10.1093/nar/gkg072

Kitajima, M., Haramoto, E., Iker, B. C., Gerba, C. P. (2014) Occurrence of Cryptosporidium, Giardia, and Cyclospora in influent and effluent water at wastewater treatment plants in Arizona. Science of the Total Environment, 484, 129-136. https://doi.org/10.1016/j.scitotenv.2014.03.036

Kolmogorov, M., Yuan, J., Lin, Y., and Pevzner, P. (2019) Assembly of Long Error-Prone Reads Using Repeat Graphs. Nature Biotechnology, 37, 540-546. https://doi.org/10.1038/s41587-019-0072-8

Koren, S., Walenz, B.P., Berlin, K., Miller, J.R., Phillippy, A.M. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research, 27, 722-736. https://doi.org/10.1101/gr.215087.116

Kozak, G. K., MacDonald, D., Landry, L., Farber, J. M. (2013) Foodborne outbreaks in Canada linked to produce: 2001 through 2009. Journal of Food Protection, 76, 173-183. https://doi.org/10.4315/0362-028X.JFP-12-126

Lagesen, K., Hallin, P., Rødland, E. A., Staerfeldt, H. H., Rognes, T., & Ussery, D. W. (2007). RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic acids research, 35(9), 3100–3108. https://doi.org/10.1093/nar/gkm160

Lalonde, L. F., and Gajadhar, A. A. (2011). Detection and differentiation of coccidian oocysts by real-time PCR and melting curve analysis. J. Parasitol. 97, 725–730. https://doi.org/10.1645/GE-2706.1

Lalonde, L. F., and Gajadhar, A. A. (2016). Optimization and validation of methods for isolation and real-time PCR identification of protozoan oocysts on leafy green vegetables and berry fruits. Food Waterborne Parasitol. 2, 1–7. https://doi.org/10.1016/j.fawpar.2015.12.002

Li, H. (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094–3100. https://doi.org/10.1093/bioinformatics/bty191

Li, J., Chang, Y., Shi, K. E., Wang, R., Fu, K., Li, S., et al. (2017). Multilocus sequence typing and clonal population genetic structure of Cyclospora cayetanensis in humans. Parasitology 144, 1890–1897. doi: 10.1017/S0031182017001299

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics (Oxford, England), 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., & 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England), 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352

Lin, R.Q., Qiu, L.L., Liu, G.H., Wu, X.Y., Weng, Y.B., Xie, W.Q., Hou, J., Pan, H., Yuan, Z.G., Zou, F.C., Hu, M., Zhu, X.Q. (2011) Characterization of the complete mitochondrial genomes of five Eimeria species from domestic chickens. Gene, 480(1-2):28-33. https://doi.org/10.1016/j.gene.2011.03.004

Liu, S., Wang, L., Zheng, H., Xu, Z., Roellig, D. M., Li, N., Frace, M. A., Tang, K., Arrowood, M. J., Moss, D. M., Zhang, L., Feng, Y., & Xiao, L. (2016). Comparative genomics reveals Cyclospora cayetanensis possesses coccidia-like metabolism and invasion components but unique surface antigens. BMC Genomics, 17(1), 316. https://doi.org/10.1186/s12864-016-2632-3

Mansfield, L. S. & Gajadhar, A. A. (2004). Cyclospora cayetanensis, a food- and waterborne coccidian parasite. Veterinary Parasitology, 126(1-2), 73-90. https://doi.org/10.1016/j.vetpar.2004.09.011

Morton, V., Meghnath, K., Gheorghe, M., Fitzgerald-Husek, A., Hobbs, J., Honish, L., David, S. (2019) Use of a case–control study and control bank to investigate an outbreak of locally acquired cyclosporiasis in Canada. Canada Communicable Disease Report, 45(9), 225–229. https://doi.org/10.14745/ccdr.v45i09a01

Mundaca, C. C., Torres-Slimming, P. A., Araujo-Castillo, R. V., Moran, M., Bacon, D. J., Ortega, Y., Gilman, R. H., Blazes, D. L. (2008). Use of PCR to improve diagnostic yield in an outbreak of cyclosporiasis in Lima Peru. Transactions of the Royal Society of Tropical Medicine and Hygiene, 102, 712-717. https://doi.org/10.1016/j.trstmh.2008.03.003

Murphy, H. R., Lee, S., da Silva, A. J. (2017) Evaluation of an Improved U.S. Food and Drug Administration Method for the Detection of Cyclospora cayetanensis in Produce Using Real-Time PCR. Journal of Food Protection, 80(7), 1133-1144. doi: 10.4315/0362-028X.JFP-16-492

Murphy, H. R., Cinar, H. N., Gopinath, G., Noe, K. E., Chatman, L. D., Miranda, N. E., et al. (2018). Interlaboratory validation of an improved method for detection of Cyclospora cayetanensis in produce using a real-time PCR assay. Food Microbiol. 69, 170–178. doi: 10.1016/j.fm.2017.08.008

Mzilahowa, T., McCall, P. J., & Hastings, I. M. (2007). "Sexual" population structure and genetics of the malaria agent P. falciparum. PloS one, 2(7), e613. https://doi.org/10.1371/journal.pone.0000613

Nascimento, F. S., Wei-Pridgeon, Y., Arrowood, M. J., Moss, D., da Silva, A. J. (2016) Evaluation of library preparation methods for Illumina next generation sequencing of small amounts of DNA from foodborne parasites. Journal of Microbiological Methods, 130, 23-26. https://doi.org/10.1016/j.mimet.2016.08.020

Nascimento, F. S., Barta, J. R., Whale, J., Hofstetter, J. N., Casillas, S., Barratt, J., Talundzic, E., Arrowood, M. J., & Qvarnstrom, Y. (2019). Mitochondrial Junction Region as Genotyping Marker for Cyclospora cayetanensis. Emerging infectious diseases, 25(7), 1314–1319. https://doi.org/10.3201/eid2507.181447

Nascimento, F. S., Barratt, J., Houghton, K., Plucinski, M., Kelley, J., Casillas, S., Bennett, C. C., Snider, C., Tuladhar, R., Zhang, J., Clemons, B., Madison-Antenucci, S., Russell, A., Cebelinski, E., Haan, J., Robinson, T., Arrowood, M. J., Talundzic, E., Bradbury, R. S., & Qvarnstrom, Y. (2020). Evaluation of an ensemble-based distance statistic for clustering MLST datasets using epidemiologically defined clusters of cyclosporiasis. Epidemiology and infection, 148, e172. https://doi.org/10.1017/S0950268820001697

Ogedengbe, M.E., El-Sherry, S., Whale, J., Barta, J. (2014). Complete mitochondrial genome sequences from five Eimeria species (Apicomplexa; Coccidia; Eimeriidae) infecting domestic turkeys. Parasites Vectors, 7(335). https://doi.org/10.1186/1756-3305-7-335

Ogedengbe, M. E., Qvarnstrom, Y., da Silva, A. J., Arrowood, M. J., & Barta, J. R. (2015). A linear mitochondrial genome of Cyclospora cayetanensis (Eimeriidae, Eucoccidiorida, Coccidiasina, Apicomplexa) suggests the ancestral start position within mitochondrial genomes of eimeriid coccidia. International Journal for Parasitology, 45(6), 361–365. https://doi.org/10.1016/j.ijpara.2015.02.006

Ortega, Y. R., Sterling, C. R., Gilman, R. H., Cama, V. A., Díaz, F. (1993). Cyclospora species - A new protozoan pathogen of humans. The New England Journal of Medicine, 328(18), 1308-1312. https://doi.org/10.1056/NEJM199305063281804

Ortega, Y. R, Gilman, R. H., & Sterling, C. R. (1994). A New Coccidian Parasite (Apicomplexa: Eimeriidae) from Humans. The Journal of Parasitology, 80(4), 625-629. https://www.jstor.org/stable/3283201

Ortega, Y. R., & Sanchez, R. (2010). Update on Cyclospora cayetanensis, a food-borne and waterborne parasite. Clinical Microbiology Reviews, 23(1), 218–234. https://doi.org/10.1128/CMR.00026-09

Pape, J. W., Verdier, R. I., Boncy, M., Boncy, J., and Johnson, W. D. Jr. (1994). Cyclospora infection in adults infected with HIV. Clinical manifestations, treatment, and prophylaxis. Ann. Intern. Med. 121, 654–657. https://doi.org/10.7326/0003-4819-121-9-199411010-00004

Peng, Y., Leung, H. C. M., Yiu, S. M., Chin, F. Y. L. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics, 28(11), 1420-1428. https://doi.org/10.1093/bioinformatics/bts174

Qvarnstrom, Y., Wei-Pridgeon, Y., Li, W., Nascimento, F. S., Bishop, H. S., Herwaldt, B. L., Moss, D. M., Nayak, V., Srinivasamoorthy, G., Sheth, M., & Arrowood, M. J. (2015). Draft Genome Sequences from Cyclospora cayetanensis Oocysts Purified from a Human Stool Sample. Genome announcements, 3(6), e01324-15. https://doi.org/10.1128/genomeA.01324-15

Qvarnstrom, Y., Wei-Pridgeon, Y., van Roey, E., Park, S., Srinivasamoorthy, G., Nascimento, F. S., Moss, D. M., Talundzic, E., & Arrowood, M. J. (2018). Purification of Cyclospora cayetanensis oocysts obtained from human stool specimens for whole genome sequencing. Gut Pathogens, 10(1). https://doi.org/10.1186/s13099-018-0272-7

Robertson, L.J., Gjerde, B., Campbell, A.T. (2000) Isolation of Cyclospora oocysts from fruits and vegetables using lectin-coated paramagnetic beads. J Food Prot, 63(10), 1410-4. https://doi.org/10.4315/0362-028x-63.10.1410

Ryan, U., Paparini, A., Oskam, C. (2017). New Technologies for Detection of Enteric Parasites. Trends in Parasitology, 33(7), 532-546. https://doi.org/10.1016/j.pt.2017.03.005

Salomaki, E.D., Kolisko, M. (2019). There Is Treasure Everywhere: Reductive Plastid Evolution in Apicomplexa in Light of Their Close Relatives. Biomolecules, 9(8):378. https://doi.org/10.3390/biom9080378

Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., Treangen, T. J., Schatz, M. C., Delcher, A. L., Roberts, M., Marçais, G., Pop, M., & Yorke, J. A. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3), 557–567. https://doi.org/10.1101/gr.131383.111

Sathyanarayanan, L., & Ortega, Y. (2004). Effects of Pesticides on Sporulation of Cyclospora cayetanensis and Viability of Cryptosporidium parvum. Journal of Food Protection, 67(5), 1044-1049. https://doi.org/10.4315/0362-028x-67.5.1044

Sathyanarayanan, L., & Ortega, Y. (2006). Effects of Temperature and Different Food Matrices on Cyclospora cayetanensis Oocyst. The Journal of Parasitology, 92(2), 218-222. https://doi.org/10.1645/GE-630R.1

Schreeg, M.E., Marr, H.S., Tarigo, J.L., Cohn, L.A., Bird, D.M., Scholl, E.H., Levy, M.G., Wiegmann, B.M., Birkenheuer, A.J. (2016). Mitochondrial Genome Sequences and Structures Aid in the Resolution of Piroplasmida phylogeny. PloS one, 11(11):e0165702. https://doi.org/10.1371/journal.pone.0165702

Seeber, F., Feagin, J.E., Parsons, M. (2014) Chapter 9 - The Apicoplast and Mitochondrion of Toxoplasma gondii. The Model Apicomplexan - Perspectives and Methods: Toxoplasma gondii (Second Edition), pp 297-350. https://doi.org/10.1016/B978-0-12-396481-6.00009-X

Seppey, M., Manni, M., Zdobnov, E.M. (2019) BUSCO: Assessing Genome Assembly and Annotation Completeness. Kollmar M. (eds) Gene Prediction, Methods in Molecular Biology, 1962. https://doi.org/10.1007/978-1-4939-9173-0_14

Shapiro, K., Kim, M., Rajal, V. B., Arrowood, M. J., Packham, A., Aguilar, B., et al. (2019). Simultaneous detection of four protozoan parasites on leafy greens using a novel multiplex PCR assay. Food Microbiology, 84:103252. https://doi.org/10.1016/j.fm.2019.103252

Sherchand, J. B., Cross, J. H., Jimba, M., Sherchand, S., Shrestha, M. P. (1999). Study of Cyclospora cayetanensis in health care facilities, sewage water and green leafy vegetables in Nepal. Southeast Asian Journal of Tropical Medicine and Public Health, 30(1), 58-63.

Shields, J. M., & Olson, B. H. (2003). Cyclospora cayetanensis: A review of an emerging parasitic coccidian. International Journal for Parasitology, 33(4), 371-391. https://doi.org/10.1016/S0020-7519(02)00268-0

Shields, J. M., Lee, M. M., and Murphy, H. R. (2012). Use of a common laboratory glassware detergent improves recovery of Cryptosporidium parvum and Cyclospora cayetanensis from lettuce, herbs and raspberries. International Journal Food Microbiology, 153, 123–128. https://doi.org/10.1016/j.ijfoodmicro.2011.10.025

Shirley, M.W. & Harvey, D.A. (1996) Eimeria tenella: infection with a single sporocyst gives a clonal population. Parasitology, 112(Pt 6), 523–528. https://doi.org/10.1017/s0031182000066099.

Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J., & Birol, I. (2009). ABySS: a parallel assembler for short read sequence data. Genome research, 19(6), 1117–1123. https://doi.org/10.1101/gr.089532.108

Smilack, J. D. (1999) Trimethoprim-Sulfamethoxazole. Mayo Clinic Proceedings, 74(7), 730-734. https://doi.org/10.4065/74.7.730

Smith, H. V., Paton, C. A., Mitambo, M. M., & Girdwood, R. W. (1997). Sporulation of Cyclospora sp. oocysts. Applied and environmental microbiology, 63(4), 1631–1632. https://doi.org/10.1128/AEM.63.4.1631-1632.1997

Srinivasan, M., Sedmak, D., & Jewell, S. (2002). Effect of fixatives and tissue processing on the content and integrity of nucleic acids. The American journal of pathology, 161(6), 1961–1971. https://doi.org/10.1016/S0002-9440(10)64472-0

Steele, M., Unger, S., and Odumeru, J. (2003). Sensitivity of PCR detection of Cyclospora cayetanensis in raspberries, basil, and mesclun lettuce. Journal of Microbiology Methods, 54, 277–280. doi: 10.1016/S0167-7012(03)00036-8

Strausbaugh, L. J. & Herwaldt, B. L. (2000) Cyclospora cayetanensis: A Review, Focusing on the Outbreaks of Cyclosporiasis in the 1990s. Clinical Infectious Diseases, 31(4), 1040–1057. https://doi.org/10.1086/314051

Sun, T., Illardi, C. F., Asnis, D., Bresciani, A. R., Goldenberg, S., Roberts, B., Techberg, S. (1996). Light and electron microscopic identification of Cyclospora species in the small intestine. Evidence of the presence of asexual life cycle in human host. American Journal of Clinical Pathology, 105, 216-220. https://doi.org/10.1093/ajcp/105.2.216

Tang, K., Guo, Y., Zhang, L., Rowe, L. A., Roellig, D. M., Frace, M. A., Li, N., Liu, S., Feng, Y., & Xiao, L. (2015). Genetic similarities between Cyclospora cayetanensis and cecum-infecting avian Eimeria spp. in apicoplast and mitochondrial genomes. Parasites and Vectors, 8(1). https://doi.org/10.1186/s13071-015-0966-3

Temesgen, T. T., Tysnes, K. R., & Robertson, L. J. (2019). A new protocol for molecular detection of Cyclospora cayetanensis as contaminants of berry fruits. Frontiers in Microbiology, 10(1939). https://doi.org/10.3389/fmicb.2019.01939

Torgerson, P. R., Devleesschauwer, B., Praet, N., Speybroeck, N., Willingham, A. L., Kasuga, F., Rokni, M. B., Zhou, X. N., Fèvre, E. M., Sripa, B., Gargouri, N., Fürst, T., Budke, C. M., Carabin, H., Kirk, M. D., Angulo, F. J., Havelaar, A., & de Silva, N. (2015). World Health Organization Estimates of the Global and Regional Disease Burden of 11 Foodborne Parasitic Diseases, 2010: A Data Synthesis. PLoS Medicine, 12(12). https://doi.org/10.1371/journal.pmed.1001920

Tsai, I.J., Otto, T.D. & Berriman, M. (2010) Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol, 11, R41. https://doi.org/10.1186/gb-2010-11-4-r41

Tsang, O. T., Wong, R. W., Lam, B. H., Chan, J. M., Tsang, K. Y., & Leung, W. S. (2013). Cyclospora infection in a young woman with human immunodeficiency virus in Hong Kong: a case report. BMC research notes, 6, 521. https://doi.org/10.1186/1756-0500-6-521

van Belkum, A., Tassios, P. T., Dijkshoorn, L., Haeggman, S., Cookson, B., Fry, N. K., Fussing, V., Green, J., Feil, E., Gerner-Smidt, P., Brisse, S., Struelens, M. (2007) Guidelines for the validation and application of typing methods for use in bacterial epidemiology. Clinical Microbiology and Infection, 13, 1-46. DOI: 10.1111/j.1469-0691.2007.01786.x

Verdier, R. I., Fitzgerald, D. W., Johnson, W. D., & Pape, J. W. (2000). Trimethoprim-Sulfamethoxazole Compared with Ciprofloxacin for Treatment and Prophylaxis of Isospora belli and Cyclospora cayetanensis Infection in HIV-Infected Patients: A Randomized, Controlled Trial. Annals of Internal Medicine, 132(11), 885-888. https://doi.org/10.7326/0003-4819-132-11-200006060-00006

Visvesvara, G. S., Moura, H., Kovacs-Nace, E., Wallace, S., and Eberhard, M. L. (1997). Uniform staining of Cyclospora oocysts in fecal smears by a modified safranin technique with microwave heating. Journal of Clinical Microbiology, 35, 730–733. doi: 10.1128/JCM.35.3.730-733.1997

Walker, B.J., Abeel, T., Shea, T., Priest, M., Abouelliel, A., Sakthikumar, S., et al. (2014) Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS ONE, 9(11): e112963. https://doi.org/10.1371/journal.pone.0112963

Whitfield Y., Johnson K., Hanson H., Huneault D. (2017) 2015 outbreak of cyclosporiasis linked to the consumption of imported sugar snap peas in Ontario, Canada. J. Food Prot, 80, 1666–1669. https://doi.org/10.4315/0362-028X.JFP-17-084.

Wick, R. R., Judd, L. M., Gorrie, C. L., Holt, K. E. (2017) Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Computational Biology, 13(6): e1005595. https://doi.org/10.1371/journal.pcbi.1005595

Wilson, R.J.M., Denny, P.W., Preiser, P.R., Rangachari, K., Roberts, K., Roy, A., Whyte, A., Strath, M., Moore, D.J., Moore, P.W., Williamson, D.H. (1996) Complete gene map of the plastid-like DNA of the malaria parasite Plasmodium falciparum. Journal of Molecular Biology, 261(2):155–172. https://doi.org/10.1006/jmbi.1996.0449

Woo, Y. H., Ansari, H., Otto, T. D., Klinger, C. M., Kolisko, M., Michálek, J., Saxena, A., Shanmugam, D., Tayyrov, A., Veluchamy, A., Ali, S., Bernal, A., del Campo, J., Cihlář, J., Flegontov, P., Gornik, S. G., Hajdušková, E., Horák, A., Janouškovec, J., Katris, N. J., … Pain, A. (2015). Chromerid genomes reveal the evolutionary path from photosynthetic algae to obligate intracellular parasites. eLife, 4, e06974. https://doi.org/10.7554/eLife.06974

Zimin, A. V., Marcais, G., Puiu, D., Roberts, M., Salzberg, S. L., Yorke, J. A. (2013) The MaSuRCA genome assembler. Bioinformatics, 29(21), 2669-2677. https://doi.org/10.1093/bioinformatics/btt476

APPENDICES

Appendix 1. Script for analyzing Cyclospora cayetanensis targeted amplicon sequences

The following includes the commands used to generate amplicon sequencing data from the NGS targeted amplicon scheme for C. cayetanensis. Square brackets [ ] denote parameters that must be replaced before running the commands.

##Make Database of All Amplicons in Genetic Scheme
makeblastdb -in [Path/To/Cyclospora/Amplicon/Database] -dbtype nucl

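##For each sample, separate the markers
#Align the paired reads to the amplicon database with Magic-BLAST
#(-query_mate, -infmt, -splice and -no_unaligned are Magic-BLAST options)
magicblast -db [Path/To/Cyclospora/Amplicon/Database] -query *R1* \
  -query_mate *R2* -infmt fastq -out [Path/To/Output] -outfmt tabular \
  -num_threads [NUM_THREADS] -splice F -no_unaligned
#mblast_out.txt is the tabular output written by the previous command.
#Keep six columns, tidy the header line and append the percentage of the query that aligned
cut -f1,2,3,7,8,16 mblast_out.txt | sed '1d' | sed '1d' | \
  sed 's/# Fields: //' | tr " " "_" | \
  awk -v OFS='\t' 'NR==1 {$7="%_query_aln"; print $0} NR>1 { print $0, ($5-$4)/$6*100 }' \
  > [Path/To/Output/File]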
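The awk step above appends one extra column, the percentage of the query read covered by the alignment, computed as (query end - query start) / query length * 100 from the six retained columns. A minimal Python sketch of the same arithmetic follows; the function names and the example row are hypothetical and are not part of the original pipeline.

```python
def percent_query_aligned(query_start: int, query_end: int, query_length: int) -> float:
    """Mirror the awk expression ($5-$4)/$6*100 used on the six retained columns."""
    return (query_end - query_start) / query_length * 100


def add_percent_column(row: str) -> str:
    """Append the percentage to one tab-delimited data row (columns as produced by cut)."""
    fields = row.rstrip("\n").split("\t")
    start, end, length = int(fields[3]), int(fields[4]), int(fields[5])
    return "\t".join(fields + [str(percent_query_aligned(start, end, length))])


# Hypothetical row: read ID, marker, % identity, query start, query end, query length
example_row = "read_001\tCDC1\t99.5\t11\t111\t200"
print(add_percent_column(example_row))
```

Like the awk expression, this sketch does not add 1 to the difference between the end and start coordinates, so a read aligned over its full length reports slightly less than 100%.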

#Separate out the markers
awk '$2=="CDC1"' [Path/To/Output/File] | cut -f1 | uniq -d > CDC1.txt
awk '$2=="CDC2"' [Path/To/Output/File] | cut -f1 | uniq -d > CDC2.txt
awk '$2=="CDC3"' [Path/To/Output/File] | cut -f1 | uniq -d > CDC3.txt
awk '$2=="CDC4"' [Path/To/Output/File] | cut -f1 | uniq -d > CDC4.txt
awk '$2=="HC378"' [Path/To/Output/File] | cut -f1 | uniq -d > sec14.txt
awk '$2=="HC360i2"' [Path/To/Output/File] | cut -f1 | uniq -d > 360i2.txt
awk '$2=="MSR"' [Path/To/Output/File] | cut -f1 | uniq -d > msr.txt
awk '$2=="MT-J"' [Path/To/Output/File] | cut -f1 | uniq -d > mtj.txt
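The uniq -d calls above keep one copy of each read ID that occurs on two or more adjacent lines, which presumably corresponds to reads with more than one alignment row for that marker (for example, both mates of a pair). A minimal Python sketch of that behaviour, using hypothetical read IDs; adjacent_duplicates is an illustrative name and is not part of the original pipeline.

```python
from itertools import groupby


def adjacent_duplicates(ids):
    """Return one copy of every ID found in a run of two or more adjacent lines (like uniq -d)."""
    return [key for key, run in groupby(ids) if sum(1 for _ in run) > 1]


# Hypothetical read IDs in the order they appear in the filtered table
read_ids = ["read_1", "read_1", "read_2", "read_3", "read_3", "read_3"]
print(adjacent_duplicates(read_ids))
```

As with uniq, only adjacent repeats are detected, so the sketch assumes that rows for the same read sit next to each other in the input.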

#Python script to get the sequencing data for each marker
from Bio import SeqIO
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-f", "--forward_reads", help="forward reads",
                    action="store", dest="forward_file", required=True)
parser.add_argument("-r", "--reverse_reads", help="reverse reads",
                    action="store", dest="reverse_file", required=True)
args = parser.parse_args()
forward_reads = args.forward_file
reverse_reads = args.reverse_file

amps=['CDC1','CDC2','CDC3','CDC4','360i2','msr','mtj','sec14']
amp_reads=[a+'.txt' for a in amps]
f_data_files=[h + '_f.txt' for h in amps]
r_data_files=[h + '_r.txt' for h in amps]

#Map each read ID to the index of the marker it was assigned to
lookup_index={}
for j,filename in enumerate(amp_reads):
    with open(filename, 'r') as fid:
        for line in fid:
            lookup_index[line.strip()]=j

#Write the forward reads of each marker to that marker's file
open_data_files=[open(fd,'w') for fd in f_data_files]
for seq_record in SeqIO.parse(forward_reads, "fastq"):
    if seq_record.id in lookup_index:
        index=lookup_index[seq_record.id]
        seq_record.description=""
        open_data_files[index].write(seq_record.format("fastq"))
for fid in open_data_files:
    fid.close()

#Write the reverse reads of each marker to that marker's file
open_data_files=[open(fd,'w') for fd in r_data_files]
for seq_record in SeqIO.parse(reverse_reads, "fastq"):
    if seq_record.id in lookup_index:
        index=lookup_index[seq_record.id]
        seq_record.description=""
        open_data_files[index].write(seq_record.format("fastq"))
for fid in open_data_files:
    fid.close()

##Once all sequencing data for each marker are concatenated into separate files, perform ASV searches with DADA2 – Rscript
##Script was written following the 'Amplicon Analysis' Happy Belly Bioinformatics tutorial
##(https://astrobiomike.github.io/amplicon/dada2_workflow_ex)
library(dada2)
path <- "[Path/To/Directory/Containing/Reads]"

#Forward and Reverse fastq filenames
fnFs <- sort(list.files(path, pattern="_f.fq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_r.fq", full.names = TRUE))
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)

# Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", paste0(sample.names, "_F_filt.fq"))
filtRs <- file.path(path, "filtered", paste0(sample.names, "_R_filt.fq"))
names(filtFs) <- sample.names
names(filtRs) <- sample.names

#Quality filter the reads
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     maxN=[num], maxEE=[num], truncQ=[num], rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE)

#Learn the error rates and dereplicate the reads
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)
derepF <- derepFastq(filtFs, verbose=TRUE)
derepR <- derepFastq(filtRs, verbose=TRUE)
names(derepF) <- sample.names
names(derepR) <- sample.names

#Infer ASVs, merge the read pairs and remove chimeras
dadaF <- dada(derepF, err=errF, multithread=TRUE, pool="pseudo")
dadaR <- dada(derepR, err=errR, multithread=TRUE, pool="pseudo")
merged_amplicons <- mergePairs(dadaF, derepF, dadaR, derepR)
seqtab <- makeSequenceTable(merged_amplicons)
seqtab.nochim <- removeBimeraDenovo(seqtab, verbose=TRUE)

#Write the ASV sequences to a fasta file
asv_seqs <- colnames(seqtab.nochim)
asv_headers <- vector(dim(seqtab.nochim)[2], mode="character")
for (i in 1:dim(seqtab.nochim)[2]){
  asv_headers[i] <- paste(">ASV", i, sep="_")
}
asv_fasta <- c(rbind(asv_headers,asv_seqs))
write(asv_fasta, "ASVs.fa")

#Assign each ASV to a marker and write the count table
taxa <- assignTaxonomy(seqtab.nochim, "[database_file]", tryRC=TRUE)
write.csv(taxa, "[name].csv")
asv_tab <- t(seqtab.nochim)
row.names(asv_tab) <- asv_headers
write.csv(asv_tab, "[name]_counts.csv")

#Summarize the number of reads retained at each step
getN <- function(x) sum(getUniques(x))
sumtab <- data.frame(row.names=sample.names, dadaf=sapply(dadaF, getN),
                     dadar=sapply(dadaR, getN), merged=sapply(merged_amplicons, getN),
                     nonchim=rowSums(seqtab.nochim))
write.csv(sumtab, "[name]_summary.csv")

##Obtain unique ASVs for all markers, then map the sequencing read data
bwa index [Path/To/Unique/ASVs]
bwa mem -Y -M [Path/To/Unique/ASVs/File] [Path/To/Forward/Reads] \
    [Path/To/Reverse/Reads] > [name].sam
samtools view -F4 -bT [Path/To/Unique/ASVs/File] [name].sam | \
    samtools sort -n - | samtools fixmate -m - - | samtools sort \
    -o [name].sorted.bam
samtools index [name].sorted.bam
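The `-F4` option in the samtools command above excludes alignments whose SAM flag has the "unmapped" bit (0x4) set. The following is a small Python sketch of that flag test on plain-text SAM records; the record strings and the function name are hypothetical, and header lines are simply skipped here for brevity.

```python
UNMAPPED = 0x4  # SAM flag bit meaning "segment unmapped"

def keep_mapped(sam_lines):
    """Return alignment lines whose FLAG field does not have bit 0x4 set."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines in this sketch
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if not flag & UNMAPPED:
            kept.append(line)
    return kept

# Hypothetical, truncated SAM records (QNAME, FLAG, RNAME, POS)
records = [
    "@SQ\tSN:ASV_1\tLN:200",
    "readA\t99\tASV_1\t1",   # paired, mapped
    "readB\t77\t*\t0",       # paired, unmapped (bit 0x4 set)
    "readA\t147\tASV_1\t50", # paired, mapped, reverse strand
]
mapped = keep_mapped(records)
print(len(mapped))
```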

Appendix 2. Preparation of Sheather’s solution for density gradient purification of Cyclospora cayetanensis oocysts

The following method outlines the preparation of Sheather’s solution required to perform density gradient purification on clinical fecal specimens containing C. cayetanensis oocysts. The solutions expire one week after preparation.

1) Stock Sheather’s Solution
   a. Weigh out 250 g sucrose.
   b. In a sterile 1 L beaker, add 160 ml of milli-Q H2O and a sterile stir-bar.
   c. While stirring on gentle heat, add sucrose slowly (~50 g at a time until dissolved). Ensure the solution does not boil.
   d. Once fully dissolved, cool the solution to room temperature.
   e. Once cool, measure the specific gravity using a hydrometer. Adjust the specific gravity accordingly by adding either water or sucrose until the solution measures between 1.240 g/ml and 1.270 g/ml.

2) Prepare a 1:2 dilution (1.103 g/ml density) of Sheather’s/Tween/PBS (STP)
   a. In a sterile 500 ml beaker, measure 75 ml of the stock Sheather’s solution.
   b. Add 150 ml of phosphate-buffered saline (PBS).
   c. Add 2.25 ml of Tween 80.
   d. Thoroughly mix and ensure the solution is at the correct density by measuring with a hydrometer. Store at room temperature.

3) Prepare a 1:4 dilution (1.064 g/ml density) of STP
   a. In a sterile 500 ml beaker, measure 50 ml of the stock Sheather’s solution.
   b. Add 200 ml of PBS.
   c. Add 2.25 ml of Tween 80.
   d. Thoroughly mix and ensure the solution is at the correct density by measuring with a hydrometer. Store at room temperature.
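As a quick arithmetic check of the recipe above, the following Python sketch recomputes the PBS-to-stock ratio, total volume, and approximate Tween 80 percentage (v/v) for each STP dilution. It assumes volumes are simply additive, which is an approximation; the dictionary and function names are hypothetical.

```python
# Volumes in ml, taken from the recipe above
dilutions = {
    "1:2 STP": {"stock": 75.0, "pbs": 150.0, "tween80": 2.25},
    "1:4 STP": {"stock": 50.0, "pbs": 200.0, "tween80": 2.25},
}

def summarize(volumes):
    """Summarize one dilution, assuming volumes add linearly."""
    total = volumes["stock"] + volumes["pbs"] + volumes["tween80"]
    return {
        "pbs_parts_per_stock_part": volumes["pbs"] / volumes["stock"],
        "total_ml": total,
        "tween80_percent_vv": round(100 * volumes["tween80"] / total, 2),
    }

summary = {name: summarize(v) for name, v in dilutions.items()}
for name, values in summary.items():
    print(name, values)
```

Under this assumption the two mixtures contain 2 and 4 parts PBS per part stock, matching the "1:2" and "1:4" labels, and both contain roughly 1% Tween 80.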


Appendix 3. Whole genome amplification study of Canadian Cyclospora cayetanensis specimens

Introduction

The main goal of this study was to amplify the whole genome from semi-purified and enriched C. cayetanensis oocysts. Nascimento et al. (2016) estimated that a standard diagnostic fecal sample contains only picogram amounts of C. cayetanensis DNA. Consequently, even a large sample (~5 g) with a high parasite load yields little DNA (~2-3 ng) after semi-purification and enrichment. This low DNA quantity makes it difficult to perform whole genome sequencing with the majority of library preparation kits available for the Oxford Nanopore MinION, as >400 ng of DNA are required to generate sequencing read lengths equivalent to the fragment size of the DNA obtained. Although our methods failed to amplify the genome of C. cayetanensis, the steps taken during this study are outlined below.

Materials and Methods

We attempted two whole genome amplification methods: (1) the Qiagen Repli-g Mini Kit (Qiagen, Hilden, Germany), which uses random hexamer primers and multiple displacement amplification (MDA), and (2) selective whole genome amplification, which uses primers designed in house specifically for C. cayetanensis and MDA with Phi29 polymerase (New England Biolabs, Ipswich, MA, USA). Genomic DNA was purified using AMPure XP beads (Beckman Coulter, Brea, CA, USA) and quantified using the Qubit™ Fluorometer with the Qubit™ dsDNA High Sensitivity Assay Kit (Invitrogen by Life Technologies, Carlsbad, CA, USA).

Qiagen Repli-g Mini Kit

We used one nanogram of purified genomic DNA as the initial starting material. The manufacturer’s protocol (Amplification of Purified Genomic DNA using the Repli-g Mini Kit) was followed, with no deviations.


Selective Whole Genome Amplification (SWGA)

Primers were designed using the swga (Selective Whole Genome Amplification) program developed by Clarke et al. (2017). Briefly, the current C. cayetanensis reference genome assembly CcayRef3 (GCF_002999335.1) was used as the target (foreground) genome and Homo sapiens GRCh38.p13 (GCF_000001405.39) was used as the background genome. Default parameters were used when running the program, and the primer set (Table A3.1) was chosen as the set with the lowest mean distance between primer binding sites on the foreground genome. All primers were synthesized by Sigma-Aldrich (St. Louis, MO, USA) and were modified to contain two phosphorothioate bonds (represented by *) at the 3’ end to prevent degradation by the exonuclease activity of Phi29 polymerase.

Table A3.1. SWGA primer set for amplifying the genome of Cyclospora cayetanensis

Primer Name      Sequence (5’-3’)
swga_cyc_p1      ATTCGT*C*G
swga_cyc_p2      CGATAG*C*G
swga_cyc_p3      CGGATA*C*G
swga_cyc_p4      CCTAG*A*A
swga_cyc_p5      CGTCTT*C*G
swga_cyc_p6      TTCTA*G*G
swga_cyc_p7      CTTCGT*C*G
swga_cyc_p8      TCGTCG*A*T
swga_cyc_p9      TTTCGT*C*G
swga_cyc_p10     TTTACG*C*G

* represents phosphorothioate bonds
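The selection criterion used for the primer set (the lowest mean distance between primer binding sites on the foreground genome) can be illustrated with a toy calculation. The sketch below is not the swga program's implementation: it searches the forward strand only, requires exact matches, and uses a made-up sequence. The two primers are swga_cyc_p1 and swga_cyc_p2 from Table A3.1 with the phosphorothioate marks removed; all function and variable names are hypothetical.

```python
def binding_sites(genome, primers):
    """Sorted start positions of exact forward-strand matches for any primer."""
    sites = set()
    for primer in primers:
        start = genome.find(primer)
        while start != -1:
            sites.add(start)
            start = genome.find(primer, start + 1)
    return sorted(sites)

def mean_gap(sites):
    """Mean distance between consecutive binding sites (None if fewer than two)."""
    if len(sites) < 2:
        return None
    gaps = [b - a for a, b in zip(sites, sites[1:])]
    return sum(gaps) / len(gaps)

primers = ["ATTCGTCG", "CGATAGCG"]  # p1 and p2 without the * marks
# Made-up foreground sequence built from primer sites and spacers
toy_genome = "ATTCGTCG" + "AAAA" + "CGATAGCG" + "TTTTTT" + "ATTCGTCG" + "CC"

sites = binding_sites(toy_genome, primers)
print(sites, mean_gap(sites))
```

A candidate primer set with a smaller mean gap on the foreground genome places priming sites closer together, which is the property the selection step favours.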

The selective whole genome amplification procedure followed the previously published protocol (Cowell et al., 2017). To summarize, 1X Phi29 Reaction Buffer (New England Biolabs, Ipswich, MA, USA), 4 mM dNTPs (Invitrogen by Life Technologies, Carlsbad, CA, USA), 300 ng/µl bovine serum albumin (BSA) heat shock fraction (Sigma-Aldrich, St. Louis, MO, USA), 0.35 µM of each primer (10 total), and 30 U of Phi29 polymerase (New England Biolabs, Ipswich, MA, USA) were combined in a final reaction volume of 50 µl. Amplification began at 35 °C for 10 min, and the temperature was then ramped down to 30 °C at 10 min per degree Celsius. Amplification proceeded at 30 °C for 16 h, after which the polymerase was heat-inactivated at 65 °C for 10 min. The reaction was then kept at 4 °C for two days prior to library preparation.
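The temperature profile described above can be written out as an explicit list of steps. The sketch below assumes that "10 min intervals per degree Celsius" means each intermediate temperature (34, 33, 32, and 31 °C) was held for 10 min; that reading is an interpretation of the text, not a setting confirmed by the protocol, and the function name is hypothetical. The final 4 °C hold is omitted because it has no fixed duration within the program.

```python
def swga_program():
    """Return (temperature in Celsius, minutes) steps for the SWGA reaction."""
    # Step-down phase: 35, 34, 33, 32, 31 C for 10 min each (assumed reading)
    steps = [(temp, 10) for temp in range(35, 30, -1)]
    steps.append((30, 16 * 60))  # main amplification: 30 C for 16 h
    steps.append((65, 10))       # heat inactivation of the polymerase
    return steps

steps = swga_program()
total_minutes = sum(minutes for _, minutes in steps)
for temp, minutes in steps:
    print(f"{temp} C for {minutes} min")
print(f"Total: {total_minutes / 60} h")
```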

Sequencing and Data Analysis

The resulting whole genome amplification products (Repli-g and SWGA) were purified using AMPure XP beads (Beckman Coulter, Brea, CA, USA) following the manufacturer’s protocol, with no deviations. The purified products were subjected to the qPCR assay developed by Murphy et al. (2018) to estimate the increase in copy number of the 18S rDNA gene pre- and post-amplification. DNA concentration was estimated with the Qubit™ Fluorometer using the Qubit™ dsDNA Broad Range Assay Kit (Invitrogen by Life Technologies, Carlsbad, CA, USA). Both Repli-g and SWGA products were sequenced on the Illumina MiSeq, following the protocol described in Section 3.3.5 of this thesis. The Repli-g product was also sequenced on the Oxford Nanopore MinION using the Ligation Sequencing Kit (SQK-LSK109, Oxford Nanopore Technologies, Oxford, UK), following the manufacturer’s instructions with no deviations. The library was sequenced on an R9.4.1 flow cell for 48 hours. The initial quality of the demultiplexed Illumina sequencing reads was evaluated using FastQC (version 0.11.9) before the reads were quality filtered with BBDuk (version 37.62) from the BBTools suite. Reads were trimmed on both ends to remove bases with Phred scores less than 20, and reads shorter than 50 bp were removed. The trimmed and filtered reads were mapped to the C. cayetanensis reference genome CcayRef3 (GCF_002999335.1), and mapping statistics were calculated using SAMtools (Li et al., 2009) and graphed using Python v3.6.
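To make the trimming and length-filtering criteria concrete, the following Python sketch removes bases with Phred scores below 20 from both ends of a read and discards reads that end up shorter than 50 bp. It is a simplified stand-in, not BBDuk's actual trimming algorithm, and the function name and example reads are hypothetical.

```python
def trim_read(seq, quals, min_q=20, min_len=50):
    """Trim low-quality bases from both ends; return None if the read is too short."""
    start, end = 0, len(seq)
    # Advance past low-quality bases at the 5' end
    while start < end and quals[start] < min_q:
        start += 1
    # Walk back past low-quality bases at the 3' end
    while end > start and quals[end - 1] < min_q:
        end -= 1
    if end - start < min_len:
        return None  # read discarded by the length filter
    return seq[start:end], quals[start:end]

# Hypothetical reads: 60 bases each, with per-base Phred scores
good = trim_read("A" * 60, [10] * 3 + [30] * 55 + [5] * 2)
short = trim_read("A" * 60, [10] * 20 + [30] * 40)
print(len(good[0]), short)
```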


Results

We found that the Repli-g kit increased the amount of DNA more (3650X) than the SWGA method (1415X). Both methods produced a similar fold increase in 18S rDNA gene copy number (Table A3.2).

Table A3.2. Comparison of DNA quantity before and after whole genome amplification

Amplification    DNA amount (ng)                  18S rDNA copies (copy # / µl*)
Method           Before           After           Before           After
                 amplification    amplification   amplification    amplification
Repli-g          1                3650            4.3 x 10^4       2.2 x 10^6
SWGA             1                1415            1.9 x 10^6       8.7 x 10^7

*In a 50 µl reaction
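The fold increases implied by Table A3.2 can be recomputed directly from the before and after values. The short Python sketch below does this; the dictionary layout and names are hypothetical, and the numbers are copied from the table.

```python
# Values from Table A3.2: (before, after) amplification
table_a3_2 = {
    "Repli-g": {"dna_ng": (1, 3650), "copies_per_ul": (4.3e4, 2.2e6)},
    "SWGA": {"dna_ng": (1, 1415), "copies_per_ul": (1.9e6, 8.7e7)},
}

def fold_increase(before, after):
    """Ratio of the post-amplification value to the pre-amplification value."""
    return after / before

folds = {
    method: {name: fold_increase(*pair) for name, pair in values.items()}
    for method, values in table_a3_2.items()
}
for method, values in folds.items():
    print(method,
          f"DNA: {values['dna_ng']:.0f}X,",
          f"18S rDNA: {values['copies_per_ul']:.0f}X")
```

Both methods give an 18S rDNA increase of roughly 50-fold (about 51X and 46X), while total DNA increases by more than 1000-fold in each case.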

The genome coverage of the whole genome amplification products sequenced on the Illumina MiSeq was uneven (Figure A1). In comparison to the unamplified gDNA (see Chapter 3) (Figure A1a), the Repli-g WGA product (Figure A1b) displayed very low genome coverage, with high read depth in select regions of the contig. The SWGA product (Figure A1c) had greater contig coverage than the Repli-g product; however, many regions of the contig had no sequencing coverage.


Figure A1. Read coverage for contig_1 (GCF_002999335.1) of Cyclospora cayetanensis. A Canadian C. cayetanensis specimen was sequenced using the Nextera DNA Flex Kit on the Illumina MiSeq with (a) unamplified gDNA (see Chapter 3), (b) Repli-g whole genome amplification, and (c) selective whole genome amplification.

Discussion

Neither whole genome amplification method was successful: both achieved insufficient genome coverage, despite increasing the amount of DNA >1000-fold and the C. cayetanensis 18S rDNA gene copy number ~50-fold. One possible reason the genome did not amplify as expected is that insufficient starting material was available. In the selective whole genome amplification experiments previously performed on the apicomplexan Plasmodium, starting DNA amounts ranged from a minimum of 5 ng (Oyola et al., 2016; Benavente et al., 2019; Ibrahim et al., 2020) to a maximum of 70 ng (Cowell et al., 2017). Starting amounts below 5 ng have not been tested for the selective whole genome amplification procedure; a starting amount of 1 ng of purified DNA could therefore have contributed to the whole genome amplification failure. It is important to note that after semi-purifying and enriching C. cayetanensis oocysts from clinical specimens, we would often obtain only 2-3 ng of DNA, far below the minimum input used in the studies noted above.

Another possible explanation for these results is insufficient DNA quality or the presence of inhibitors in the DNA samples. The Qiagen Repli-g Mini handbook states that the DNA input should only be 1-10 ng if it is of sufficient quality. The long-read sequencing data from the Repli-g whole genome amplification product showed that the majority of reads were chimeras (data not shown). Previous research has concluded that chimeric reads are common in sequencing data from multiple displacement amplification products due to their highly branched structure and replication mechanism (Lasken & Stockwell, 2007; Tu et al., 2015). Bioinformatics tools such as ChimeraMiner (Lu et al., 2019) can identify such chimeras in long-read MDA products, so it is still possible to generate whole genome assemblies from long-read sequencing data of MDA products if sufficient genome coverage is acquired.

Due to the limited amounts of C. cayetanensis DNA present in clinical fecal samples, and the lack of a culturing method, a whole genome amplification procedure is usually needed to generate sufficient DNA for long-read sequencing. We found that the selective whole genome amplification products appeared to have greater genome coverage than the products of the general whole genome amplification procedure performed with the Qiagen Repli-g kit. However, further work is required to complete the C. cayetanensis genome so that a primer set that binds evenly across the genome can be determined more accurately. This is challenging as the C. cayetanensis genome contains many repeat-rich regions (Liu et al., 2016). Other standard whole genome amplification kits should be assessed to see if the C. cayetanensis genome can be amplified evenly with them. Continuing research toward a method that amplifies the whole genome of C. cayetanensis evenly will be critical to completing the parasite’s genome.


References

Benavente, E. D., Gomes, A. R., De Silva, J. R., Grigg, M., Walker, H., Barber, B. E., William, T., Yeo, T. W., de Sessions, P. F., Ramaprasad, A., Ibrahim, A., Charleston, J., Hibberd, M. L., Pain, A., Moon, R. W., Auburn, S., Ling, L. Y., Anstey, N. M., Clark, T. G., & Campino, S. (2019). Whole genome sequencing of amplified Plasmodium knowlesi DNA from unprocessed blood reveals genetic exchange events between Malaysian Peninsular and Borneo subpopulations. Scientific Reports, 9(1), 9873. https://doi.org/10.1038/s41598-019-46398-z

Clarke, E. L., Sundararaman, S. A., Seifert, S. N., Bushman, F. D., Hahn, B. H., & Brisson, D. (2017). swga: a primer design toolkit for selective whole genome amplification. Bioinformatics (Oxford, England), 33(14), 2071–2077. https://doi.org/10.1093/bioinformatics/btx118

Cowell, A. N., Loy, D. E., Sundararaman, S. A., Valdivia, H., Fisch, K., Lescano, A. G., Baldeviano, G. C., Durand, S., Gerbasi, V., Sutherland, C. J., Nolder, D., Vinetz, J. M., Hahn, B. H., & Winzeler, E. A. (2017). Selective Whole-Genome Amplification Is a Robust Method That Enables Scalable Whole-Genome Sequencing of Plasmodium vivax from Unprocessed Clinical Samples. mBio, 8(1), e02257-16. https://doi.org/10.1128/mBio.02257-16

Ibrahim, A., Diez Benavente, E., Nolder, D., et al. (2020). Selective whole genome amplification of Plasmodium malariae DNA from clinical samples reveals insights into population structure. Scientific Reports, 10, 10832. https://doi.org/10.1038/s41598-020-67568-4

Lasken, R. S., & Stockwell, T. B. (2007). Mechanism of chimera formation during the Multiple Displacement Amplification reaction. BMC biotechnology, 7, 19. https://doi.org/10.1186/1472-6750-7-19

Liu, S., Wang, L., Zheng, H., Xu, Z., Roellig, D. M., Li, N., Frace, M. A., Tang, K., Arrowood, M. J., Moss, D. M., Zhang, L., Feng, Y., & Xiao, L. (2016). Comparative genomics reveals Cyclospora cayetanensis possesses coccidia-like metabolism and invasion components but unique surface antigens. BMC Genomics, 17(1), 316. https://doi.org/10.1186/s12864-016-2632-3

Lu, N., Li, J., Bi, C., Guo, J., Tao, Y., Luan, K., Tu, J., & Lu, Z. (2019). ChimeraMiner: An Improved Chimeric Read Detection Pipeline and Its Application in Single Cell Sequencing. International journal of molecular sciences, 20(8), 1953. https://doi.org/10.3390/ijms20081953

Murphy, H. R., Cinar, H. N., Gopinath, G., Noe, K. E., Chatman, L. D., Miranda, N. E., et al. (2018). Interlaboratory validation of an improved method for detection of Cyclospora cayetanensis in produce using a real-time PCR assay. Food Microbiology, 69, 170–178. https://doi.org/10.1016/j.fm.2017.08.008


Oyola, S. O., Ariani, C. V., Hamilton, W. L., Kekre, M., Amenga-Etego, L. N., Ghansah, A., Rutledge, G. G., Redmond, S., Manske, M., Jyothi, D., Jacob, C. G., Otto, T. D., Rockett, K., Newbold, C. I., Berriman, M., & Kwiatkowski, D. P. (2016). Whole genome sequencing of Plasmodium falciparum from dried blood spots using selective whole genome amplification. Malaria journal, 15(1), 597. https://doi.org/10.1186/s12936-016-1641-7

Tu, J., Guo, J., Li, J., Gao, S., Yao, B., & Lu, Z. (2015). Systematic Characteristic Exploration of the Chimeras Generated in Multiple Displacement Amplification through Next Generation Sequencing Data Reanalysis. PloS one, 10(10), e0139857. https://doi.org/10.1371/journal.pone.0139857


Appendix 4. Script for generating whole genome assemblies for Cyclospora cayetanensis

The following are the commands used to generate whole genome assemblies for C. cayetanensis (short-read and hybrid). Note that square brackets [ ] indicate parameters that need to be changed prior to running the commands.

##Short-Read Assemblies (Illumina MiSeq Data)

##1) Check quality with FastQC
fastqc [Path/To/Forward/Fastq/Reads] [Path/To/Reverse/Fastq/Reads] \
    -f fastq -o [Output/Path]

##2) Trimming and Quality Filtering with BBTools
bbduk.sh -Xmx1g \
    in=[Path/To/Forward/Fastq/Reads] in2=[Path/To/Reverse/Fastq/Reads] \
    out1=[Path/To/Output/Trimmed/Forward/Reads] \
    out2=[Path/To/Output/Trimmed/Reverse/Reads] \
    qtrim=rl trimq=20 minlength=50

##3) Commands for Generating Short Read Assemblies
##ABySS
abyss-pe name=[UniqueName] k=[k-mer size] \
    in='[Path/To/Trimmed/Forward/Fastq/Reads] [Path/To/Trimmed/Reverse/Fastq/Reads]'

##IDBA_UD
idba_ud -r [Path/To/Fasta/Reads] -o [Path/To/Output/Directory]

##MaSuRCA
masurca [configFile]
./assemble.sh

##SPAdes
spades.py --careful -1 [Path/To/Trimmed/Forward/Reads] \
    -2 [Path/To/Trimmed/Reverse/Reads] -t [num_threads] \
    -o [Path/To/Output/Folder]

##4) BLAST contigs created for each assembly and remove contigs that are not Cyclospora
makeblastdb -in [Path/To/Cyclospora/Ref/Genome] -dbtype 'nucl'
blastn -outfmt '6 qseqid sseqid qlen slen qstart qend sstart send length mismatch pident qcovhsp qcovs sstrand evalue bitscore' \
    -db [Path/To/Cyclospora/Ref/Genome] -query [Path/To/Denovo/Assembly/.fa] \
    -qcov_hsp_perc 80 -max_hsps 1 -num_alignments 1 -perc_identity 90 \
    > [name].txt
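The step of removing contigs that are not Cyclospora is not spelled out in the commands above. One way it could be done, shown here as a hedged sketch rather than the method actually used, is to collect the query IDs (qseqid) that appear in the tabular BLAST output and keep only those contigs from the assembly FASTA. The column order follows the -outfmt string above; the function names and example data are hypothetical.

```python
COLUMNS = ("qseqid sseqid qlen slen qstart qend sstart send length "
           "mismatch pident qcovhsp qcovs sstrand evalue bitscore").split()

def hit_contig_ids(blast_lines):
    """Collect the query (contig) IDs that have a hit in the tabular output."""
    ids = set()
    for line in blast_lines:
        if not line.strip():
            continue
        row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
        ids.add(row["qseqid"])
    return ids

def keep_contigs(fasta_text, wanted):
    """Return only the FASTA records whose ID is in `wanted`."""
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()
            keep = bool(name) and name[0] in wanted
        if keep:
            kept.append(line)
    return "\n".join(kept)

# Hypothetical BLAST row and assembly
blast_rows = ["contig_1\tref_A\t4\t100\t1\t4\t10\t13\t4\t0\t100.0\t100\t100\tplus\t1e-5\t50"]
assembly = ">contig_1 len=4\nACGT\n>contig_2 len=4\nTTTT\n"

cyclospora_only = keep_contigs(assembly, hit_contig_ids(blast_rows))
print(cyclospora_only)
```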


##5) Map reads to Cyclospora-only contigs from whole genome assembly
minimap2 -ax sr [Path/To/Cyclo/Assembly] \
    [Path/To/Trimmed/Forward/Fastq/Reads] [Path/To/Trimmed/Reverse/Fastq/Reads] \
    | samtools view -S -b - > [name].bam
samtools view -b -F 4 [name].bam > [name]_mapped.bam
samtools sort -n [name]_mapped.bam -o [name]_mapped_sorted.bam
samtools fastq -@ 8 [name]_mapped_sorted.bam -1 [name]_mapped_R1.fastq \
    -2 [name]_mapped_R2.fastq

##Use the mapped reads to remake assembly until no changes are observed

##Hybrid Assemblies (Oxford Nanopore MinION AND Illumina MiSeq Data)

##1) Basecalling and Demultiplexing of MinION Reads with Guppy
guppy_basecaller -i [Path/To/Fast5/Files] -s [Path/To/Output/Folder] \
    --flowcell [FLO-MIN106] --kit [SQK-RPB004] --qscore_filtering \
    --min_qscore 7 -x auto
guppy_barcoder -i [Path/To/GuppyBasecaller/Pass] \
    -s [Path/To/Output/Demultiplexing/Folder] \
    --barcode_kits SQK-RPB004 --trim_barcodes -x auto

##2) Commands to Perform Hybrid Assemblies
#HybridSPAdes
spades.py --careful -1 [Path/To/Trimmed/Forward/Reads] \
    -2 [Path/To/Trimmed/Reverse/Reads] --nanopore [Path/To/Nanopore/Reads] \
    -t [num_threads] -o [Path/To/Output/Folder]

#Pilon
#Assemble the nanopore reads with Canu, then polish the assembly with Pilon using the MiSeq reads
canu -p [name] genomeSize=44.4m -nanopore-raw [Path/To/Nanopore/Reads]
minimap2 -ax sr [Path/To/Genome/Assembly] [Path/To/MiSeq/Forward/Reads] \
    [Path/To/MiSeq/Reverse/Reads] | samtools sort -o [name]_sorted.bam
samtools index [name]_sorted.bam
pilon -Xmx200g --genome [Path/To/Genome/Assembly] --frags [name]_sorted.bam \
    --output [name]

#Unicycler
unicycler -1 [Path/To/MiSeq/Forward/Reads] -2 [Path/To/MiSeq/Reverse/Reads] \
    -l [Path/To/Nanopore/Reads] -o [Path/To/Output/Dir]

##3) Map reads to Cyclospora-only contigs from whole genome assembly
#Short reads (Illumina MiSeq)
minimap2 -ax sr [Path/To/Cyclo/Assembly] \
    [Path/To/Trimmed/Forward/Fastq/Reads] [Path/To/Trimmed/Reverse/Fastq/Reads] | \
    samtools view -S -b - > [name].bam
samtools view -b -F 4 [name].bam > [name]_mapped.bam
samtools sort -n [name]_mapped.bam -o [name]_mapped_sorted.bam
samtools fastq -@ 8 [name]_mapped_sorted.bam -1 [name]_mapped_R1.fastq \
    -2 [name]_mapped_R2.fastq

#Long reads (Oxford Nanopore MinION)
minimap2 -ax map-ont [Path/To/Cyclo/Assembly] [Path/To/Nanopore/Reads] | \
    samtools view -S -b - > [name].bam
samtools view -b -F 4 [name].bam > [name]_mapped.bam
samtools sort -n [name]_mapped.bam -o [name]_mapped_sorted.bam
samtools fastq -@ 8 [name]_mapped_sorted.bam -1 [name]_mapped_R1.fastq \
    -2 [name]_mapped_R2.fastq

##Use the mapped reads to remake assembly until no changes are observed
