bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
1 Title: Strains used in whole organism Plasmodium falciparum vaccine trials differ in 2 genome structure, sequence, and immunogenic potential 3 4 Authors: Kara A. Moser1‡, Elliott F. Drábek1, Ankit Dwivedi1, Jonathan Crabtree1, Emily 5 M. Stucke2, Antoine Dara2, Zalak Shah2, Matthew Adams2, Tao Li3, Priscila T. 6 Rodrigues4, Sergey Koren5, Adam M. Phillippy5, Amed Ouattara2, Kirsten E. Lyke2, Lisa 7 Sadzewicz1, Luke J. Tallon1, Michele D. Spring6, Krisada Jongsakul6, Chanthap Lon6, 8 David L. Saunders6¶, Marcelo U. Ferreira4, Myaing M. Nyunt2§, Miriam K. Laufer2, Mark 9 A. Travassos2, Robert W. Sauerwein7, Shannon Takala-Harrison2, Claire M. Fraser1, B. 10 Kim Lee Sim3, Stephen L. Hoffman3, Christopher V. Plowe2§, Joana C. Silva1,8 11 12 Corresponding author: Joana C. Silva, [email protected]
13 Affiliations: 14 1 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, 15 MD, 21201 16 2 Center for Vaccine Development and Global Health, University of Maryland School of 17 Medicine, Baltimore, MD, 21201 18 3 Sanaria, Inc. Rockville, Maryland, 20850 19 4 Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, 20 São Paulo, Brazil 21 5 Genome Informatics Section, Computational and Statistical Genomics Branch, 22 National Human Genome Research Institute, Bethesda, Maryland, USA 20892 23 6 Armed Forces Research Institute of Medical Sciences, Department of Bacterial and 24 Parasitic Diseases, Bangkok, Thailand 25 7 Department of Medical Microbiology, Radboud University Medical Center, Nijmegen, 26 Netherlands 27 8 Department of Microbiology and Immunology, University of Maryland School of 28 Medicine, Baltimore, MD, 21201 29 30 ‡Current address: Institute for Global Health and Infectious Diseases, University of 31 North Carolina Chapel Hill 32 ¶Current address: Warfighter Expeditionary Medicine and Treatment, US Army Medical 33 Material Development Activity 34 §Current address: Duke Global Health Institute, Duke University, Durham, North 35 Carolina, 27708 36 37 Author emails (in order of author list above): 38 [email protected], [email protected], [email protected], 39 [email protected], [email protected], [email protected], 40 [email protected], [email protected], [email protected], 41 [email protected] , [email protected], [email protected], 42 [email protected], [email protected], 43 [email protected], [email protected], 44 [email protected], [email protected], [email protected], 45 [email protected], [email protected], [email protected],
1
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
46 [email protected] , [email protected], 47 [email protected], [email protected], 48 [email protected], [email protected], [email protected], 49 [email protected], [email protected] 50
51
52
53 Abstract
54 Background: Plasmodium falciparum (Pf) whole-organism sporozoite vaccines have
55 provided excellent protection against controlled human malaria infection (CHMI) and
56 naturally transmitted heterogeneous Pf in the field. Initial CHMI studies showed
57 significantly higher durable protection against homologous than heterologous strains,
58 suggesting the presence of strain-specific vaccine-induced protection. However,
59 interpretation of these results and understanding of their relevance to vaccine efficacy
60 (VE) have been hampered by the lack of knowledge on genetic differences between
61 vaccine and CHMI strains, and how these strains are related to parasites in malaria
62 endemic regions.
63 Methods: Whole genome sequencing using long-read (Pacific Biosciences) and short-
64 read (Illumina) sequencing platforms was conducted to generate de novo genome
65 assemblies for the vaccine strain, NF54, and for strains used in heterologous CHMI (7G8
66 from Brazil, NF166.C8 from Guinea, and NF135.C10 from Cambodia). The assemblies
67 were used to characterize sequence polymorphisms and structural variants in each strain
68 relative to the reference Pf 3D7 (a clone of NF54) genome. Strains were compared to
69 each other and to clinical isolates from South America, Sub-Saharan Africa, and
70 Southeast Asia.
2
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
71 Results: While few variants were detected between 3D7 and NF54, we identified tens of
72 thousands of variants between NF54 and the three heterologous strains both genome-
73 wide and within regulatory and immunologically important regions, including in pre-
74 erythrocytic antigens that may be key for sporozoite vaccine-induced protection.
75 Additionally, these variants directly contribute to diversity in immunologically important
76 regions of the genomes as detected through in silico CD8+ T cell epitope predictions. Of
77 all heterologous strains, NF135.C10 consistently had the highest number of unique
78 predicted epitope sequences when compared to NF54, while NF166.C8 had the lowest.
79 Comparison to global clinical isolates revealed that these four strains are representative
80 of their geographic region of origin despite long-term culture adaptation; of note,
81 NF135.C10 is from an admixed population, and not part of recently-formed drug resistant
82 subpopulations present in the Greater Mekong Sub-region.
83 Conclusions: These results are assisting the interpretation of VE of whole-organism
84 vaccines against homologous and heterologous CHMI, and may be useful in informing
85 the choice of strains for inclusion in region-specific or multi-strain vaccines.
86
87 Keywords: P. falciparum, malaria, genome assembly, PfSPZ Vaccine, whole-sporozoite
88 vaccine
89
90
3
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
91 Main Text
92 Introduction
93 The flattening levels of mortality and morbidity due to malaria in recent years (1),
94 which follow a decade in which malaria mortality was cut in half (2), highlight the pressing
95 need for new tools to control the disease. A highly efficacious vaccine against
96 Plasmodium falciparum, the deadliest malaria parasite, would be a critical development
97 for control and elimination efforts. Several variations of a highly promising pre-
98 erythrocytic, whole organism malaria vaccine based on P. falciparum sporozoites (PfSPZ)
99 are under development, all based on the same P. falciparum strain, NF54 (3), thought to
100 be of West African origin, and which use different mechanisms for attenuation of PfSPZ.
101 Of these vaccine candidates, Sanaria® PfSPZ Vaccine, based on radiation-attenuated
102 sporozoites, has progressed furthest in clinical trial testing (4–10). Other whole-organism
103 vaccine candidates, including chemo-attenuated (Sanaria® PfSPZ-CVac), transgenic
104 and genetically attenuated sporozoites, are in earlier stages of development (11–13).
105 PfSPZ Vaccine showed 100% short-term protection against homologous
106 controlled human malaria infection (CHMI) in an initial phase 1 clinical trial (6), and
107 subsequent trials have confirmed that high levels of protection can be achieved against
108 both short-term (8) and long-term (7) homologous CHMI. However, depending on the
109 immunization regimen, sterile protection can be significantly lower (8-83%) against
110 heterologous CHMI using the 7G8 Brazilian clone (8,9), and against infection in malaria-
111 endemic regions with intense seasonal malaria transmission (29% and 52% by
112 proportional and time to event analysis, respectively) (10). Heterologous CHMI in
113 chemoprophylaxis with sporozoite (CPS) studies, in which immunization is by infected
4
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
114 mosquito bite of individuals undergoing malaria chemoprophylaxis, have been conducted
115 with NF135.C10 from Cambodia (14) and NF166.C8 from Guinea (15), and have had
116 lower efficacy than against homologous CHMI (16,17). One explanation for the lower
117 efficacy seen against heterologous P. falciparum strains is the extensive genetic diversity
118 in this parasite species, which is particularly high in genes encoding antigens (18), which
119 combined with low vaccine efficacy against non-vaccine alleles (19–21) reduces overall
120 protective efficacy and complicates the design of broadly efficacious vaccines (22,23).
121 The lack of a detailed genomic characterization of the P. falciparum strains used in CHMI
122 studies, and the unknown genetic basis of the parasite targets of PfSPZ Vaccine- and
123 PfSPZ CVac-induced protection, have precluded a conclusive statement regarding the
124 cause(s) of variable vaccine efficacy outcomes.
125 The current PfSPZ Vaccine strain, NF54, was isolated from a patient in the Netherlands
126 who had never left the country and is considered a case of “airport malaria”; the exact origin of
127 NF54 is unknown (3), but thought to be from Africa (24,25). NF54 is also the isolate from which
128 the P. falciparum 3D7 reference strain was cloned (26), and hence, despite having been
129 separated in culture for over 30 years, NF54 and 3D7 are assumed to be genetically identical,
130 and 3D7 is often used in homologous CHMI (6,8). Several issues hinder the interpretation of both
131 homologous and heterologous CHMI experiments conducted to date. It remains to be confirmed
132 that 3D7 has remained genetically identical to NF54 genome-wide, or that the two are at least
133 identical immunogenically. Indeed, NF54 and 3D7 have several reported phenotypic differences
134 when grown in culture, including the variable ability to produce gametocytes (27). In addition,
135 7G8, NF166.C8, and NF135.C10 have not been rigorously compared to each other or to NF54 to
136 confirm that they are adequate heterologous strains, even though they do appear to have distinct
137 infectivity phenotypes when used as CHMI strains (15). While the entire sporozoite likely offers
138 multiple immunological targets, no high-confidence correlates of protection currently exist. In part
5
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
139 because of the difficulty of studying hepatic parasite forms and their gene expression profiles in
140 humans, it remains unclear which parasite proteins are recognized by the human immune system
141 during that stage, and elicit protection, upon immunization with PfSPZ vaccines. Both humoral
142 and cell-mediated responses have been associated with protection against homologous CHMI
143 (6,7), although studies in rodents and non-human primates point to a requirement for cell-
144 mediated immunity (specifically through tissue-resident CD8+ T cells) in long-term protection
145 (5,9,28,29). In silico identification of CD8+ epitopes in all strains could highlight critical differences
146 of immunological significance between strains. Finally, heterologous CHMI results cannot be a
147 reliable indicator of efficacy against infection in field settings unless the CHMI strains used are
148 characteristic of the geographic region from which they originate. These issues could impact the
149 use of homologous and heterologous CHMI, and the choice of strains for these studies, to predict
150 the efficacy of PfSPZ-based vaccines in the field (30).
151 These knowledge gaps can be addressed through a rigorous description and
152 comparison of the genome sequence of these strains. High-quality de novo assemblies
153 allow characterization of genome composition and structure, as well as the identification
154 of genetic differences between strains. However, the high AT content and repetitive
155 nature of the P. falciparum genome greatly complicates genome assembly methods (31).
156 Recently, long-read sequencing technologies have been used to overcome some of these
157 assembly challenges, as was shown with assemblies for 3D7, 7G8, and several other
158 culture-adapted P. falciparum strains generated using Pacific Biosciences (PacBio)
159 technology (32–34). However, NF166.C8 and NF135.C10 still lack whole-genome
160 assemblies; in addition, while an assembly for 7G8 is available (33), it is important to
161 characterize the specific 7G8 clone used in heterozygous CHMI, from Sanaria’s working
162 bank, as strains can undergo genetic changes over time in culture (35). Here, reference
6
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
163 assemblies for NF54, 7G8, NF166.C8, and NF135.C10 (hereafter referred to as PfSPZ
164 strains) were generated using approaches to take advantage of the resolution power of
165 long-read sequencing data and the low error rate of short-read sequencing platforms.
166 These de novo assemblies allowed for the thorough genetic and genomic characterization
167 of the PfSPZ strains and will aid in the interpretation of results from CHMI studies.
168 Materials and Methods
169 Study design and samples
170 This study characterized and compared the genomes of four P. falciparum strains
171 used in whole organism malaria vaccines and controlled human malaria infections using
172 a combination of long- and short-read whole genome sequencing platforms (see below).
173 In addition, these strains were compared to P. falciparum clinical isolates collected from
174 patients in malaria-endemic regions globally, using short-read whole genome sequencing
175 data. Genetic material for the four PfSPZ strains were provided by Sanaria, Inc. Clinical
176 P. falciparum isolates from Brazil, Mali, Malawi, Myanmar, Thailand, Laos, and Cambodia
177 were collected between 2009 and 2016 from cross-sectional surveys of malaria burden,
178 longitudinal studies of malaria incidence, and drug efficacy studies done in collaboration
179 with the Division of Malaria Research at the University of Maryland Baltimore, or were
180 otherwise provided by collaborators (Supplemental Dataset 1). All samples met the
181 inclusion criteria of the initial study protocol with prior approval from the local ethical
182 review board. Parasite genomic sequencing and analyses were undertaken after prior
183 approval of the University of Maryland School of Medicine Institutional Review Board.
184 These isolates were obtained by venous blood draws; almost all samples were processed
185 using leukocyte depletion methods to improve the parasite-to-human ratio before
7
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
186 sequencing. The exceptions were samples from Brazil and Malawi, which were not
187 leukocyte depleted upon collection. These samples underwent a selective whole genome
188 amplification step before sequencing, modified from (36) (the main modification being a
189 DNA dilution and filtration step using vacuum filtration prior to selective whole genome
190 amplification). Cambodia isolates had been sequenced for a previous study (37). In
191 addition to these samples, samples for which whole genome short-read sequencing has
192 previously been generated were obtained from NCBI’s Short Read Archive (SRA) to
193 supplement malaria-endemic regions not represented in our data set (Peru, Columbia,
194 French Guiana, Guinea, Papua New Guinea) and regions where PfSPZ trials are ongoing
195 (Burkina Faso, Kenya, and Tanzania) (Supplemental Dataset 1).
196 Whole genome sequencing
197 Genetic material for whole genome sequencing of the PfSPZ strains was
198 generated from a cryovial of each strain’s cell bank with the following identifiers: NF54
199 Working Cell Bank (WCB): SAN02-073009; 7G8 WCB: SAN02-
200 021214; NF135.C10 WCB: SAN07-010410; NF166.C8 Mother Cell Bank: SAN30-
201 020613. Each cryovial was thawed and maintained in human O+ red blood cells
202 (RBCs) at 2% hematocrit (Hct) in complete growth medium (RPMI 1649 with L-glutamine
203 and 25 mM HEPES supplemented with 10% human O+ serum and hypoxanthine) in a 6-
204 well plate in 5% O2, 5% CO2 and 90% N2 at
205 37°C. The cultures were then further expanded by adding fresh RBCs every 3-4
206 days and increased culture hematocrit (Hct) to 5% Hct using a standard method (38).
207 The complete growth medium was replaced daily. When the PfSPZ strain culture volume
208 reached 300-400 mL and a parasitemia of more than 1.5%, the culture suspensions
8
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
209 were collected and the parasitized RBCs were pelleted down by centrifugation at 1800
210 rpm for 5 min. Aliquots of 0.5 mL per cryovial of the parasitized RBCs were stored at -
211 80°C prior to extraction of genomic DNA. Genomic DNA was extracted using the Qiagan
212 Blood DNA Midi Kit (Valencia, CA, USA). Pacific Biosciences (PacBio) sequencing was
213 done for each PfSPZ strain. Total DNA was prepared for PacBio sequencing using the
214 DNA Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA). DNA was fragmented
215 with the Covaris E210, and the fragments were size selected to include those >15 Kbp in
216 length. Libraries were prepared per the manufacturer’s protocol. Four SMRT cells were
217 sequenced per library, using P6C4 chemistry and a 120-minute movie on the PacBio RS
218 II (Pacific Biosystems, Menlo Park, CA).
219 Short-read sequencing was done for each PfSPZ strain and for our collection of
220 clinical isolates using the Illumina HiSeq 2500 or 4000 platforms. Prepared genomic DNA,
221 extracted from cultured parasites, leukocyte-depleted samples, or from samples that
222 underwent sWGA (see above), was used to construct DNA libraries for sequencing on
223 the Illumina platform using the KAPA Library Preparation Kit (Kapa Biosystems, Woburn,
224 MA). DNA was fragmented with the Covaris E210 or E220 to ~200 bp. Libraries were
225 prepared using a modified version of manufacturer’s protocol. The DNA was purified
226 between enzymatic reactions and the size selection of the library was performed with
227 AMPure XT beads (Beckman Coulter Genomics, Danvers, MA). When necessary, a PCR
228 amplification step was performed with primers containing an index sequence of six
229 nucleotides in length. Libraries were assessed for concentration and fragment size using
230 the DNA High Sensitivity Assay on the LabChip GX (Perkin Elmer, Waltham, MA). Library
231 concentrations were also assessed by qPCR using the KAPA Library Quantification Kit
9
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
232 (Complete, Universal) (Kapa Biosystems, Woburn, MA). The libraries were pooled and
233 sequenced on a 100 bp paired-end Illumina HiSeq 2500 or 4000 run (Illumina, San Diego,
234 CA).
235 Assembly generation and characterization of PfSPZ strains
236 Canu (v1.3) (39) was used to correct and assemble the PacBio reads
237 (corMaxEvidenceErate=0.15 for AT-rich genomes, and using default parameters
238 otherwise). To optimize downstream assembly correction processes and parameters, the
239 percentage of total differences (both in bp and by proportion of the 3D7 genome not
240 captured by the NF54 assembly) between the NF54 assembly and the 3D7 reference
241 (PlasmoDBv24) was calculated after each round of correction. Quiver (smrtanalysis v2.3)
242 (40) was run iteratively with default parameters to reach a (stable) maximum reduction in
243 percent differences between the two genomes and the assemblies were further corrected
244 with Illumina data using Pilon (v1.13) (41) with the following parameters: --fixbases, --
245 mindepth 5, --K 85, --minmq 0, and --minqual 35.
246 Assemblies were compared to the 3D7 reference (PlasmoDBv24) using
247 MUMmer’s nucmer (42), and the show-snps function was used to generate a list of SNPs
248 and small (< 50 bp) indels between assemblies. Coding and non-coding variants were
249 classified by comparing the show-snps output with the 3D7 gff3 file using custom scripts.
250 Structural variants, defined as indels, deletions, and tandem or repeat expansion and
251 contractions each greater than 50 bp in length were identified using the nucmer-based
252 Assemblytics tool (43) (unique anchor length: 1 Kbp). Translocations were identified by
253 eye through inspection of mummerplots. Translocations and other large structural
254 variants were confirmed using read-mapping data of the PacBio reads against either the
10
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
255 3D7 reference genome or the respective PfSPZ strain assembly using the NGLMR aligner
256 (44) and visualized using Ribbon (45). Reconstructed exon 1 sequences for var genes,
257 encoding P. falciparum erythrocyte membrane protein 1 (PfEMP1) antigens, for each
258 PfSPZ strain were recovered using the ETHA package (46). As a check for var exon 1
259 sequences that were missed during the generation of the strain’s assembly, a targeted
260 read capture and assembly approach was done using a strain’s Illumina data, wherein
261 var-like reads for each PfSPZ strains were identified by mapping reads against a
262 database of known var exon 1 sequences (47) using bowtie2 (48). Reads that mapped to
263 a known exon 1 sequence were then assembled with Spades (v3.9.0) (49), and the
264 assembled products were blasted against the PacBio reads to identify if they were truly
265 missed exon 1 sequences by the de novo assembly process, or if instead they were
266 chimeras reconstructed by the targeted assembly process. To identify homologs of the
267 61 var exon 1 sequences found in the 3D7 genome, each PfSPZ strain’s set of exon 1
268 amino acid sequences were compared to the 3D7 set using blastp.
269 In silico MHC I epitope predictions
270 Given the reported importance of CD8+ T cell responses towards immunity to
271 whole sporozoites, MHC class 1 epitopes of length 9 amino acids (9mers) were predicted
272 with NetMHCpan (v3.0) (50) for each PfSPZ strain, using inferred amino acid sequences
273 in each genome. HLA types common to African countries where PfSPZ or PfSPZ-CVac
274 trials are ongoing were used for epitope predictions based on frequencies in the Allele
275 Frequency Net Database (51) or from the literature (52,53) (Table S1). Shared epitopes
276 between NF54 and the three heterologous PfSPZ strains were calculated by first
277 identifying epitopes in each gene, and then removing duplicate epitope sequence entries
11
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
278 (caused by recognition by multiple HLA types). Identical epitope sequences that were
279 identified in two or more genes were treated as distinct epitope entries, and all unique
280 epitope-gene combinations were included when calculating the number of shared
281 epitopes between strains.
282 Epitopes were predicted in several gene sets. The first was all protein-coding
283 genes in which no premature stop codons were detected across all four PfSPZ strains
284 (n=4,814). The second was a subset of the above that are expressed as proteins during
285 the sporozoite stage of the parasite in the mosquito salivary glands, as supported by
286 proteomic data (n=1,974) (54,55). Lastly, a list of pre-erythrocytic genes of interest was
287 generated, either from a review of the literature, or genes whose products were
288 recognized by sera from protected vaccinees of whole organism malaria vaccine trials
289 (both PfSPZ and PfSPZ-CVac) (n=42) (11,56). These gene sequences were inspected
290 manually in each assembly, and corrected for any remaining PacBio sequencing error,
291 particularly one bp indels in homopolymer runs. Even though the 42 genes listed above
292 were detected through antibody responses, many have also been shown to have T cell
293 epitopes, such as CSP and LSA-1.
294 Read mapping and SNP calling
295 For the full collection of clinical isolates that had whole genome short-read
296 sequencing data (generated either at IGS or downloaded from SRA), reads were aligned
297 to the 3D7 reference genome (PlasmoDBv24) using bowtie2 (v2.2.4) (48). Samples with
298 less than 10 million reads mapping to the reference were excluded, as samples with less
299 than this amount had reduced coverage across the genome. Bam files were processed
300 according to GATK’s Best Practices documentation (57–59). Joint SNP calling was done
12
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
301 using Haplotype Caller (v4.0). Because clinical samples may be polyclonal (that is, more
302 than one parasite strain may be present), diploid calls were initially allowed, followed by
303 calling the major allele at positions with heterozygous calls. If the major allele was
304 supported by >70% of reads at a heterozygous position, the major allele was assigned
305 as the allele at that position (otherwise, the genotype was coded as missing). Additional
306 hard filtering was done to remove potential false positives based on the following filter:
307 DP < 12 || QUAL < 50 || FS > 14.5 || MQ < 20, and variants were further filtered to remove
308 those for which the non-reference allele was not present in at least three samples
309 (frequency less than ~0.5%), and those with more than 10% missing genotype values
310 across all samples.
311 Principal Coordinates Analyses and Admixture Analyses
312 A matrix of pairwise genetic distances was constructed from biallelic non-
313 synonymous SNPs identified from the above pipeline (n=31,761) across all samples
314 (n=654) using a custom Python script, and principal coordinates analyses (PCoAs) were
315 done to explore population structure using cmdscale in R. Additional population structure
316 analyses were done using Admixture (v1.3) (60) on two separate data sets: South
317 America and Africa clinical isolates plus NF54, NF166.C8, and 7G8 (n=461), and
318 Southeast Asia and Oceania plus NF135.C10 (n=193). The data sets were additionally
319 pruned for sites in linkage disequilibrium (window size of 20 Kbp, window step of 2 Kbp,
320 R2 0.1). The final South America/Africa and Southeast Asia/Oceania data set used for
321 the admixture analysis consisted of 16,802 and 5,856 SNPs, respectively. The number of
322 populations, K, was tested for values between K=1 to K=15 and run with 10 replicates for
323 each K. For each population, the cross-validation (CV) error from the replicate with the
13
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
324 highest log-likelihood value was plotted, and the K with the lowest CV value was chosen
325 as the final K.
326 To compare subpopulations identified in our Southeast Asia/Oceania admixture
327 analysis with previously described ancestral, resistant, and admixed subpopulations from
328 Cambodia (61), the above non-synonymous SNP set was used before pruning for LD
329 (n=11,943), and was compared to a non-synonymous SNP dataset (n=21,257) from 167
330 samples used by Dwivedi et al. (62) to describe eight Cambodian subpopulations, in an
331 analysis that included a subset of samples used by Miotto et al. (61) (who first
332 characterized the population structure in Cambodia). There were 5,881 shared non-
333 synonymous SNPs between the two datasets, 1,649 of which were observed in
334 NF135.C10. A pairwise genetic distance matrix (estimated as the proportion of base-pair
335 differences between pairs of samples, not including missing genotypes) was generated
336 from the 5,881 shared SNP set, and a dendrogram was built using Ward minimum
337 variance methods in R (Ward.D2 option of the hclust function).
338 Results
339 Generation of assemblies and validation of the assembly protocol
340 To characterize genome-wide structural and genetic diversity of the PfSPZ strains,
341 genome assemblies were generated de novo using whole genome long-read (PacBio)
342 and short-read (Illumina) sequence data (Methods; Table S2; Table S3). Taking
343 advantage of the parent isolate-clone relationship between NF54 and 3D7, we used NF54
344 as a test case to derive the assembly protocol, by adopting, at each step, approaches
345 that minimized the difference to 3D7. The raw assembly of the NF54 genome, while
346 containing 99.9% of the 3D7 genome in 30 contigs, showed ~78K sequence variants
14
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
347 relative to 3D7 (Figure S1). Many of these differences were small (1-3) base pair (bp)
348 indels (insertions or deletions), particularly in AT-rich regions and homopolymer runs,
349 strongly indicative of remaining sequencing errors (63). To maximize error removal with
350 existing tools, Quiver was run iteratively on the NF54 assembly; two consecutive Quiver
351 runs consistently and substantially reduced the number of differences between NF54 and
352 3D7 (Figure S1). Illumina reads from NF54 were used to further polish the assembly
353 using Pilon. The resulting NF54 assembly is 23.4 Mb in length (Figure 1) and, when
354 compared to the 3D7 reference, contains 8,396 indels and 1,386 single nucleotide
355 polymorphisms (SNPs), a rate of ~59 SNPs per million base pairs, the vast majority of
356 which occurred in noncoding regions (Table 1). In fact, only 17 non-synonymous
357 mutations were detected in 15 intact protein-coding single-copy genes between the two
358 genomes (Supplemental Dataset 2). We applied the same assembly protocol to the
359 remaining PfSPZ strains. The resulting assemblies for NF166.C8, 7G8, and NF135.C10
360 are structurally very complete, with the 14 nuclear chromosomes represented by 30, 20,
361 and 19 nuclear contigs, respectively, with each chromosome in the 3D7 reference
362 represented by one to three contigs (Figure 1). Several shorter contigs in NF54 (67,501
363 bps total), NF166.C8 (224,502 bps total), and NF135.C10 (80,944 bps total) could not be
364 unambiguously assigned to a homologous segment in the 3D7 reference genome; gene
365 annotation showed that these contigs mostly contain members of multi-gene families, and
366 therefore are likely part of sub-telomeric regions. The cumulative lengths of the four
367 assemblies ranged from 22.8 Mbp to 23.5 Mbp (Table 1), indicating variation in genome
368 size among P. falciparum strains. In particular, the 7G8 assembly was several hundred
369 thousand base-pairs smaller than the other three assemblies. To confirm that this was
15
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
370 not an assembly error, we compared 7G8 to a previously published 7G8 PacBio-based
371 assembly (PlasmoDBv41) (33). The two assemblies were extremely close in overall
372 genome structure, differing only by 2 Kbp in cumulative length, and also shared a very
373 similar number of SNP and small indel variants relative to 3D7 (Table S4).
374 Structural variations in the genomes of the PfSPZ strains
375 Many structural variants (defined as indels or tandem repeat contractions or
376 expansions, greater than 50 bp) were identified in each assembly by comparison to the
377 3D7 genome, impacting a cumulative length of 199.0 Kbp in NF166.C8 to 340.9 Kbp in
378 NF135.C10 (Table S5). Many smaller variants fell into coding regions (including known
379 pre-erythrocytic antigens), often representing variation in repeat units (Supplemental
380 Dataset 3). Several larger structural variants (>10 Kbp) exist in 7G8, NF166.C8, and
381 NF135.C10 relative to 3D7. Many of these regions contain members of multi-gene
382 families, and as expected the number of these vary between each assembly (Figure S2).
383 While all 61 3D7 var exon 1 sequences were present as almost identical sequences in
384 the NF54 assembly (with the exception of small indels in several exon 1 sequences), the
385 number of 3D7 exon 1 var sequences with identifiable orthologs in the three heterologous
386 PfSPZ strains was much lower. Only a few of the identified orthologs had amino acid
387 sequence identity higher than 75% (Figure S2), reflecting the exceptional sequence
388 diversity that occurs in this gene family in and among P. falciparum strains. A recently
389 characterized PfEMP1 protein expressed on the surface of NF54 sporozoites
390 (NF54varsporo) was shown to be involved in hepatocyte invasion (Pf3D7_0809100), and
391 antibodies to this PfEMP1 blocked hepatocyte invasion (64). No ortholog to NF54varsporo
392 was identified in the var repertoire of 7G8, NF166.C8, or NF135.C10 (Figure S2). It is
16
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
393 possible that a different, strain-specific, var gene fulfills a similar role in each of the
394 heterologous PfSPZ strains.
395 Several other large structural variants impact regions housing non-multi-gene
396 family members, although none known to be involved in pre-erythrocytic immunity.
397 Examples include a 31 Kbp-long tandem expansion of a region of chromosome 12 in the
398 7G8 assembly (also present in the previously published assembly for 7G8 (33)) and a
399 22.7 Kbp-long repeat expansion of a region of chromosome 5 in NF135.C10, both of
400 which are supported by ~200 PacBio reads. The former is a segmental duplication
401 containing a vacuolar iron transporter (PF3D7_1223700), a putative citrate/oxoglutarate
402 carrier protein (PF3D7_1223800), a putative 50S ribosomal protein L24
403 (PF3D7_1223900), GTP cyclohydrolase I (PF3D7_1224000), and three conserved
404 Plasmodium proteins of unknown function (PF3D7_1223500, PF3D7_1223600,
405 PF3D7_1224100). The expanded region in NF135.C10 represents a tandem expansion
406 of a segment housing the gene encoding the Pf multidrug resistance protein, PfMDR1
407 (PF3D7_0523000), resulting in a total of four copies of this gene in NF135.C10. Other
408 genes in this tandem expansion include those encoding an iron-sulfur assembly protein
409 (PF3D7_0522700), a putative pre-mRNA-splicing factor DUB31 (PF3D7_0522800), a
410 putative zinc finger protein (PF3D7_0522900) and a putative mitochondrial-processing
411 peptidase subunit alpha protein (PF3D7_0523100). In addition, the NF135.C10 assembly
412 contained a large translocation involving chromosomes 7 (3D7 coordinates ~520,000 to
413 ~960,000) and 8 (start to coordinate ~440,000) (Figure S3); the boundaries of the
414 translocated regions contain members of multi-gene families. Independent assemblies
415 and mapping patterns observed by aligning the PacBio reads against the NF135.C10 and
17
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
416 3D7 assemblies support the translocation (Figure S4). Furthermore, the regions of
417 chromosomes 7 and 8 where the translocation occurs are documented recombination
418 hotspots that were identified specifically in isolates from Cambodia, the site of origin of
419 NF135.C10 (65).
420 Several structural differences in genic regions were also identified between the
421 NF54 assembly and the 3D7 genome (Supplemental Dataset 3); if real, these structural
422 variants would have important implications in the interpretation of trials using 3D7 as a
423 homologous CHMI strain. For example, a 1,887 bp tandem expansion was identified in
424 the NF54 assembly on chromosome 10, which overlapped the region containing liver
425 stage antigen 1 (PfLSA-1, PF3D7_1036400). The structure of this gene in the NF54 strain
426 was reported when PfLSA-1 was first characterized, with unique N- and C-terminal
427 regions flanking a repetitive region consisting of several dozen repeats of a 17 amino acid
428 motif (66,67). The CDS of PfLSA-1 in the NF54 assembly was 5,406 bp in length but only
429 3,489 bp-long in the 3D7 reference. To determine if this was an assembly error in the
430 NF54 assembly, the PfLSA-1 locus from a recently published PacBio-based assembly of
431 3D7 (32) was compared to that of NF54. The two sequences were identical, likely
432 indicative of incorrect collapsing of the repeat region of PfLSA-1 in the 3D7 reference;
433 NF54 and 3D7 PacBio-based assemblies had 79 units of the 17-mer amino acid repeat,
434 compared to only 43 in the 3D7 reference sequence, a result further validated by the
435 inconsistent depth of mapped Illumina reads from NF54 between the PfLSA repeat region
436 and its flanking unique regions in the 3D7 reference (Figure S5). Using the same
437 comparative rationale as above, where structural differences relative to the 3D7
438 reference, but common to both PabBio-based 3D7 and NF54 assemblies, are considered
18
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
439 true, all structural variations identified in NF54 likely represent existing structural errors in
440 the 3D7 reference genome, including 15 variants that affect genic regions (Figure S6).
441 Small sequence variants between PfSPZ strains and the reference 3D7 genome
442 Very few small sequence variants were identified in NF54 compared to the 3D7
443 reference; 17 non-synonymous mutations were present in 15 single-copy non-
444 pseudogene-encoding loci (Supplemental Dataset 2). Short indels were detected in 185
445 genes; many of these indels have length that is not multiple of three and occur in
446 homopolymer runs, possibly representing remaining PacBio sequencing error. However,
447 some may be real, as a small indel causing a frameshift in PF3D7_1417400, a putative
448 protein-coding pseudogene that has previously been shown to accumulate premature
449 stop codons in laboratory-adapted strains (68), and some may be of biologic importance,
450 such as those seen in two histone-related proteins (PF3D7_0823300 and
451 PF3D7_1020700). It has been reported that some clones of 3D7, unlike NF54, are unable
452 to consistently produce gametocytes in long-term culture (27,69); no polymorphisms were
453 observed within or directly upstream of PfAP2-G (PF3D7_1222600) (Table S6), which
454 has been identified as a transcriptional regulator of sexual commitment in P. falciparum
455 (70). However, a non-synonymous mutation from arginine to proline (R1286P) was
456 observed in a AP2-coincident C-terminal domain of PfAP2-L (PF3D7_0730300), a gene
457 associated with liver stage development (71); a proline was also present at the
458 homologous position in 7G8, NF166.C8, and NF135.C10. Two additional putative AP2
459 transcription factor genes (PF3D7_0613800 and PF3D7_1317200) contained a three bp
460 deletion in NF54 relative to 3D7. Although knock-downs of PF3D7_1317200 have
461 resulted in a loss of gametocyte production (72), neither indel was located within predicted
19
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
462 AP2 domains, and so their biological relevance remains to be determined. 7G8, NF66.C8,
463 and NF135.C10 also have numerous non-synonymous mutations and indels within
464 putative AP2 genes, a small number of which fell in predicted AP2 domains. Finally,
465 NF135.C10 has an insertion almost 200 base-pairs in length relative to 3D7 in the 3’ end
466 of PfAP2-G; the insertion also carries a premature stop codon, leading to a considerably
467 different C-terminal end for the transcription factor (Figure S7). This allele is also present
468 in previously published assemblies for clones from Southeast Asia (33), including the
469 culture-adapted strain DD2, and variations of this indel (without the in-frame stop codon)
470 are also found in several non-human malaria Plasmodium species) (Figure S7),
471 suggesting an interesting evolutionary trajectory of this sequence.
472 Given that no absolute correlates of protection are known for whole organism P.
473 falciparum vaccines, genetic differences were assessed both across the genome and in
474 pre-erythrocytic genes of interest in the three heterologous CHMI strains. As expected,
475 the number of mutations between 3D7 and these three PfSPZ strains was much higher
476 than observed for NF54, with ~40K-55K SNPs and as many indels in each pairwise
477 comparison, roughly randomly distributed among intergenic regions, silent and non-
478 synonymous sites (Table 1, Figure 2), and corresponding to a pairwise SNP density
479 relative to 3D7 of 1.9, 2.1 and 2.2 SNPs/Kbp for 7G8, NF166.C8 and NF135.C10,
480 respectively. NF135.C10 had the highest number of unique SNPs genome-wide (SNPs
481 not shared with other PfSPZ strains), with 5% more unique SNPs than NF166.C8 and
482 33% more than 7G8 (Figure S8). Similar trends were seen when restricting the analyses
483 to non-synonymous SNPs (5% and 26% more than NF166.C8 and 7G8, respectively).
484 The lower number of unique SNPs in 7G8 may be due in part to the smaller genome size
20
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
485 of this strain. Small indels with length multiple of three (but not two) were also common in
486 coding regions across the genome, as expected from purifying selection on mutations
487 that disrupt the reading frame, whereas indels in non-coding regions were primarily of
488 lengths multiple of two, from unequal crossover or replication slippage in the regions of
489 dinucleotide repeats common in the P. falciparum genome (Figure S9), confirming what
490 has been shown in previous work using short-read whole genome sequencing data (73).
491 SNPs were also common in a panel of 42 pre-erythrocytic genes known or
492 suspected to be implicated in immunity to liver-stage parasites (see Methods; Table S7).
493 While the sequence of all these loci was identical between NF54 and 3D7, there was a
494 wide range in the number of sequence variants per locus between 3D7 and the other
495 three PfSPZ strains, with some genes being more conserved than others. For example,
496 PfCSP showed 8, 7 and 6 non-synonymous mutations in 7G8, NF166.C8, and
497 NF135.C10, respectively, relative to 3D7. However, PfLSA-1 had over 100 non-
498 synonymous mutations in all three heterologous strains relative to 3D7 (many in the
499 repetitive region of this gene), in addition to significant length differences in the internal
500 repeat region (Figure S10).
501 Immunological relevance of genetic variation among PfSPZ strains
502 The sequence variants mentioned above may impact the ability of the immune
503 system primed with NF54 to recognize the other PfSPZ strains, impairing vaccine efficacy
504 against heterologous CHMI. Data from murine and non-human primate models
505 (5,28,29,74) demonstrate that CD8+ T cells are required for protective efficacy; therefore,
506 the identification of shared and unique CD8+ T cell epitopes across the genome in all four
507 PfSPZ strains may help interpret the differential efficacy seen in heterologous relative to
21
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
508 homologous CHMI. To create a comparable set of genes in which we were confident of
509 the amino acid sequence, we subset the 5,548 protein-coding genes in the 3D7 genome
510 annotation to 4,814 genes that (1) did not have premature stop codons, 2) were annotated
511 with an expected number of mRNA fragments, and (3) met the above requirements in all
512 four PfSPZ strains (in practice, this removed most, if not all, members of multi-gene
513 families, such as vars, rifins, stevors) (Figure 3A). Strong-binding MHC class I epitopes
514 in the protein sequences from these loci were identified using in silico epitope predictions
515 based on HLA types common in sub-Saharan Africa populations (Table S1).
516 Similar total number of epitopes (sum of unique epitopes, regardless of the HLA-
517 type, across genes) were identified in the three heterologous CHMI strains, with
518 NF166.C8 having the fewest (492,291) and NF135.C10 having the most (495,285)
519 (Figure 3B). More epitopes were identified in NF54 (n=512,172) than in the three
520 heterologous CHMI strains. This slightly higher number might be due to a more accurate
521 mapping of the 3D7 genome annotation to the NF54 assembly compared to the other
522 three strains (Figure 3A). More than 80% of epitopes detected in each strain were
523 identical (by sequence and by locus) across all four strains, reflecting epitopes predicted
524 in conserved regions of the 4,814 genes used in this analysis. NF54 had the highest
525 proportion of unique epitopes (9,002, or 1.76%) (Figure 3A), i.e., epitopes not shared
526 with the other PfSPZ strains, followed by NF135.C10 (4,123 or 0.83%), 7G8 (3,500 or
527 0.71%), and NF166.C8 (2,878, or 0.58%). After direct comparisons of NF54 to each
528 heterologous PfSPZ strain, NF166 had the lowest proportion of epitopes not shared with
529 NF54 (1.97%, compared to 2.24% and 2.27% for 7G8 and NF135.C10, respectively), as
22
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
530 would follow from a positive correlation between geographic distance and genetic
531 similarity.
532 To focus the analysis on genes that might be important for the protective efficacy
533 of a whole sporozoite vaccine, the above epitope predictions were stratified by two
534 additional gene sets: 1) 1,974 genes that were found to be expressed in the sporozoite
535 stage of the parasite in mosquito salivary glands (54,55), and (2) the 42 pre-erythrocytic
536 genes of interest mentioned above. In the sporozoite-expressed gene set, similar patterns
537 as in the genome-wide gene set above were seen, with NF135.C10 being the
538 heterologous CHMI strain that had the highest number of epitopes not shared with any
539 other strain (1,243, or 0.6%), and NF166 having the smallest (884, or 0.4%) (Figure 3C).
540 Within the 42 pre-erythrocytic gene set, there were 117, 121, and 153 such unique
541 epitopes present in 7G8, NF166.C8, and NF135.C10, respectively (Figure S11, Table
542 S8). Genes where unique epitopes were identified in this gene set were also, on average,
543 more polymorphic (Table S7).
544 Some of these variations in epitope sequences are relevant for the interpretation
545 of the outcome of PfSPZ Vaccine trials. For example, while all four strains are identical in
546 sequence composition in a B cell epitope potentially relevant for protection recently
547 identified in the circumsporozoite protein (PfCSP, PF3D7_0304600) (75), another B cell
548 epitope that partially overlaps it (76) contained an A98G amino acid difference in 7G8 and
549 NF135.C10 relative to NF54 and NF166.C8. There was also variability in CD8+ T cell
550 epitopes recognized in the Th2R region of the protein, some known for decades (77).
551 Specifically, the PfCSP encoded by the 3D7/NF54 allele was predicted to bind to both
552 HLA-A and HLA-C allele types, but the homologous protein segments in NF166.C8 and
23
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
553 NF135.C10 were only recognized by HLA-A allele types, and no epitope was detected
554 (within the HLA types studied) at that position in PfCSP encoded in 7G8 (Figure 4).
555 Expansion of the analyses to additional HLA types revealed an allele (HLA-08:01) that is
556 predicted to bind to the Th2R region of the 7G8-encoded PfCSP; however, HLA-08:01 is
557 much more frequent in European populations (10-15%) than in African populations (1-
558 6%) (51). Therefore, if CD8+ T cell epitopes in the Th2R region of 7G8 are important for
559 protection, which is currently unknown, the level of protection against CHMI with 7G8
560 observed in volunteers of European descent may not be correlated with PfSPZ Vaccine
561 efficacy in Africa. Indels and repeat regions also sometimes affected the number of
562 predicted epitopes in each antigen; for example, an insertion in 7G8 near amino acid
563 residue 1,600 in PfLISP-2 (PF3D7_0405300) contained additional predicted epitopes
564 (Figure S12). Similar patterns in variation in epitope recognition and frequency were
565 found in other pre-erythrocytic genes of interest, including PfLSA-3 (PF3D7_0220000),
566 PfAMA-1 (PF3D7_1133400) and PfTRAP (PF3D7_1335900) (Figure S12).
567 PfSPZ strains and global parasite diversity
568 The four PfSPZ strains have been adapted and kept in culture for extended periods
569 of time. To determine if they are still representative of the malaria-endemic regions from
570 which they were collected, we compared these strains to over 600 recent (2007-2014)
571 clinical isolates from South America, Africa, Southeast Asia, and Oceania (Supplemental
572 Dataset 4), using principal coordinates analysis (PCoA) based on SNP calls generated
573 from Illumina whole genome sequencing data. The results confirmed the existence of
574 global geographic differences in genetic variation previously reported (78,79), including
575 clustering by continent, as well as a separation of East from West Africa and of the
24
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
576 Amazonian region from that West of the Andes (Figure 5). The PfSPZ strains clustered
577 with others from their respective geographic regions, both at the genome-wide level and
578 when restricting the data set to SNPs in the panel of 42 pre-erythrocytic antigens, despite
579 the long-term culturing of some of these strains (Figure 5). An admixture analysis of
580 South American and African clinical isolates confirmed that NF54 and NF166.C8 both
581 have the genomic background characteristic of West Africa, while 7G8 is clearly a South
582 American strain (Figure S13).
583 NF135.C10 was isolated in the early 1990s (14), at a time when resistance to
584 chloroquine and sulfadoxine-pyrimethamine resistance was entrenched and resistance
585 to mefloquine was emerging (80,81), and carries signals from this period of drug pressure.
586 Four copies of MDR-1 were identified in NF135.C10 (Table S9); however, two of these
587 copies appeared to have premature stop codons introduced by SNPs and/or indels,
588 leaving potentially only two functional copies in the genome. While NF135.C10 also had
589 numerous point mutations relative to 3D7 in genes such as PfCRT (conveying
590 chloroquine resistance), and PfDHPS and PfDHR (conveying sulfadoxine-pyrimethamine
591 resistance), NF135.C10 was isolated before the widespread deployment of artemisinin-
592 based combination therapies (ACTs), and had the wild-type allele in the locus that
593 encodes the kelch protein in chromosome 13 (PfK13), with no mutations known to convey
594 artemisinin resistance detected in the propeller region (Table S10).
595 The emergence in Southeast Asia of resistance to antimalarial drugs, including
596 artemisinins and drugs used in artemisinin-based combination treatments, is thought to
597 underlie the complex and dynamic parasite population structure in the region (82).
598 Several relatively homogeneous subpopulations, whose origin is likely linked to the
25
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
599 emergence and rapid spread of drug resistance mutations, exist in parallel with a sensitive
600 subpopulation that reflects the ancestral population in the region (referred to as KH1),
601 and another subpopulation of admixed genomic background (referred to as KHA),
602 possibly the source of the drug-resistant subpopulations or the result of a secondary mix
603 of resistant subpopulations (61,62,83,84). This has been accompanied by reports of
604 individual K13 mutations conferring artemisinin resistance occurring independently on
605 multiple genomic backgrounds (85). To determine the subpopulation to which NF135.C10
606 belongs, an admixture analysis was conducted using isolates from Southeast Asia and
607 Oceania, including NF135.C10. Eleven total populations were detected, of which seven
608 contained Cambodian isolates (Figure 6). Both admixture and hierarchical clustering
609 analyses suggest that NF135.C10 is representative of the previously described admixed
610 KHA subpopulation (61,62) (Figure 6), implying that NF135.C10 is representative of a
611 long-standing admixed population of parasites in Cambodia rather than one of several
612 subpopulations thought to have arisen more recently following the emergence of drug
613 resistance.
614 Discussion
615 Whole organism sporozoite vaccines have provided variable levels of protection in
616 initial clinical trials; the radiation-attenuated PfSPZ Vaccine has been shown to protect
617 >90% of subjects against homologous CHMI at 3 weeks after the last dose in 5 clinical
618 trials in the U.S. (6,8) and Germany (11). However, efficacy has been lower against
619 heterologous CHMI (8,9), and in field studies in a region of intense transmission, in Mali,
620 at 24 weeks (10). Interestingly, for the exact same immunization regimen, protective
621 efficacy by proportional analysis was greater in the field trial in Mali (29%) than it was
26
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
622 against heterologous CHMI with Pf 7G8 in the US at 24 weeks after last dose of vaccine
623 (8%) (8,10). While evidence shows that whole organism-based vaccine efficacy can be
624 improved by adjusting the vaccine dose and schedule (11), further optimization of such
625 vaccines will be facilitated by a thorough understanding of the genotypic and immunologic
626 differences among the PfSPZ strains and between them and parasites in malaria endemic
627 regions.
628 A recent study examined whole genome short-read sequencing data to
629 characterize NF166.C8 and NF135.C10 through SNP calls, and identified a number of
630 non-synonymous mutations at a few loci potentially important for the efficacy of CPS, the
631 foundation for PfSPZ-CVac (17). The analyses described here, using high-quality de novo
632 genome assemblies, expand the analysis to hard-to-call regions, such as those
633 containing gene families, repeats and other low complexity sequences. The added
634 sensitivity enabled the thorough genomic characterization of these and additional
635 vaccine-related strains, and revealed a considerably higher number of sequence variants
636 than can be called using short read data alone, as well as indels and structural variants
637 between assemblies. For example, the insertion close to the 3’ end of PfAP2-G detected
638 in NF135.C10 and shared by DD2 has not, to the best of our knowledge, been reported
639 before, despite the multiple studies highlighting the importance of this gene in sexual
640 commitment in P. falciparum strains, including DD2 (70). Long-read sequencing also
641 confirmed that differences observed between NF54 and 3D7 in a major liver stage
642 antigen, PfLSA-1, result from one of a small number of errors lingering in the reference
643 3D7 genome, which is being continually updated and improved (34). Confirmation that
644 NF54 and 3D7 are identical at this locus is critical when 3D7 has been used as a
27
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
645 homologous CHMI in whole sporozoite, NF54-based vaccine studies. Furthermore, the
646 comprehensive sequence characterization of variant surface antigen-encoding loci, such
647 as PfEMP1-encoding genes, will enable the use of the PfSPZ strains to study the role of
648 these protein families in virulence, naturally acquired immunity and vaccine-induced
649 protection (86).
650 The comprehensive genetic and genomic studies reported herein were designed
651 to provide insight into the outcome of homologous and heterologous CHMI studies, and
652 to determine whether the CHMI strains can be used as a proxy for strains present in the
653 field. Comparison of genome assemblies confirmed that NF54 and 3D7 have remained
654 genetically very similar over time, and that 3D7 is an appropriate homologous CHMI
655 strain. As expected, 7G8, NF166.C8, and NF135.C10 were genetically very distinct from
656 NF54 and 3D7, with thousands of differences across the genome including dozens in
657 known pre-erythrocytic antigens. The identification of sequence variants (both SNPs and
658 indels) within transcriptional regulators, such as the AP2 family, may assist in the study
659 of different growth phenotypes in these strains. NF166.C8 and NF135.C10 merozoites
660 enter the blood stream several days earlier than those of NF54 (15), suggesting that NF54
661 may develop more slowly in hepatocytes than do the other two strains. Therefore,
662 mutations in genes associated with liver-stage development (as was observed with
663 PfAP2-L) may be of interest to explore further. Finally, comparison of the PfSPZ strains
664 to whole genome sequencing data from clinical isolates shows that, at the whole genome
665 level, they are indeed representative of their geographical regions of origin.
666 These results can assist in the interpretation of CHMI studies in multiple ways.
667 First, in the 42 pre-erythrocytic genes of interest, the Brazilian strain 7G8, one of the first
28
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
668 strains to be used in heterologous CHMI studies, and the Guinean strain NF166.C8 share
669 a similar number of epitopes with NF54. Furthermore, both NF166.C8 and NF135.C10
670 encode more unique non-synonymous SNPs genome-wide (19% and 33%, respectively)
671 and epitopes in these genes relative to NF54 than does 7G8. These results show that the
672 practice of equating geographic distance to genetic differentiation is not always valid, and
673 that the interpretation of CHMI studies should rest upon thorough genome-wide
674 comparisons. Lastly, of all PfSPZ strains, NF135.C10 is the most genetically distinct from
675 NF54, including in point mutations, unique epitopes, and its placement within the global
676 parasite population. We do not know whether the peptide sequences of the actual epitope
677 targets of vaccine-induced protective immunity are equally distinct in NF135.C10.
678 However, if proteome-wide genetic divergence is the primary determinant of differences
679 in protection against different parasites, the extent to which NF54-based immunization
680 protects against CHMI with NF135.C10 is important in understanding the ability of PfSPZ
681 Vaccine and other whole-organism malaria vaccines to protect against diverse parasites
682 present world-wide. These conclusions are drawn from genome-wide analyses and from
683 subsets of genes for which a role in whole-sporozoite-induced protection is suspected but
684 not experimentally established. Conclusive statements regarding cross-protection will
685 require the additional knowledge of the genetic basis of whole-organism vaccine
686 protection.
687 Without more information on the epitope targets of protective immunity induced by
688 PfSPZ vaccines, it is difficult to rationally design multi-strain PfSPZ vaccines. However,
689 these data and analysis approach has the potential to be applied to rationally design multi-
690 strain sporozoite-based vaccines once knowledge of those critical epitope sequences is
29
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
691 available, and characterization of a variety of P. falciparum strains may facilitate the
692 development of region-specific or multi-strain vaccines with greater protective efficacy.
693 Support for a genomics-guided approach to guide such next-generation vaccines can be
694 found in other whole organism parasitic vaccines. Field trials testing the efficacy of first-
695 generation whole killed-parasite vaccines against Leishmania had highly variable results
696 (87). While most studies failed to show protection, indicating that killed, whole-cell
697 vaccines for leishmaniasis may not produce the necessary protective response, a trial
698 demonstrating significant protection utilized a multi-strain vaccine, with strains collected
699 from the immediate area of the trial (88), highlighting the importance of understanding the
700 distribution of genetic diversity in pathogen populations. In addition, a highly efficacious
701 non-attenuated, three-strain, whole organism vaccine exists against Theileria parva, a
702 protozoan parasite that causes East coast fever in cattle. This vaccine, named Muguga
703 Cocktail, consists of a mix of three live strains of T. parva that are administered in an
704 infection-and-treatment method, similar to the approach utilized by PfSPZ-CVac. It has
705 been shown recently that two of the strains are genetically very similar, possibly clones
706 of the same isolates (89). Despite this, the vaccine remains highly efficacious and in high
707 demand (90). In addition, the third vaccine strain in the Muguga Cocktail is quite distinct
708 from the other two, with ~5 SNPs/kb (89), or about twice the SNP density seen between
709 NF54 and other PfSPZ strains. These observations suggest that an efficacious multi-
710 strain vaccine against a highly variable parasite species does not need to be contain a
711 large number of strains, but that the inclusion of highly divergent strains may be
712 warranted. These results also speak to the promise of multi-strain vaccines against highly
713 diverse pathogens, including apicomplexans with large genomes and complex life cycles.
30
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
714 Next-generation whole genome sequencing technology has opened many
715 avenues for infectious disease research and holds great promise for informing vaccine
716 design. While most malaria vaccine development has occurred before the implementation
717 of regular use of whole genome sequencing, the tools now available allow the precise
718 characterization and informed selection of vaccine strains early in the development
719 process. These results will greatly assist in the interpretation of clinical trials using these
720 strains for CHMI.
721 References and Notes: 722 723 1. World Malaria Report 2018. Geneva: World Health Organization; 2018 p. 1–210.
724 2. Bhatt S, Weiss DJ, Cameron E, Bisanzio D, Mappin B, Dalrymple U, et al. The 725 effect of malaria control on Plasmodium falciparum in Africa between 2000 and 726 2015. Nature. 2015 Oct 8;526(7572):207–11.
727 3. Delemarre BJ, van der Kaay HJ. Tropical malaria contracted the natural way in the 728 Netherlands. Ned Tijdschr Geneeskd. 1979 Nov 17;123(46):1981–2.
729 4. Hoffman SL, Billingsley PF, James E, Richman A, Loyevsky M, Li T, et al. 730 Development of a metabolically active, non-replicating sporozoite vaccine to 731 prevent Plasmodium falciparum malaria. Hum Vaccin. 2010 Jan;6(1):97–106.
732 5. Epstein JE, Tewari K, Lyke KE, Sim BKL, Billingsley PF, Laurens MB, et al. Live 733 Attenuated Malaria Vaccine Designed to Protect Through Hepatic CD8+ T Cell 734 Immunity. Science. 2011 Oct 28;334(6055):475–80.
735 6. Seder RA, Chang L-J, Enama ME, Zephir KL, Sarwar UN, Gordon IJ, et al. 736 Protection Against Malaria by Intravenous Immunization with a Nonreplicating 737 Sporozoite Vaccine. Science. 2013 Sep 20;341(6152):1359–65.
738 7. Ishizuka AS, Lyke KE, DeZure A, Berry AA, Richie TL, Mendoza FH, et al. 739 Protection against malaria at 1 year and immune correlates following PfSPZ 740 vaccination. Nat Med. 2016 Jun;22(6):614–23.
741 8. Epstein JE, Paolino KM, Richie TL, Sedegah M, Singer A, Ruben AJ, et al. 742 Protection against Plasmodium falciparum malaria by PfSPZ Vaccine. JCI Insight 743 [Internet]. 2017 Jan 12 [cited 2017 Jan 14];2(1). Available from: 744 http://insight.jci.org/articles/view/89154
31
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
745 9. Lyke KE, Ishizuka AS, Berry AA, Chakravarty S, DeZure A, Enama ME, et al. 746 Attenuated PfSPZ Vaccine induces strain-transcending T cells and durable 747 protection against heterologous controlled human malaria infection. Proc Natl Acad 748 Sci. 2017 Mar 7;114(10):2711–6.
749 10. Sissoko MS, Healy SA, Katile A, Omaswa F, Zaidi I, Gabriel EE, et al. Safety and 750 efficacy of PfSPZ Vaccine against Plasmodium falciparum via direct venous 751 inoculation in healthy malaria-exposed adults in Mali: a randomised, double-blind 752 phase 1 trial. Lancet Infect Dis. 2017 May 1;17(5):498–509.
753 11. Mordmüller B, Surat G, Lagler H, Chakravarty S, Ishizuka AS, Lalremruata A, et al. 754 Sterile protection against human malaria by chemoattenuated PfSPZ vaccine. 755 Nature. 2017 Feb 23;542(7642):445–9.
756 12. Vaughan AM, Sack BK, Dankwa D, Minkah N, Nguyen T, Cardamone H, et al. A 757 Plasmodium Parasite with Complete Late Liver Stage Arrest Protects against 758 Preerythrocytic and Erythrocytic Stage Infection in Mice. Infect Immun. 2018;86(5).
759 13. Mendes AM, Machado M, Gonçalves-Rosa N, Reuling IJ, Foquet L, Marques C, et 760 al. A Plasmodium berghei sporozoite-based vaccination platform against human 761 malaria. NPJ Vaccines [Internet]. 2018 [cited 2019 Jun 25];3. Available from: 762 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6109154/
763 14. Teirlinck AC, Roestenberg M, Vegte-Bolmer M van de, Scholzen A, Heinrichs MJL, 764 Siebelink-Stoter R, et al. NF135.C10: A New Plasmodium falciparum Clone for 765 Controlled Human Malaria Infections. J Infect Dis. 2013 Feb 15;207(4):656–60.
766 15. McCall MBB, Wammes LJ, Langenberg MCC, Gemert G-J van, Walk J, Hermsen 767 CC, et al. Infectivity of Plasmodium falciparum sporozoites determines emerging 768 parasitemia in infected volunteers. Sci Transl Med. 2017 Jun 21;9(395):eaag2490.
769 16. Schats R, Bijker EM, van Gemert G-J, Graumans W, van de Vegte-Bolmer M, van 770 Lieshout L, et al. Heterologous Protection against Malaria after Immunization with 771 Plasmodium falciparum Sporozoites. PLoS ONE. 2015 May 1;10(5):e0124243.
772 17. Walk J, Reuling IJ, Behet MC, Meerstein-Kessel L, Graumans W, van Gemert G-J, 773 et al. Modest heterologous protection after Plasmodium falciparum sporozoite 774 immunization: a double-blind randomized controlled clinical trial. BMC Med. 2017 775 Sep 13;15:168.
776 18. Volkman SK, Hartl DL, Wirth DF, Nielsen KM, Choi M, Batalov S, et al. Excess 777 Polymorphisms in Genes for Membrane Proteins in Plasmodium falciparum. 778 Science. 2002 Oct 4;298(5591):216–8.
779 19. Thera MA, Doumbo OK, Coulibaly D, Laurens MB, Ouattara A, Kone AK, et al. A 780 Field Trial to Assess a Blood-Stage Malaria Vaccine. N Engl J Med. 2011 Sep 781 15;365(11):1004–13.
32
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
782 20. Ouattara A, Takala-Harrison S, Thera MA, Coulibaly D, Niangaly A, Saye R, et al. 783 Molecular Basis of Allele-Specific Efficacy of a Blood-Stage Malaria Vaccine: 784 Vaccine Development Implications. J Infect Dis. 2013 Feb;207(3):511–9.
785 21. Neafsey DE, Juraska M, Bedford T, Benkeser D, Valim C, Griggs A, et al. Genetic 786 Diversity and Protective Efficacy of the RTS,S/AS01 Malaria Vaccine. N Engl J 787 Med. 2015 Oct 21;0(0):null.
788 22. Takala SL, Plowe CV. Genetic diversity and malaria vaccine design, testing, and 789 efficacy: Preventing and overcoming “vaccine resistant malaria.” Parasite Immunol. 790 2009 Sep;31(9):560–73.
791 23. Barry AE, Arnott A. Strategies for Designing and Monitoring Malaria Vaccines 792 Targeting Diverse Antigens. Front Immunol [Internet]. 2014 Jul 28;5. Available 793 from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112938/
794 24. Preston MD, Campino S, Assefa SA, Echeverry DF, Ocholla H, Amambua-Ngwa A, 795 et al. A barcode of organellar genome polymorphisms identifies the geographic 796 origin of Plasmodium falciparum strains. Nat Commun [Internet]. 2014 Jun 13 [cited 797 2015 Jun 22];5. Available from: 798 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4082634/
799 25. Drakeley CJ, Duraisingh MT, Póvoa M, Conway DJ, Targett GAT, Baker DA. 800 Geographical distribution of a variant epitope of Pfs4845, a Plasmodium falciparum 801 transmission-blocking vaccine candidate. Mol Biochem Parasitol. 1996 Oct 802 30;81(2):253–7.
803 26. Walliker D, Quakyi IA, Wellems TE, McCutchan TF, Szarfman A, London WT, et al. 804 Genetic Analysis of the Human Malaria Parasite Plasmodium falciparum. Science. 805 1987 Jun 26;236(4809):1661–6.
806 27. Delves MJ, Straschil U, Ruecker A, Miguel-Blanco C, Marques S, Dufour AC, et al. 807 Routine in vitro culture of P. falciparum gametocytes to evaluate novel 808 transmission-blocking interventions. Nat Protoc. 2016 Sep;11(9):1668–80.
809 28. Schofield L, Villaquiran J, Ferreira A, Schellekens H, Nussenzweig R, 810 Nussenzweig V. γ Interferon, CD8+ T cells and antibodies required for immunity to 811 malaria sporozoites. Nature. 1987 Dec 23;330(6149):664–6.
812 29. Weiss WR, Sedegah M, Beaudoin RL, Miller LH, Good MF. CD8+ T cells 813 (cytotoxic/suppressors) are required for protection in mice immunized with malaria 814 sporozoites. Proc Natl Acad Sci. 1988 Jan 1;85(2):573–6.
815 30. Chattopadhyay R, Pratt D. Role of controlled human malaria infection (CHMI) in 816 malaria vaccine development: A U.S. food & drug administration (FDA) 817 perspective. Vaccine. 2017 May 15;35(21):2767–9.
33
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
818 31. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome 819 sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002 Oct 820 3;419(6906):498–511.
821 32. Vembar SS, Seetin M, Lambert C, Nattestad M, Schatz MC, Baybayan P, et al. 822 Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum 823 genome through long-read (>11 kb), single molecule, real-time sequencing. DNA 824 Res. 2016 Jun 26;dsw022.
825 33. Otto TD, Böhme U, Sanders M, Reid A, Bruske EI, Duffy CW, et al. Long read 826 assemblies of geographically dispersed Plasmodium falciparum isolates reveal 827 highly structured subtelomeres. Wellcome Open Res [Internet]. 2018 May 3 [cited 828 2018 Jul 16];3. Available from: 829 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5964635/
830 34. Böhme U, Otto TD, Sanders M, Newbold CI, Berriman M. Progression of the 831 canonical reference malaria parasite genome from 2002–2019. Wellcome Open 832 Res [Internet]. 2019 Mar 29 [cited 2019 Jun 1];4. Available from: 833 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6484455/
834 35. Yeda R, Ingasia LA, Cheruiyot AC, Okudo C, Chebon LJ, Cheruiyot J, et al. The 835 Genotypic and Phenotypic Stability of Plasmodium falciparum Field Isolates in 836 Continuous In Vitro Culture. PLoS ONE. 2016 Jan 11;11(1):e0143565.
837 36. Oyola SO, Ariani CV, Hamilton WL, Kekre M, Amenga-Etego LN, Ghansah A, et al. 838 Whole genome sequencing of Plasmodium falciparum from dried blood spots using 839 selective whole genome amplification. Malar J. 2016;15:597.
840 37. Agrawal S, Moser KA, Morton L, Cummings MP, Parihar A, Dwivedi A, et al. 841 Association of a Novel Mutation in the Plasmodium falciparum Chloroquine 842 Resistance Transporter With Decreased Piperaquine Sensitivity. J Infect Dis. 2017 843 Aug 15;216(4):468–76.
844 38. Trager W, Jensen JB. Human malaria parasites in continuous culture. Science. 845 1976 Aug 20;193(4254):673–5.
846 39. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: 847 scalable and accurate long-read assembly via adaptive k-mer weighting and repeat 848 separation. Genome Res. 2017 Mar 15;gr.215087.116.
849 40. Chin C-S, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. 850 Nonhybrid, finished microbial genome assemblies from long-read SMRT 851 sequencing data. Nat Methods. 2013 Jun;10(6):563–9.
852 41. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An 853 Integrated Tool for Comprehensive Microbial Variant Detection and Genome 854 Assembly Improvement. PLoS ONE. 2014 Nov 19;9(11):e112963.
34
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
855 42. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL. 856 Alignment of whole genomes. Nucleic Acids Res. 1999 Jan 1;27(11):2369–76.
857 43. Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of 858 variants from an assembly. Bioinformatics. 2016 Oct 1;32(19):3021–3.
859 44. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, Haeseler A, et al. 860 Accurate detection of complex structural variations using single-molecule 861 sequencing. Nat Methods. 2018 Apr 30;1.
862 45. Nattestad M, Chin C-S, Schatz MC. Ribbon: Visualizing complex genome 863 alignments and structural variation. bioRxiv. 2016 Oct 20;082123.
864 46. Dara A, Drábek EF, Travassos MA, Moser KA, Delcher AL, Su Q, et al. New var 865 reconstruction algorithm exposes high var sequence diversity in a single 866 geographic location in Mali. Genome Med. 2017;9:30.
867 47. Rask TS, Hansen DA, Theander TG, Pedersen AG, Lavstsen T. Plasmodium 868 falciparum Erythrocyte Membrane Protein 1 Diversity in Seven Genomes – Divide 869 and Conquer. PLOS Comput Biol. 2010 Sep 16;6(9):e1000933.
870 48. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat 871 Methods. 2012 Apr;9(4):357–9.
872 49. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. 873 SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell 874 Sequencing. J Comput Biol. 2012 May;19(5):455–77.
875 50. Nielsen M, Andreatta M. NetMHCpan-3.0; improved prediction of binding to MHC 876 class I molecules integrating information from multiple receptor and peptide length 877 datasets. Genome Med. 2016 Mar 30;8:33.
878 51. González-Galarza FF, Takeshita LYC, Santos EJM, Kempson F, Maia MHT, Silva 879 ALS da, et al. Allele frequency net 2015 update: new features for HLA epitopes, 880 KIR and disease and HLA adverse drug reaction associations. Nucleic Acids Res. 881 2015 Jan 28;43(D1):D784–8.
882 52. de Pablo R, García-Pacheco J m., Vilches C, Moreno M e., Sanz L, Rementería M 883 c., et al. HLA class I and class II allele distribution in the Bubi population from the 884 island of Bioko (Equatorial Guinea). Tissue Antigens. 1997 Dec 1;50(6):593–601.
885 53. Cao K, Moormann A m., Lyke K e., Masaberg C, Sumba O p., Doumbo O k., et al. 886 Differentiation between African populations is evidenced by the diversity of alleles 887 and haplotypes of HLA class I loci. Tissue Antigens. 2004 Apr 1;63(4):293–325.
888 54. Lasonder E, Janse CJ, Gemert G-J van, Mair GR, Vermunt AMW, Douradinha BG, 889 et al. Proteomic Profiling of Plasmodium Sporozoite Maturation Identifies New
35
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
890 Proteins Essential for Parasite Development and Infectivity. PLOS Pathog. 2008 891 Oct 31;4(10):e1000195.
892 55. Lindner SE, Swearingen KE, Harupa A, Vaughan AM, Sinnis P, Moritz RL, et al. 893 Total and Putative Surface Proteomics of Malaria Parasite Salivary Gland 894 Sporozoites. Mol Cell Proteomics. 2013 May 1;12(5):1127–43.
895 56. Felgner P, Hoffman SL, Seder R. Serum antibody assay for determining protection 896 from malaria, and pre-erythrocytic subunit vaccines [Internet]. US20160216276A1, 897 2016 [cited 2019 May 20]. Available from: 898 https://patents.google.com/patent/US20160216276A1/en
899 57. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The 900 Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation 901 DNA sequencing data. Genome Res. 2010 Sep 1;20(9):1297–303.
902 58. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A 903 framework for variation discovery and genotyping using next-generation DNA 904 sequencing data. Nat Genet. 2011 May;43(5):491–8.
905 59. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine 906 A, et al. From FastQ data to high confidence variant calls: the Genome Analysis 907 Toolkit best practices pipeline. Curr Protoc Bioinforma Ed Board Andreas 908 Baxevanis Al. 2013 Oct 15;11(1110):11.10.1-11.10.33.
909 60. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in 910 unrelated individuals. Genome Res. 2009 Sep 1;19(9):1655–64.
911 61. Miotto O, Almagro-Garcia J, Manske M, MacInnis B, Campino S, Rockett KA, et al. 912 Multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. 913 Nat Genet. 2013 Jun;45(6):648–55.
914 62. Dwivedi A, Reynes C, Kuehn A, Roche DB, Khim N, Hebrard M, et al. Functional 915 analysis of Plasmodium falciparum subpopulations associated with artemisinin 916 resistance in Cambodia. Malar J. 2017 Dec 19;16:493.
917 63. Carneiro MO, Russ C, Ross MG, Gabriel SB, Nusbaum C, DePristo MA. Pacific 918 biosciences sequencing technology for genotyping and variation discovery in 919 human data. BMC Genomics. 2012 Aug 5;13:375.
920 64. Zanghì G, Vembar SS, Baumgarten S, Ding S, Guizetti J, Bryant JM, et al. A 921 Specific PfEMP1 Is Expressed in P. falciparum Sporozoites and Plays a Role in 922 Hepatocyte Infection. Cell Rep. 2018 Mar 13;22(11):2951–63.
923 65. Mu J, Myers RA, Jiang H, Liu S, Ricklefs S, Waisberg M, et al. Plasmodium 924 falciparum genome-wide scans for positive selection, recombination hot spots and 925 resistance to antimalarial drugs. Nat Genet. 2010 Mar;42(3):268–71.
36
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
926 66. Guerin-Marchand C, Druilhe P, Galey B, Londono A, Patarapotikul J, Beaudoin RL, 927 et al. A liver-stage-specific antigen of Plasmodium falciparum characterized by 928 gene cloning. Nature. 1987 Sep 10;329(6135):164–7.
929 67. Fidock DA, Gras-Masse H, Lepers JP, Brahimi K, Benmohamed L, Mellouk S, et al. 930 Plasmodium falciparum liver stage antigen-1 is well conserved and contains potent 931 B and T cell determinants. J Immunol. 1994 Jul 1;153(1):190–204.
932 68. Claessens A, Affara M, Assefa SA, Kwiatkowski DP, Conway DJ. Culture 933 adaptation of malaria parasites selects for convergent loss-of-function mutants. Sci 934 Rep. 2017 Jan 24;7:41303.
935 69. Gebru T, Lalremruata A, Kremsner PG, Mordmüller B, Held J. Life-span of in vitro 936 differentiated Plasmodium falciparum gametocytes. Malar J. 2017 Aug 11;16:330.
937 70. Kafsack BFC, Rovira-Graells N, Clark TG, Bancells C, Crowley VM, Campino SG, 938 et al. A transcriptional switch underlies commitment to sexual development in 939 malaria parasites. Nature. 2014 Mar 13;507(7491):248–52.
940 71. Iwanaga S, Kaneko I, Kato T, Yuda M. Identification of an AP2-family Protein That 941 Is Critical for Malaria Liver Stage Development. PLOS ONE. 2012 Nov 942 7;7(11):e47557.
943 72. Ikadai H, Saliba KS, Kanzok SM, McLean KJ, Tanaka TQ, Cao J, et al. Transposon 944 mutagenesis identifies genes essential for Plasmodium falciparum 945 gametocytogenesis. Proc Natl Acad Sci. 2013 Apr 30;110(18):E1676–84.
946 73. Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, et al. Indels, 947 structural variation, and recombination drive genomic diversity in Plasmodium 948 falciparum. Genome Res. 2016 Sep 1;26(9):1288–99.
949 74. Doolan DL, Hoffman SL. The Complexity of Protective Immunity Against Liver- 950 Stage Malaria. J Immunol. 2000 Aug 1;165(3):1453–62.
951 75. Kisalu NK, Idris AH, Weidle C, Flores-Garcia Y, Flynn BJ, Sack BK, et al. A human 952 monoclonal antibody prevents malaria infection by targeting a new site of 953 vulnerability on the parasite. Nat Med. 2018 Apr;24(4):408–16.
954 76. Tan J, Sack BK, Oyen D, Zenklusen I, Piccoli L, Barbieri S, et al. A public antibody 955 lineage that potently inhibits malaria infection by dual binding to the 956 circumsporozoite protein. Nat Med. 2018 May;24(4):401–7.
957 77. Doolan DL, Saul AJ, Good MF. Geographically restricted heterogeneity of the 958 Plasmodium falciparum circumsporozoite protein: relevance for vaccine 959 development. Infect Immun. 1992 Feb;60(2):675–82.
37
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
960 78. Manske M, Miotto O, Campino S, Auburn S, Almagro-Garcia J, Maslen G, et al. 961 Analysis of Plasmodium falciparum diversity in natural infections by deep 962 sequencing. Nature. 2012 Jul 19;487(7407):375–9.
963 79. Yalcindag E, Elguero E, Arnathau C, Durand P, Akiana J, Anderson TJ, et al. 964 Multiple independent introductions of Plasmodium falciparum in South America. 965 Proc Natl Acad Sci. 2012 Jan 10;109(2):511–6.
966 80. Wongsrichanalai C, Pickard AL, Wernsdorfer WH, Meshnick SR. Epidemiology of 967 drug-resistant malaria. Lancet Infect Dis. 2002 Apr 1;2(4):209–18.
968 81. Plowe CV. The evolution of drug-resistant malaria. Trans R Soc Trop Med Hyg. 969 2009 Apr 1;103(Supplement 1):S11–4.
970 82. Parobek CM, Parr JB, Brazeau NF, Lon C, Chaorattanakawee S, Gosi P, et al. 971 Partner-Drug Resistance and Population Substructuring of Artemisinin-Resistant 972 Plasmodium falciparum in Cambodia. Genome Biol Evol. 2017 Jun 1;9(6):1673–86.
973 83. MalariaGEN Plasmodium falciparum Community (last). Genomic epidemiology of 974 artemisinin resistant malaria. eLife. 2016 Mar 4;5:e08714.
975 84. Cerqueira GC, Cheeseman IH, Schaffner SF, Nair S, McDew-White M, Phyo AP, et 976 al. Longitudinal genomic surveillance of Plasmodium falciparum malaria parasites 977 reveals complex genomic architecture of emerging artemisinin resistance. Genome 978 Biol. 2017;18:78.
979 85. Takala-Harrison S, Jacob CG, Arze C, Cummings MP, Silva JC, Dondorp AM, et 980 al. Independent Emergence of Artemisinin Resistance Mutations Among 981 Plasmodium falciparum in Southeast Asia. J Infect Dis. 2015 Mar 1;211(5):670–9.
982 86. Kraemer SM, Smith JD. A family affair: var genes, PfEMP1 binding, and malaria 983 disease. Curr Opin Microbiol. 2006 Aug 1;9(4):374–80.
984 87. Noazin S, Modabber F, Khamesipour A, Smith PG, Moulton LH, Nasseri K, et al. 985 First generation leishmaniasis vaccines: A review of field efficacy trials. Vaccine. 986 2008 Dec 9;26(52):6759–67.
987 88. Armijos RX, Weigel MM, Aviles H, Maldonado R, Racines J. Field Trial of a 988 Vaccine against New World Cutaneous Leishmaniasis in an At-Risk Child 989 Population: Safety, Immunogenicity, and Efficacy during the First 12 Months of 990 Follow-Up. J Infect Dis. 1998 May 1;177(5):1352–7.
991 89. Norling M, Bishop RP, Pelle R, Qi W, Henson S, Drábek EF, et al. The genomes of 992 three stocks comprising the most widely utilized live sporozoite Theileria parva 993 vaccine exhibit very different degrees and patterns of sequence divergence. BMC 994 Genomics. 2015;16:729.
38
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
995 90. Nene V, Kiara H, Lacasta A, Pelle R, Svitek N, Steinaa L. The biology of Theileria 996 parva and control of East Coast fever - Current status and future trends. Ticks Tick- 997 Borne Dis. 2016;7(4):549–64.
998 999 1000 Declarations
1001 Ethics approval and consent to participate: Not applicable
1002 Consent for publication: Not applicable
1003 Availability of data and material: Raw sequencing data and assemblies generated for
1004 this manuscript have been uploaded to NCBI. See Dataset 1 for BioSample
1005 identification numbers for assemblies and whole genome sequence data for all samples
1006 used in these analyses, including publicly available data.
1007 Competing interests: T.L., B.K.L.S., and S.L.H. are salaried employees of Sanaria
1008 Inc., the developer and owner of PfSPZ Vaccine, and the supplier of the PfSPZ
1009 strains sequenced in this study. In addition, S.L.H. and B.K.L.S. have a financial
1010 interest in Sanaria Inc. All other authors declare no competing financial interests.
1011 Funding: National Institutes of Health (NIH) awards U19 AI110820 and R01 AI141900,
1012 and a “Dean’s Challenge Award”, an intramural program from the University of Maryland
1013 School of Medicine, provided funds to sequence the PfSPZ strains and Pf clinical
1014 isolates, and support for K.A.M., E.F.D., A.D., M.A., A.O., K.L., L.S., L.T., M.T., S.T.H.,
1015 C.M.F., C.V.P. and J.C.S.. A.O. and C.V.P. were also supported by the Howard Hughes
1016 Medical Institute. The parasites provided by Sanaria were produced in part by funding
1017 from two SBIR grants from National Institute of Allergy and Infectious Diseases, NIH,
1018 2R44AI058375 and 5R44AI055229-09A1. Collection of clinical isolates that were
1019 sequenced for this study were funded by U19 AI089683, grants from CNPq
1020 (590106/2011-2), grants from the Armed Forces Health Surveillance Center/Global
39
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
1021 Emerging Infections Surveillance and Response System (WR1576 and WR2017), the
1022 NIH International Centers of Excellence in Malaria Research program (U19 AI089681)
1023 (Brazil), and the Howard Hughes Medical Institute. S.K. and A.M.P. were supported by
1024 the Intramural Research Program of the National Human Genome Research Institute,
1025 NIH (1ZIAHG200398). P.T.R. was supported by a scholarship from the Conselho
1026 Nacional de Desenvolvimento Científico e Tecnológico (CNPq), which also provided a
1027 senior researcher scholarship to M.U.F.
1028 Authors' contributions: K.A.M. designed the study, performed analyses, and wrote the
1029 paper. J.C.S., C.V.P., and S.L.H. designed the study, contributed samples and funding,
1030 and wrote the paper. E.F.D., A.D., J.C., E.M.S., A.D., S.K., A.M.P, and A.O. contributed
1031 analyses. Z.S., M.A., T.L., L.S., L.T. preformed sample preparation and sequencing.
1032 P.T.R., K.E.L., M.D.S., K.J., C.L., D.L.S., M.U.F., M.N., M.K.L., M.T., R.W.S., S.T.H.,
1033 C.M.F, and B.K.L.S contributed samples, funding, and wrote the paper.
1034 Acknowledgements: We would like to thank the study teams that assisted in sample
1035 collections in Mali, Malawi, Cambodia, Thailand, and Myanmar. We would like to
1036 acknowledge Drs. Terrie Taylor and Don Mathanga for their assistance with collection of
1037 clinical isolates in Malawi.
1038 Authors' information (optional)
1039 Additional author information for M. D. S, K. J., C. L., D. L. S (affiliated with the Armed
1040 Forces Research Institute of Medical Sciences): Material has been reviewed by the
1041 Walter Reed Army Institute of Research. There is no objection to its presentation
1042 and/or publication. The opinions or assertions contained herein are the private views of
1043 the author, and are not to be construed as official, or as reflecting true views of the
40
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
1044 Department of the Army or the Department of Defense. The investigators have adhered
1045 to the policies for protection of human subjects as prescribed in AR 70–25.
1046 Tables
1047 Table 1: The PfSPZ strains differ from 3D7 in terms of genome size and base-pair 1048 content. Characteristics of the PacBio assembly for each strain (first four columns), 1049 with the 3D7 reference genome are shown for comparison. Single-nucleotide 1050 polymorphisms (SNPs) and indels in each PfSPZ assembly as compared to 3D7 are 1051 shown for coding and non-coding regions (last five columns). # of SNPs4 Indels Cumulative Strain Nuclear N503 Non- Non- Non- Length2 Syn. Coding Contigs1 coding Syn. coding 3D7 14 23,332,831 1,687,656 - - - - - NF54 28 23,426,028 1,527,116 1,278 25 80 8,166 230 7G8 20 22,801,099 1,455,033 22,694 5,675 15,490 40,668 6,730 NF166.C8 30 23,306,751 1,610,926 27,117 6,636 17,258 40,457 6,802 NF135.C10 19 23,540,633 1,631,396 28,791 6,535 18,141 41,435 7,029 1052 1Contigs: number of pieces of continuous sequence in the final assembly. 1053 2Cumulative length: total length of the contigs. 1054 3N50: length of the contig which, along with all contigs larger than itself, contain 50% of the assembly 1055 (larger numbers indicate a more complete assembly). 1056 4Syn SNPs: synonymous SNPs; Non-Syn SNPs: non-synonymous SNPs. 1057
1058
41
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
1059 Figures
1060 1061 Figure 1: PacBio Assemblies for each PfSPZ strain reconstruct entire 1062 chromosomes in one-three continuous pieces. To determine the likely position of 1063 each non-reference contig on the 3D7 reference genome, mummer’s show-tiling program 1064 was used with relaxed settings (-g 100000 -v 50 -i 50) to align contigs to 3D7 1065 chromosomes (top). 3D7 nuclear chromosomes (1-14) are shown in grey, arranged from 1066 smallest to largest, along with organelle genomes (M=mitochondrion, A=apicoplast). 1067 Contigs from each PfSPZ assembly (NF54: black, 7G8: green, NF166.C8: orange, 1068 NF135.C10: hot pink) are shown aligned to their best 3D7 match. A small number of 1069 contigs could not be unambiguously mapped to the 3D7 reference genome (unmapped). 1070
1071
1072
1073
1074
42
bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
1075
1076 Figure 2: Distribution of polymorphisms in PfSPZ PacBio assemblies. Single 1077 nucleotide polymorphism (SNP) densities (log SNPs/ 10kb) are shown for each 1078 assembly; the scale [0-3] refers to the range of the log-scaled SNP density graphs - 1079 from 100 to 103. Inner tracks, from outside to inside, are NF54 (black), 7G8 (green), 1080 NF166.C8 (orange), and NF135.C10 (hot pink). The outermost tracks are the 3D7 1081 reference genome nuclear chromosomes (chrm1 to chrm 14, in blue), followed by 3D7 1082 genes on the forward and reverse strand (black tick marks).
43 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Figure 3: Predicted CD8+ T cell epitopes from amino acid sequences of the four PfSPZ strains. Using amino acid sequences from a subset of the 5,548 protein-coding genes in the 3D7 annotation (PlasmoDBv24) that were deemed to be correctly annotated between all four PfSPZ strains, CD8+ T cell epitopes were predicted in silico using netMHCpan. A. Density plot of amino acid sequence length distributions of 4,814 genes used for in silico epitope predictions. Overall distribution of gene lengths was extremely similar between all four strains, with the exception of the three heterologous CHMI strains having a slightly higher number of amino acid sequences that weren’t quite as long as they were in the NF54 genome, possibly a reflection of difficulty in mapping annotations to genetically diverse genomes. B. Shared and unique epitope- gene combinations in the 4,814 genes. C. Shared and unique epitope-gene combinations in a subset of the 4,814 genes (n=1,974) that have evidence of being expressed in the sporozoite stage of the parasite in the mosquito salivary glands.
44 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Figure 4: Predicted CD8+ T cell epitopes in the P. falciparum circumsporozoite protein (PfCSP). Protein domain information based on the 3D7 reference sequence of PfCSP is found in the first track, and the following tracks are epitopes predicted in the PfCSP sequences of NF54, 7G8, NF166.C8, and NF135.C10, respectively. Each box in the track is a sequence that was identified as an epitope, and the colors of each box represent the HLA type that identified the epitope.
45 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Figure 5: Global diversity of clinical isolates and PfSPZ strains. Principal coordinates analyses (PCoA) of clinical isolates (n=654) from malaria endemic regions and PfSPZ strains were conducted using biallelic non-synonymous SNPs across the entire genome (left, n=31,761) and in a panel of 42 pre-erythrocytic genes of interest (right, n=1,060). For the genome-wide dataset, coordinate 1 separated South American and African isolates from Southeast Asian and Papua New Guinean isolates (27.6% of variation explained), coordinate 2 separated African isolates from South American isolates (10.7%), and coordinate 3 separated Southeast Asian isolates from Papua New Guinea (PNG) isolates (3.0%). Similar trends were found for the first two coordinates seen for the pre-erythrocytic gene data set (27.1 and 12.6%, respectively), but coordinate 3 separated isolates from all three regions (3.8%). In both datasets, NF54 (black cross) and NF166.C8 (orange cross) cluster with West African isolates (isolates labeled in red and dark orange colors), 7G8 (bright green cross) cluster with isolates from South America (greens and browns), and NF135.C10 (hot pink cross) clusters with isolates from Southeast Asia (purples and blues).
46 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
.
Figure 6: NF135.C10 is part of an admixed population of clinical isolates from Southeast Asia. Top: admixture plots for clinical isolates from Myanmar (n=16), Thailand (n=34), Cambodia (n=109), Papua New Guinea (PNG, n=34) and NF135.C10 are shown. Each sample is a column, and the height of the different colors in each column corresponds to the proportion of the genome assigned to each K population by the model. Bottom: hierarchical clustering of the Southeast Asian isolates used in the admixture analysis (n=193; colored by their assigned subpopulation) and previously characterized Cambodian isolates (n=167; black; labels correspond to clusters identified in reference 61) place NF135.C10 (star) with samples from the previously identified KHA admixed population (shown in grey dashed box). Each node ends with a sample, and the y-axis represents distance between clusters.
47