bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1 Title: Strains used in whole organism Plasmodium falciparum vaccine trials differ in 2 genome structure, sequence, and immunogenic potential 3 4 Authors: Kara A. Moser1‡, Elliott F. Drábek1, Ankit Dwivedi1, Jonathan Crabtree1, Emily 5 M. Stucke2, Antoine Dara2, Zalak Shah2, Matthew Adams2, Tao Li3, Priscila T. 6 Rodrigues4, Sergey Koren5, Adam M. Phillippy5, Amed Ouattara2, Kirsten E. Lyke2, Lisa 7 Sadzewicz1, Luke J. Tallon1, Michele D. Spring6, Krisada Jongsakul6, Chanthap Lon6, 8 David L. Saunders6¶, Marcelo U. Ferreira4, Myaing M. Nyunt2§, Miriam K. Laufer2, Mark 9 A. Travassos2, Robert W. Sauerwein7, Shannon Takala-Harrison2, Claire M. Fraser1, B. 10 Kim Lee Sim3, Stephen L. Hoffman3, Christopher V. Plowe2§, Joana C. Silva1,8 11 12 Corresponding author: Joana C. Silva, [email protected]

13 Affiliations: 14 1 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, 15 MD, 21201 16 2 Center for Vaccine Development and Global Health, University of Maryland School of 17 Medicine, Baltimore, MD, 21201 18 3 Sanaria, Inc. Rockville, Maryland, 20850 19 4 Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, 20 São Paulo, Brazil 21 5 Genome Informatics Section, Computational and Statistical Genomics Branch, 22 National Research Institute, Bethesda, Maryland, USA 20892 23 6 Armed Forces Research Institute of Medical Sciences, Department of Bacterial and 24 Parasitic Diseases, Bangkok, Thailand 25 7 Department of Medical Microbiology, Radboud University Medical Center, Nijmegen, 26 Netherlands 27 8 Department of Microbiology and Immunology, University of Maryland School of 28 Medicine, Baltimore, MD, 21201 29 30 ‡Current address: Institute for Global Health and Infectious Diseases, University of 31 North Carolina Chapel Hill 32 ¶Current address: Warfighter Expeditionary Medicine and Treatment, US Army Medical 33 Material Development Activity 34 §Current address: Duke Global Health Institute, Duke University, Durham, North 35 Carolina, 27708 36 37 Author emails (in order of author list above): 38 [email protected], [email protected], [email protected], 39 [email protected], [email protected], [email protected], 40 [email protected], [email protected], [email protected], 41 [email protected] , [email protected], [email protected], 42 [email protected], [email protected], 43 [email protected], [email protected], 44 [email protected], [email protected], [email protected], 45 [email protected], [email protected], [email protected],

1

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

46 [email protected] , [email protected], 47 [email protected], [email protected], 48 [email protected], [email protected], [email protected], 49 [email protected], [email protected] 50

51

52

53 Abstract

54 Background: Plasmodium falciparum (Pf) whole-organism sporozoite vaccines have

55 provided excellent protection against controlled human malaria infection (CHMI) and

56 naturally transmitted heterogeneous Pf in the field. Initial CHMI studies showed

57 significantly higher durable protection against homologous than heterologous strains,

58 suggesting the presence of strain-specific vaccine-induced protection. However,

59 interpretation of these results and understanding of their relevance to vaccine efficacy

60 (VE) have been hampered by the lack of knowledge on genetic differences between

61 vaccine and CHMI strains, and how these strains are related to parasites in malaria

62 endemic regions.

63 Methods: Whole genome sequencing using long-read (Pacific Biosciences) and short-

64 read (Illumina) sequencing platforms was conducted to generate de novo genome

65 assemblies for the vaccine strain, NF54, and for strains used in heterologous CHMI (7G8

66 from Brazil, NF166.C8 from Guinea, and NF135.C10 from Cambodia). The assemblies

67 were used to characterize sequence polymorphisms and structural variants in each strain

68 relative to the reference Pf 3D7 (a clone of NF54) genome. Strains were compared to

69 each other and to clinical isolates from South America, Sub-Saharan Africa, and

70 Southeast Asia.

2

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

71 Results: While few variants were detected between 3D7 and NF54, we identified tens of

72 thousands of variants between NF54 and the three heterologous strains both genome-

73 wide and within regulatory and immunologically important regions, including in pre-

74 erythrocytic antigens that may be key for sporozoite vaccine-induced protection.

75 Additionally, these variants directly contribute to diversity in immunologically important

76 regions of the genomes as detected through in silico CD8+ T cell epitope predictions. Of

77 all heterologous strains, NF135.C10 consistently had the highest number of unique

78 predicted epitope sequences when compared to NF54, while NF166.C8 had the lowest.

79 Comparison to global clinical isolates revealed that these four strains are representative

80 of their geographic region of origin despite long-term culture adaptation; of note,

81 NF135.C10 is from an admixed population, and not part of recently-formed drug resistant

82 subpopulations present in the Greater Mekong Sub-region.

83 Conclusions: These results are assisting the interpretation of VE of whole-organism

84 vaccines against homologous and heterologous CHMI, and may be useful in informing

85 the choice of strains for inclusion in region-specific or multi-strain vaccines.

86

87 Keywords: P. falciparum, malaria, genome assembly, PfSPZ Vaccine, whole-sporozoite

88 vaccine

89

90

3

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

91 Main Text

92 Introduction

93 The flattening levels of mortality and morbidity due to malaria in recent years (1),

94 which follow a decade in which malaria mortality was cut in half (2), highlight the pressing

95 need for new tools to control the disease. A highly efficacious vaccine against

96 Plasmodium falciparum, the deadliest malaria parasite, would be a critical development

97 for control and elimination efforts. Several variations of a highly promising pre-

98 erythrocytic, whole organism malaria vaccine based on P. falciparum sporozoites (PfSPZ)

99 are under development, all based on the same P. falciparum strain, NF54 (3), thought to

100 be of West African origin, and which use different mechanisms for attenuation of PfSPZ.

101 Of these vaccine candidates, Sanaria® PfSPZ Vaccine, based on radiation-attenuated

102 sporozoites, has progressed furthest in clinical trial testing (4–10). Other whole-organism

103 vaccine candidates, including chemo-attenuated (Sanaria® PfSPZ-CVac), transgenic

104 and genetically attenuated sporozoites, are in earlier stages of development (11–13).

105 PfSPZ Vaccine showed 100% short-term protection against homologous

106 controlled human malaria infection (CHMI) in an initial phase 1 clinical trial (6), and

107 subsequent trials have confirmed that high levels of protection can be achieved against

108 both short-term (8) and long-term (7) homologous CHMI. However, depending on the

109 immunization regimen, sterile protection can be significantly lower (8-83%) against

110 heterologous CHMI using the 7G8 Brazilian clone (8,9), and against infection in malaria-

111 endemic regions with intense seasonal malaria transmission (29% and 52% by

112 proportional and time to event analysis, respectively) (10). Heterologous CHMI in

113 chemoprophylaxis with sporozoite (CPS) studies, in which immunization is by infected

4

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

114 mosquito bite of individuals undergoing malaria chemoprophylaxis, have been conducted

115 with NF135.C10 from Cambodia (14) and NF166.C8 from Guinea (15), and have had

116 lower efficacy than against homologous CHMI (16,17). One explanation for the lower

117 efficacy seen against heterologous P. falciparum strains is the extensive genetic diversity

118 in this parasite species, which is particularly high in encoding antigens (18), which

119 combined with low vaccine efficacy against non-vaccine alleles (19–21) reduces overall

120 protective efficacy and complicates the design of broadly efficacious vaccines (22,23).

121 The lack of a detailed genomic characterization of the P. falciparum strains used in CHMI

122 studies, and the unknown genetic basis of the parasite targets of PfSPZ Vaccine- and

123 PfSPZ CVac-induced protection, have precluded a conclusive statement regarding the

124 cause(s) of variable vaccine efficacy outcomes.

125 The current PfSPZ Vaccine strain, NF54, was isolated from a patient in the Netherlands

126 who had never left the country and is considered a case of “airport malaria”; the exact origin of

127 NF54 is unknown (3), but thought to be from Africa (24,25). NF54 is also the isolate from which

128 the P. falciparum 3D7 reference strain was cloned (26), and hence, despite having been

129 separated in culture for over 30 years, NF54 and 3D7 are assumed to be genetically identical,

130 and 3D7 is often used in homologous CHMI (6,8). Several issues hinder the interpretation of both

131 homologous and heterologous CHMI experiments conducted to date. It remains to be confirmed

132 that 3D7 has remained genetically identical to NF54 genome-wide, or that the two are at least

133 identical immunogenically. Indeed, NF54 and 3D7 have several reported phenotypic differences

134 when grown in culture, including the variable ability to produce gametocytes (27). In addition,

135 7G8, NF166.C8, and NF135.C10 have not been rigorously compared to each other or to NF54 to

136 confirm that they are adequate heterologous strains, even though they do appear to have distinct

137 infectivity phenotypes when used as CHMI strains (15). While the entire sporozoite likely offers

138 multiple immunological targets, no high-confidence correlates of protection currently exist. In part

5

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

139 because of the difficulty of studying hepatic parasite forms and their expression profiles in

140 humans, it remains unclear which parasite are recognized by the human immune system

141 during that stage, and elicit protection, upon immunization with PfSPZ vaccines. Both humoral

142 and cell-mediated responses have been associated with protection against homologous CHMI

143 (6,7), although studies in rodents and non-human primates point to a requirement for cell-

144 mediated immunity (specifically through tissue-resident CD8+ T cells) in long-term protection

145 (5,9,28,29). In silico identification of CD8+ epitopes in all strains could highlight critical differences

146 of immunological significance between strains. Finally, heterologous CHMI results cannot be a

147 reliable indicator of efficacy against infection in field settings unless the CHMI strains used are

148 characteristic of the geographic region from which they originate. These issues could impact the

149 use of homologous and heterologous CHMI, and the choice of strains for these studies, to predict

150 the efficacy of PfSPZ-based vaccines in the field (30).

151 These knowledge gaps can be addressed through a rigorous description and

152 comparison of the genome sequence of these strains. High-quality de novo assemblies

153 allow characterization of genome composition and structure, as well as the identification

154 of genetic differences between strains. However, the high AT content and repetitive

155 nature of the P. falciparum genome greatly complicates genome assembly methods (31).

156 Recently, long-read sequencing technologies have been used to overcome some of these

157 assembly challenges, as was shown with assemblies for 3D7, 7G8, and several other

158 culture-adapted P. falciparum strains generated using Pacific Biosciences (PacBio)

159 technology (32–34). However, NF166.C8 and NF135.C10 still lack whole-genome

160 assemblies; in addition, while an assembly for 7G8 is available (33), it is important to

161 characterize the specific 7G8 clone used in heterozygous CHMI, from Sanaria’s working

162 bank, as strains can undergo genetic changes over time in culture (35). Here, reference

6

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

163 assemblies for NF54, 7G8, NF166.C8, and NF135.C10 (hereafter referred to as PfSPZ

164 strains) were generated using approaches to take advantage of the resolution power of

165 long-read sequencing data and the low error rate of short-read sequencing platforms.

166 These de novo assemblies allowed for the thorough genetic and genomic characterization

167 of the PfSPZ strains and will aid in the interpretation of results from CHMI studies.

168 Materials and Methods

169 Study design and samples

170 This study characterized and compared the genomes of four P. falciparum strains

171 used in whole organism malaria vaccines and controlled human malaria infections using

172 a combination of long- and short-read whole genome sequencing platforms (see below).

173 In addition, these strains were compared to P. falciparum clinical isolates collected from

174 patients in malaria-endemic regions globally, using short-read whole genome sequencing

175 data. Genetic material for the four PfSPZ strains were provided by Sanaria, Inc. Clinical

176 P. falciparum isolates from Brazil, Mali, Malawi, Myanmar, Thailand, Laos, and Cambodia

177 were collected between 2009 and 2016 from cross-sectional surveys of malaria burden,

178 longitudinal studies of malaria incidence, and drug efficacy studies done in collaboration

179 with the Division of Malaria Research at the University of Maryland Baltimore, or were

180 otherwise provided by collaborators (Supplemental Dataset 1). All samples met the

181 inclusion criteria of the initial study protocol with prior approval from the local ethical

182 review board. Parasite genomic sequencing and analyses were undertaken after prior

183 approval of the University of Maryland School of Medicine Institutional Review Board.

184 These isolates were obtained by venous blood draws; almost all samples were processed

185 using leukocyte depletion methods to improve the parasite-to-human ratio before

7

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

186 sequencing. The exceptions were samples from Brazil and Malawi, which were not

187 leukocyte depleted upon collection. These samples underwent a selective whole genome

188 amplification step before sequencing, modified from (36) (the main modification being a

189 DNA dilution and filtration step using vacuum filtration prior to selective whole genome

190 amplification). Cambodia isolates had been sequenced for a previous study (37). In

191 addition to these samples, samples for which whole genome short-read sequencing has

192 previously been generated were obtained from NCBI’s Short Read Archive (SRA) to

193 supplement malaria-endemic regions not represented in our data set (Peru, Columbia,

194 French Guiana, Guinea, Papua New Guinea) and regions where PfSPZ trials are ongoing

195 (Burkina Faso, Kenya, and Tanzania) (Supplemental Dataset 1).

196 Whole genome sequencing

197 Genetic material for whole genome sequencing of the PfSPZ strains was

198 generated from a cryovial of each strain’s cell bank with the following identifiers: NF54

199 Working Cell Bank (WCB): SAN02-073009; 7G8 WCB: SAN02-

200 021214; NF135.C10 WCB: SAN07-010410; NF166.C8 Mother Cell Bank: SAN30-

201 020613. Each cryovial was thawed and maintained in human O+ red blood cells

202 (RBCs) at 2% hematocrit (Hct) in complete growth medium (RPMI 1649 with L-glutamine

203 and 25 mM HEPES supplemented with 10% human O+ serum and hypoxanthine) in a 6-

204 well plate in 5% O2, 5% CO2 and 90% N2 at

205 37°C. The cultures were then further expanded by adding fresh RBCs every 3-4

206 days and increased culture hematocrit (Hct) to 5% Hct using a standard method (38).

207 The complete growth medium was replaced daily. When the PfSPZ strain culture volume

208 reached 300-400 mL and a parasitemia of more than 1.5%, the culture suspensions

8

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

209 were collected and the parasitized RBCs were pelleted down by centrifugation at 1800

210 rpm for 5 min. Aliquots of 0.5 mL per cryovial of the parasitized RBCs were stored at -

211 80°C prior to extraction of genomic DNA. Genomic DNA was extracted using the Qiagan

212 Blood DNA Midi Kit (Valencia, CA, USA). Pacific Biosciences (PacBio) sequencing was

213 done for each PfSPZ strain. Total DNA was prepared for PacBio sequencing using the

214 DNA Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA). DNA was fragmented

215 with the Covaris E210, and the fragments were size selected to include those >15 Kbp in

216 length. Libraries were prepared per the manufacturer’s protocol. Four SMRT cells were

217 sequenced per library, using P6C4 chemistry and a 120-minute movie on the PacBio RS

218 II (Pacific Biosystems, Menlo Park, CA).

219 Short-read sequencing was done for each PfSPZ strain and for our collection of

220 clinical isolates using the Illumina HiSeq 2500 or 4000 platforms. Prepared genomic DNA,

221 extracted from cultured parasites, leukocyte-depleted samples, or from samples that

222 underwent sWGA (see above), was used to construct DNA libraries for sequencing on

223 the Illumina platform using the KAPA Library Preparation Kit (Kapa Biosystems, Woburn,

224 MA). DNA was fragmented with the Covaris E210 or E220 to ~200 bp. Libraries were

225 prepared using a modified version of manufacturer’s protocol. The DNA was purified

226 between enzymatic reactions and the size selection of the library was performed with

227 AMPure XT beads (Beckman Coulter Genomics, Danvers, MA). When necessary, a PCR

228 amplification step was performed with primers containing an index sequence of six

229 nucleotides in length. Libraries were assessed for concentration and fragment size using

230 the DNA High Sensitivity Assay on the LabChip GX (Perkin Elmer, Waltham, MA). Library

231 concentrations were also assessed by qPCR using the KAPA Library Quantification Kit

9

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

232 (Complete, Universal) (Kapa Biosystems, Woburn, MA). The libraries were pooled and

233 sequenced on a 100 bp paired-end Illumina HiSeq 2500 or 4000 run (Illumina, San Diego,

234 CA).

235 Assembly generation and characterization of PfSPZ strains

236 Canu (v1.3) (39) was used to correct and assemble the PacBio reads

237 (corMaxEvidenceErate=0.15 for AT-rich genomes, and using default parameters

238 otherwise). To optimize downstream assembly correction processes and parameters, the

239 percentage of total differences (both in bp and by proportion of the 3D7 genome not

240 captured by the NF54 assembly) between the NF54 assembly and the 3D7 reference

241 (PlasmoDBv24) was calculated after each round of correction. Quiver (smrtanalysis v2.3)

242 (40) was run iteratively with default parameters to reach a (stable) maximum reduction in

243 percent differences between the two genomes and the assemblies were further corrected

244 with Illumina data using Pilon (v1.13) (41) with the following parameters: --fixbases, --

245 mindepth 5, --K 85, --minmq 0, and --minqual 35.

246 Assemblies were compared to the 3D7 reference (PlasmoDBv24) using

247 MUMmer’s nucmer (42), and the show-snps function was used to generate a list of SNPs

248 and small (< 50 bp) indels between assemblies. Coding and non-coding variants were

249 classified by comparing the show-snps output with the 3D7 gff3 file using custom scripts.

250 Structural variants, defined as indels, deletions, and tandem or repeat expansion and

251 contractions each greater than 50 bp in length were identified using the nucmer-based

252 Assemblytics tool (43) (unique anchor length: 1 Kbp). Translocations were identified by

253 eye through inspection of mummerplots. Translocations and other large structural

254 variants were confirmed using read-mapping data of the PacBio reads against either the

10

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

255 3D7 reference genome or the respective PfSPZ strain assembly using the NGLMR aligner

256 (44) and visualized using Ribbon (45). Reconstructed exon 1 sequences for var genes,

257 encoding P. falciparum erythrocyte membrane 1 (PfEMP1) antigens, for each

258 PfSPZ strain were recovered using the ETHA package (46). As a check for var exon 1

259 sequences that were missed during the generation of the strain’s assembly, a targeted

260 read capture and assembly approach was done using a strain’s Illumina data, wherein

261 var-like reads for each PfSPZ strains were identified by mapping reads against a

262 database of known var exon 1 sequences (47) using bowtie2 (48). Reads that mapped to

263 a known exon 1 sequence were then assembled with Spades (v3.9.0) (49), and the

264 assembled products were blasted against the PacBio reads to identify if they were truly

265 missed exon 1 sequences by the de novo assembly process, or if instead they were

266 chimeras reconstructed by the targeted assembly process. To identify homologs of the

267 61 var exon 1 sequences found in the 3D7 genome, each PfSPZ strain’s set of exon 1

268 amino acid sequences were compared to the 3D7 set using blastp.

269 In silico MHC I epitope predictions

270 Given the reported importance of CD8+ T cell responses towards immunity to

271 whole sporozoites, MHC class 1 epitopes of length 9 amino acids (9mers) were predicted

272 with NetMHCpan (v3.0) (50) for each PfSPZ strain, using inferred amino acid sequences

273 in each genome. HLA types common to African countries where PfSPZ or PfSPZ-CVac

274 trials are ongoing were used for epitope predictions based on frequencies in the Allele

275 Frequency Net Database (51) or from the literature (52,53) (Table S1). Shared epitopes

276 between NF54 and the three heterologous PfSPZ strains were calculated by first

277 identifying epitopes in each gene, and then removing duplicate epitope sequence entries

11

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

278 (caused by recognition by multiple HLA types). Identical epitope sequences that were

279 identified in two or more genes were treated as distinct epitope entries, and all unique

280 epitope-gene combinations were included when calculating the number of shared

281 epitopes between strains.

282 Epitopes were predicted in several gene sets. The first was all protein-coding

283 genes in which no premature stop codons were detected across all four PfSPZ strains

284 (n=4,814). The second was a subset of the above that are expressed as proteins during

285 the sporozoite stage of the parasite in the mosquito salivary glands, as supported by

286 proteomic data (n=1,974) (54,55). Lastly, a list of pre-erythrocytic genes of interest was

287 generated, either from a review of the literature, or genes whose products were

288 recognized by sera from protected vaccinees of whole organism malaria vaccine trials

289 (both PfSPZ and PfSPZ-CVac) (n=42) (11,56). These gene sequences were inspected

290 manually in each assembly, and corrected for any remaining PacBio sequencing error,

291 particularly one bp indels in homopolymer runs. Even though the 42 genes listed above

292 were detected through antibody responses, many have also been shown to have T cell

293 epitopes, such as CSP and LSA-1.

294 Read mapping and SNP calling

295 For the full collection of clinical isolates that had whole genome short-read

296 sequencing data (generated either at IGS or downloaded from SRA), reads were aligned

297 to the 3D7 reference genome (PlasmoDBv24) using bowtie2 (v2.2.4) (48). Samples with

298 less than 10 million reads mapping to the reference were excluded, as samples with less

299 than this amount had reduced coverage across the genome. Bam files were processed

300 according to GATK’s Best Practices documentation (57–59). Joint SNP calling was done

12

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

301 using Haplotype Caller (v4.0). Because clinical samples may be polyclonal (that is, more

302 than one parasite strain may be present), diploid calls were initially allowed, followed by

303 calling the major allele at positions with heterozygous calls. If the major allele was

304 supported by >70% of reads at a heterozygous position, the major allele was assigned

305 as the allele at that position (otherwise, the genotype was coded as missing). Additional

306 hard filtering was done to remove potential false positives based on the following filter:

307 DP < 12 || QUAL < 50 || FS > 14.5 || MQ < 20, and variants were further filtered to remove

308 those for which the non-reference allele was not present in at least three samples

309 (frequency less than ~0.5%), and those with more than 10% missing genotype values

310 across all samples.

311 Principal Coordinates Analyses and Admixture Analyses

312 A matrix of pairwise genetic distances was constructed from biallelic non-

313 synonymous SNPs identified from the above pipeline (n=31,761) across all samples

314 (n=654) using a custom Python script, and principal coordinates analyses (PCoAs) were

315 done to explore population structure using cmdscale in R. Additional population structure

316 analyses were done using Admixture (v1.3) (60) on two separate data sets: South

317 America and Africa clinical isolates plus NF54, NF166.C8, and 7G8 (n=461), and

318 Southeast Asia and Oceania plus NF135.C10 (n=193). The data sets were additionally

319 pruned for sites in linkage disequilibrium (window size of 20 Kbp, window step of 2 Kbp,

320 R2 0.1). The final South America/Africa and Southeast Asia/Oceania data set used for

321 the admixture analysis consisted of 16,802 and 5,856 SNPs, respectively. The number of

322 populations, K, was tested for values between K=1 to K=15 and run with 10 replicates for

323 each K. For each population, the cross-validation (CV) error from the replicate with the

13

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

324 highest log-likelihood value was plotted, and the K with the lowest CV value was chosen

325 as the final K.

326 To compare subpopulations identified in our Southeast Asia/Oceania admixture

327 analysis with previously described ancestral, resistant, and admixed subpopulations from

328 Cambodia (61), the above non-synonymous SNP set was used before pruning for LD

329 (n=11,943), and was compared to a non-synonymous SNP dataset (n=21,257) from 167

330 samples used by Dwivedi et al. (62) to describe eight Cambodian subpopulations, in an

331 analysis that included a subset of samples used by Miotto et al. (61) (who first

332 characterized the population structure in Cambodia). There were 5,881 shared non-

333 synonymous SNPs between the two datasets, 1,649 of which were observed in

334 NF135.C10. A pairwise genetic distance matrix (estimated as the proportion of base-pair

335 differences between pairs of samples, not including missing genotypes) was generated

336 from the 5,881 shared SNP set, and a dendrogram was built using Ward minimum

337 variance methods in R (Ward.D2 option of the hclust function).

338 Results

339 Generation of assemblies and validation of the assembly protocol

340 To characterize genome-wide structural and genetic diversity of the PfSPZ strains,

341 genome assemblies were generated de novo using whole genome long-read (PacBio)

342 and short-read (Illumina) sequence data (Methods; Table S2; Table S3). Taking

343 advantage of the parent isolate-clone relationship between NF54 and 3D7, we used NF54

344 as a test case to derive the assembly protocol, by adopting, at each step, approaches

345 that minimized the difference to 3D7. The raw assembly of the NF54 genome, while

346 containing 99.9% of the 3D7 genome in 30 contigs, showed ~78K sequence variants

14

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

347 relative to 3D7 (Figure S1). Many of these differences were small (1-3) (bp)

348 indels (insertions or deletions), particularly in AT-rich regions and homopolymer runs,

349 strongly indicative of remaining sequencing errors (63). To maximize error removal with

350 existing tools, Quiver was run iteratively on the NF54 assembly; two consecutive Quiver

351 runs consistently and substantially reduced the number of differences between NF54 and

352 3D7 (Figure S1). Illumina reads from NF54 were used to further polish the assembly

353 using Pilon. The resulting NF54 assembly is 23.4 Mb in length (Figure 1) and, when

354 compared to the 3D7 reference, contains 8,396 indels and 1,386 single nucleotide

355 polymorphisms (SNPs), a rate of ~59 SNPs per million base pairs, the vast majority of

356 which occurred in noncoding regions (Table 1). In fact, only 17 non-synonymous

357 mutations were detected in 15 intact protein-coding single-copy genes between the two

358 genomes (Supplemental Dataset 2). We applied the same assembly protocol to the

359 remaining PfSPZ strains. The resulting assemblies for NF166.C8, 7G8, and NF135.C10

360 are structurally very complete, with the 14 nuclear represented by 30, 20,

361 and 19 nuclear contigs, respectively, with each in the 3D7 reference

362 represented by one to three contigs (Figure 1). Several shorter contigs in NF54 (67,501

363 bps total), NF166.C8 (224,502 bps total), and NF135.C10 (80,944 bps total) could not be

364 unambiguously assigned to a homologous segment in the 3D7 reference genome; gene

365 annotation showed that these contigs mostly contain members of multi-gene families, and

366 therefore are likely part of sub-telomeric regions. The cumulative lengths of the four

367 assemblies ranged from 22.8 Mbp to 23.5 Mbp (Table 1), indicating variation in genome

368 size among P. falciparum strains. In particular, the 7G8 assembly was several hundred

369 thousand base-pairs smaller than the other three assemblies. To confirm that this was

15

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

370 not an assembly error, we compared 7G8 to a previously published 7G8 PacBio-based

371 assembly (PlasmoDBv41) (33). The two assemblies were extremely close in overall

372 genome structure, differing only by 2 Kbp in cumulative length, and also shared a very

373 similar number of SNP and small indel variants relative to 3D7 (Table S4).

374 Structural variations in the genomes of the PfSPZ strains

375 Many structural variants (defined as indels or tandem repeat contractions or

376 expansions, greater than 50 bp) were identified in each assembly by comparison to the

377 3D7 genome, impacting a cumulative length of 199.0 Kbp in NF166.C8 to 340.9 Kbp in

378 NF135.C10 (Table S5). Many smaller variants fell into coding regions (including known

379 pre-erythrocytic antigens), often representing variation in repeat units (Supplemental

380 Dataset 3). Several larger structural variants (>10 Kbp) exist in 7G8, NF166.C8, and

381 NF135.C10 relative to 3D7. Many of these regions contain members of multi-gene

382 families, and as expected the number of these vary between each assembly (Figure S2).

383 While all 61 3D7 var exon 1 sequences were present as almost identical sequences in

384 the NF54 assembly (with the exception of small indels in several exon 1 sequences), the

385 number of 3D7 exon 1 var sequences with identifiable orthologs in the three heterologous

386 PfSPZ strains was much lower. Only a few of the identified orthologs had amino acid

387 sequence identity higher than 75% (Figure S2), reflecting the exceptional sequence

388 diversity that occurs in this gene family in and among P. falciparum strains. A recently

389 characterized PfEMP1 protein expressed on the surface of NF54 sporozoites

390 (NF54varsporo) was shown to be involved in hepatocyte invasion (Pf3D7_0809100), and

391 antibodies to this PfEMP1 blocked hepatocyte invasion (64). No ortholog to NF54varsporo

392 was identified in the var repertoire of 7G8, NF166.C8, or NF135.C10 (Figure S2). It is

16

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

393 possible that a different, strain-specific, var gene fulfills a similar role in each of the

394 heterologous PfSPZ strains.

395 Several other large structural variants impact regions housing non-multi-gene

396 family members, although none known to be involved in pre-erythrocytic immunity.

397 Examples include a 31 Kbp-long tandem expansion of a region of chromosome 12 in the

398 7G8 assembly (also present in the previously published assembly for 7G8 (33)) and a

399 22.7 Kbp-long repeat expansion of a region of chromosome 5 in NF135.C10, both of

400 which are supported by ~200 PacBio reads. The former is a segmental duplication

401 containing a vacuolar iron transporter (PF3D7_1223700), a putative citrate/oxoglutarate

402 carrier protein (PF3D7_1223800), a putative 50S ribosomal protein L24

403 (PF3D7_1223900), GTP cyclohydrolase I (PF3D7_1224000), and three conserved

404 Plasmodium proteins of unknown function (PF3D7_1223500, PF3D7_1223600,

405 PF3D7_1224100). The expanded region in NF135.C10 represents a tandem expansion

406 of a segment housing the gene encoding the Pf multidrug resistance protein, PfMDR1

407 (PF3D7_0523000), resulting in a total of four copies of this gene in NF135.C10. Other

408 genes in this tandem expansion include those encoding an iron-sulfur assembly protein

409 (PF3D7_0522700), a putative pre-mRNA-splicing factor DUB31 (PF3D7_0522800), a

410 putative zinc finger protein (PF3D7_0522900) and a putative mitochondrial-processing

411 peptidase subunit alpha protein (PF3D7_0523100). In addition, the NF135.C10 assembly

412 contained a large translocation involving chromosomes 7 (3D7 coordinates ~520,000 to

413 ~960,000) and 8 (start to coordinate ~440,000) (Figure S3); the boundaries of the

414 translocated regions contain members of multi-gene families. Independent assemblies

415 and mapping patterns observed by aligning the PacBio reads against the NF135.C10 and

17

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

416 3D7 assemblies support the translocation (Figure S4). Furthermore, the regions of

417 chromosomes 7 and 8 where the translocation occurs are documented recombination

418 hotspots that were identified specifically in isolates from Cambodia, the site of origin of

419 NF135.C10 (65).

420 Several structural differences in genic regions were also identified between the

421 NF54 assembly and the 3D7 genome (Supplemental Dataset 3); if real, these structural

422 variants would have important implications in the interpretation of trials using 3D7 as a

423 homologous CHMI strain. For example, a 1,887 bp tandem expansion was identified in

424 the NF54 assembly on chromosome 10, which overlapped the region containing liver

425 stage antigen 1 (PfLSA-1, PF3D7_1036400). The structure of this gene in the NF54 strain

426 was reported when PfLSA-1 was first characterized, with unique N- and C-terminal

427 regions flanking a repetitive region consisting of several dozen repeats of a 17 amino acid

428 motif (66,67). The CDS of PfLSA-1 in the NF54 assembly was 5,406 bp in length but only

429 3,489 bp-long in the 3D7 reference. To determine if this was an assembly error in the

430 NF54 assembly, the PfLSA-1 locus from a recently published PacBio-based assembly of

431 3D7 (32) was compared to that of NF54. The two sequences were identical, likely

432 indicative of incorrect collapsing of the repeat region of PfLSA-1 in the 3D7 reference;

433 NF54 and 3D7 PacBio-based assemblies had 79 units of the 17-mer amino acid repeat,

434 compared to only 43 in the 3D7 reference sequence, a result further validated by the

435 inconsistent depth of mapped Illumina reads from NF54 between the PfLSA repeat region

436 and its flanking unique regions in the 3D7 reference (Figure S5). Using the same

437 comparative rationale as above, where structural differences relative to the 3D7

438 reference, but common to both PabBio-based 3D7 and NF54 assemblies, are considered

18

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

439 true, all structural variations identified in NF54 likely represent existing structural errors in

440 the 3D7 reference genome, including 15 variants that affect genic regions (Figure S6).

441 Small sequence variants between PfSPZ strains and the reference 3D7 genome

442 Very few small sequence variants were identified in NF54 compared to the 3D7

443 reference; 17 non-synonymous mutations were present in 15 single-copy non-

444 pseudogene-encoding loci (Supplemental Dataset 2). Short indels were detected in 185

445 genes; many of these indels have length that is not multiple of three and occur in

446 homopolymer runs, possibly representing remaining PacBio sequencing error. However,

447 some may be real, as a small indel causing a frameshift in PF3D7_1417400, a putative

448 protein-coding pseudogene that has previously been shown to accumulate premature

449 stop codons in laboratory-adapted strains (68), and some may be of biologic importance,

450 such as those seen in two histone-related proteins (PF3D7_0823300 and

451 PF3D7_1020700). It has been reported that some clones of 3D7, unlike NF54, are unable

452 to consistently produce gametocytes in long-term culture (27,69); no polymorphisms were

453 observed within or directly upstream of PfAP2-G (PF3D7_1222600) (Table S6), which

454 has been identified as a transcriptional regulator of sexual commitment in P. falciparum

455 (70). However, a non-synonymous mutation from arginine to proline (R1286P) was

456 observed in a AP2-coincident C-terminal domain of PfAP2-L (PF3D7_0730300), a gene

457 associated with liver stage development (71); a proline was also present at the

458 homologous position in 7G8, NF166.C8, and NF135.C10. Two additional putative AP2

459 transcription factor genes (PF3D7_0613800 and PF3D7_1317200) contained a three bp

460 deletion in NF54 relative to 3D7. Although knock-downs of PF3D7_1317200 have

461 resulted in a loss of gametocyte production (72), neither indel was located within predicted

19

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

462 AP2 domains, and so their biological relevance remains to be determined. 7G8, NF66.C8,

463 and NF135.C10 also have numerous non-synonymous mutations and indels within

464 putative AP2 genes, a small number of which fell in predicted AP2 domains. Finally,

465 NF135.C10 has an insertion almost 200 base-pairs in length relative to 3D7 in the 3’ end

466 of PfAP2-G; the insertion also carries a premature stop codon, leading to a considerably

467 different C-terminal end for the transcription factor (Figure S7). This allele is also present

468 in previously published assemblies for clones from Southeast Asia (33), including the

469 culture-adapted strain DD2, and variations of this indel (without the in-frame stop codon)

470 are also found in several non-human malaria Plasmodium species) (Figure S7),

471 suggesting an interesting evolutionary trajectory of this sequence.

472 Given that no absolute correlates of protection are known for whole organism P.

473 falciparum vaccines, genetic differences were assessed both across the genome and in

474 pre-erythrocytic genes of interest in the three heterologous CHMI strains. As expected,

475 the number of mutations between 3D7 and these three PfSPZ strains was much higher

476 than observed for NF54, with ~40K-55K SNPs and as many indels in each pairwise

477 comparison, roughly randomly distributed among intergenic regions, silent and non-

478 synonymous sites (Table 1, Figure 2), and corresponding to a pairwise SNP density

479 relative to 3D7 of 1.9, 2.1 and 2.2 SNPs/Kbp for 7G8, NF166.C8 and NF135.C10,

480 respectively. NF135.C10 had the highest number of unique SNPs genome-wide (SNPs

481 not shared with other PfSPZ strains), with 5% more unique SNPs than NF166.C8 and

482 33% more than 7G8 (Figure S8). Similar trends were seen when restricting the analyses

483 to non-synonymous SNPs (5% and 26% more than NF166.C8 and 7G8, respectively).

484 The lower number of unique SNPs in 7G8 may be due in part to the smaller genome size

20

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

485 of this strain. Small indels with length multiple of three (but not two) were also common in

486 coding regions across the genome, as expected from purifying selection on mutations

487 that disrupt the reading frame, whereas indels in non-coding regions were primarily of

488 lengths multiple of two, from unequal crossover or replication slippage in the regions of

489 dinucleotide repeats common in the P. falciparum genome (Figure S9), confirming what

490 has been shown in previous work using short-read whole genome sequencing data (73).

491 SNPs were also common in a panel of 42 pre-erythrocytic genes known or

492 suspected to be implicated in immunity to liver-stage parasites (see Methods; Table S7).

493 While the sequence of all these loci was identical between NF54 and 3D7, there was a

494 wide range in the number of sequence variants per locus between 3D7 and the other

495 three PfSPZ strains, with some genes being more conserved than others. For example,

496 PfCSP showed 8, 7 and 6 non-synonymous mutations in 7G8, NF166.C8, and

497 NF135.C10, respectively, relative to 3D7. However, PfLSA-1 had over 100 non-

498 synonymous mutations in all three heterologous strains relative to 3D7 (many in the

499 repetitive region of this gene), in addition to significant length differences in the internal

500 repeat region (Figure S10).

501 Immunological relevance of genetic variation among PfSPZ strains

502 The sequence variants mentioned above may impact the ability of the immune

503 system primed with NF54 to recognize the other PfSPZ strains, impairing vaccine efficacy

504 against heterologous CHMI. Data from murine and non-human primate models

505 (5,28,29,74) demonstrate that CD8+ T cells are required for protective efficacy; therefore,

506 the identification of shared and unique CD8+ T cell epitopes across the genome in all four

507 PfSPZ strains may help interpret the differential efficacy seen in heterologous relative to

21

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

508 homologous CHMI. To create a comparable set of genes in which we were confident of

509 the amino acid sequence, we subset the 5,548 protein-coding genes in the 3D7 genome

510 annotation to 4,814 genes that (1) did not have premature stop codons, 2) were annotated

511 with an expected number of mRNA fragments, and (3) met the above requirements in all

512 four PfSPZ strains (in practice, this removed most, if not all, members of multi-gene

513 families, such as vars, rifins, stevors) (Figure 3A). Strong-binding MHC class I epitopes

514 in the protein sequences from these loci were identified using in silico epitope predictions

515 based on HLA types common in sub-Saharan Africa populations (Table S1).

516 Similar total number of epitopes (sum of unique epitopes, regardless of the HLA-

517 type, across genes) were identified in the three heterologous CHMI strains, with

518 NF166.C8 having the fewest (492,291) and NF135.C10 having the most (495,285)

519 (Figure 3B). More epitopes were identified in NF54 (n=512,172) than in the three

520 heterologous CHMI strains. This slightly higher number might be due to a more accurate

521 mapping of the 3D7 genome annotation to the NF54 assembly compared to the other

522 three strains (Figure 3A). More than 80% of epitopes detected in each strain were

523 identical (by sequence and by locus) across all four strains, reflecting epitopes predicted

524 in conserved regions of the 4,814 genes used in this analysis. NF54 had the highest

525 proportion of unique epitopes (9,002, or 1.76%) (Figure 3A), i.e., epitopes not shared

526 with the other PfSPZ strains, followed by NF135.C10 (4,123 or 0.83%), 7G8 (3,500 or

527 0.71%), and NF166.C8 (2,878, or 0.58%). After direct comparisons of NF54 to each

528 heterologous PfSPZ strain, NF166 had the lowest proportion of epitopes not shared with

529 NF54 (1.97%, compared to 2.24% and 2.27% for 7G8 and NF135.C10, respectively), as

22

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

530 would follow from a positive correlation between geographic distance and genetic

531 similarity.

532 To focus the analysis on genes that might be important for the protective efficacy

533 of a whole sporozoite vaccine, the above epitope predictions were stratified by two

534 additional gene sets: 1) 1,974 genes that were found to be expressed in the sporozoite

535 stage of the parasite in mosquito salivary glands (54,55), and (2) the 42 pre-erythrocytic

536 genes of interest mentioned above. In the sporozoite-expressed gene set, similar patterns

537 as in the genome-wide gene set above were seen, with NF135.C10 being the

538 heterologous CHMI strain that had the highest number of epitopes not shared with any

539 other strain (1,243, or 0.6%), and NF166 having the smallest (884, or 0.4%) (Figure 3C).

540 Within the 42 pre-erythrocytic gene set, there were 117, 121, and 153 such unique

541 epitopes present in 7G8, NF166.C8, and NF135.C10, respectively (Figure S11, Table

542 S8). Genes where unique epitopes were identified in this gene set were also, on average,

543 more polymorphic (Table S7).

544 Some of these variations in epitope sequences are relevant for the interpretation

545 of the outcome of PfSPZ Vaccine trials. For example, while all four strains are identical in

546 sequence composition in a B cell epitope potentially relevant for protection recently

547 identified in the circumsporozoite protein (PfCSP, PF3D7_0304600) (75), another B cell

548 epitope that partially overlaps it (76) contained an A98G amino acid difference in 7G8 and

549 NF135.C10 relative to NF54 and NF166.C8. There was also variability in CD8+ T cell

550 epitopes recognized in the Th2R region of the protein, some known for decades (77).

551 Specifically, the PfCSP encoded by the 3D7/NF54 allele was predicted to bind to both

552 HLA-A and HLA-C allele types, but the homologous protein segments in NF166.C8 and

23

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

553 NF135.C10 were only recognized by HLA-A allele types, and no epitope was detected

554 (within the HLA types studied) at that position in PfCSP encoded in 7G8 (Figure 4).

555 Expansion of the analyses to additional HLA types revealed an allele (HLA-08:01) that is

556 predicted to bind to the Th2R region of the 7G8-encoded PfCSP; however, HLA-08:01 is

557 much more frequent in European populations (10-15%) than in African populations (1-

558 6%) (51). Therefore, if CD8+ T cell epitopes in the Th2R region of 7G8 are important for

559 protection, which is currently unknown, the level of protection against CHMI with 7G8

560 observed in volunteers of European descent may not be correlated with PfSPZ Vaccine

561 efficacy in Africa. Indels and repeat regions also sometimes affected the number of

562 predicted epitopes in each antigen; for example, an insertion in 7G8 near amino acid

563 residue 1,600 in PfLISP-2 (PF3D7_0405300) contained additional predicted epitopes

564 (Figure S12). Similar patterns in variation in epitope recognition and frequency were

565 found in other pre-erythrocytic genes of interest, including PfLSA-3 (PF3D7_0220000),

566 PfAMA-1 (PF3D7_1133400) and PfTRAP (PF3D7_1335900) (Figure S12).

567 PfSPZ strains and global parasite diversity

568 The four PfSPZ strains have been adapted and kept in culture for extended periods

569 of time. To determine if they are still representative of the malaria-endemic regions from

570 which they were collected, we compared these strains to over 600 recent (2007-2014)

571 clinical isolates from South America, Africa, Southeast Asia, and Oceania (Supplemental

572 Dataset 4), using principal coordinates analysis (PCoA) based on SNP calls generated

573 from Illumina whole genome sequencing data. The results confirmed the existence of

574 global geographic differences in genetic variation previously reported (78,79), including

575 clustering by continent, as well as a separation of East from West Africa and of the

24

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

576 Amazonian region from that West of the Andes (Figure 5). The PfSPZ strains clustered

577 with others from their respective geographic regions, both at the genome-wide level and

578 when restricting the data set to SNPs in the panel of 42 pre-erythrocytic antigens, despite

579 the long-term culturing of some of these strains (Figure 5). An admixture analysis of

580 South American and African clinical isolates confirmed that NF54 and NF166.C8 both

581 have the genomic background characteristic of West Africa, while 7G8 is clearly a South

582 American strain (Figure S13).

583 NF135.C10 was isolated in the early 1990s (14), at a time when resistance to

584 chloroquine and sulfadoxine-pyrimethamine resistance was entrenched and resistance

585 to mefloquine was emerging (80,81), and carries signals from this period of drug pressure.

586 Four copies of MDR-1 were identified in NF135.C10 (Table S9); however, two of these

587 copies appeared to have premature stop codons introduced by SNPs and/or indels,

588 leaving potentially only two functional copies in the genome. While NF135.C10 also had

589 numerous point mutations relative to 3D7 in genes such as PfCRT (conveying

590 chloroquine resistance), and PfDHPS and PfDHR (conveying sulfadoxine-pyrimethamine

591 resistance), NF135.C10 was isolated before the widespread deployment of artemisinin-

592 based combination therapies (ACTs), and had the wild-type allele in the locus that

593 encodes the kelch protein in chromosome 13 (PfK13), with no mutations known to convey

594 artemisinin resistance detected in the propeller region (Table S10).

595 The emergence in Southeast Asia of resistance to antimalarial drugs, including

596 artemisinins and drugs used in artemisinin-based combination treatments, is thought to

597 underlie the complex and dynamic parasite population structure in the region (82).

598 Several relatively homogeneous subpopulations, whose origin is likely linked to the

25

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

599 emergence and rapid spread of drug resistance mutations, exist in parallel with a sensitive

600 subpopulation that reflects the ancestral population in the region (referred to as KH1),

601 and another subpopulation of admixed genomic background (referred to as KHA),

602 possibly the source of the drug-resistant subpopulations or the result of a secondary mix

603 of resistant subpopulations (61,62,83,84). This has been accompanied by reports of

604 individual K13 mutations conferring artemisinin resistance occurring independently on

605 multiple genomic backgrounds (85). To determine the subpopulation to which NF135.C10

606 belongs, an admixture analysis was conducted using isolates from Southeast Asia and

607 Oceania, including NF135.C10. Eleven total populations were detected, of which seven

608 contained Cambodian isolates (Figure 6). Both admixture and hierarchical clustering

609 analyses suggest that NF135.C10 is representative of the previously described admixed

610 KHA subpopulation (61,62) (Figure 6), implying that NF135.C10 is representative of a

611 long-standing admixed population of parasites in Cambodia rather than one of several

612 subpopulations thought to have arisen more recently following the emergence of drug

613 resistance.

614 Discussion

615 Whole organism sporozoite vaccines have provided variable levels of protection in

616 initial clinical trials; the radiation-attenuated PfSPZ Vaccine has been shown to protect

617 >90% of subjects against homologous CHMI at 3 weeks after the last dose in 5 clinical

618 trials in the U.S. (6,8) and Germany (11). However, efficacy has been lower against

619 heterologous CHMI (8,9), and in field studies in a region of intense transmission, in Mali,

620 at 24 weeks (10). Interestingly, for the exact same immunization regimen, protective

621 efficacy by proportional analysis was greater in the field trial in Mali (29%) than it was

26

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

622 against heterologous CHMI with Pf 7G8 in the US at 24 weeks after last dose of vaccine

623 (8%) (8,10). While evidence shows that whole organism-based vaccine efficacy can be

624 improved by adjusting the vaccine dose and schedule (11), further optimization of such

625 vaccines will be facilitated by a thorough understanding of the genotypic and immunologic

626 differences among the PfSPZ strains and between them and parasites in malaria endemic

627 regions.

628 A recent study examined whole genome short-read sequencing data to

629 characterize NF166.C8 and NF135.C10 through SNP calls, and identified a number of

630 non-synonymous mutations at a few loci potentially important for the efficacy of CPS, the

631 foundation for PfSPZ-CVac (17). The analyses described here, using high-quality de novo

632 genome assemblies, expand the analysis to hard-to-call regions, such as those

633 containing gene families, repeats and other low complexity sequences. The added

634 sensitivity enabled the thorough genomic characterization of these and additional

635 vaccine-related strains, and revealed a considerably higher number of sequence variants

636 than can be called using short read data alone, as well as indels and structural variants

637 between assemblies. For example, the insertion close to the 3’ end of PfAP2-G detected

638 in NF135.C10 and shared by DD2 has not, to the best of our knowledge, been reported

639 before, despite the multiple studies highlighting the importance of this gene in sexual

640 commitment in P. falciparum strains, including DD2 (70). Long-read sequencing also

641 confirmed that differences observed between NF54 and 3D7 in a major liver stage

642 antigen, PfLSA-1, result from one of a small number of errors lingering in the reference

643 3D7 genome, which is being continually updated and improved (34). Confirmation that

644 NF54 and 3D7 are identical at this locus is critical when 3D7 has been used as a

27

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

645 homologous CHMI in whole sporozoite, NF54-based vaccine studies. Furthermore, the

646 comprehensive sequence characterization of variant surface antigen-encoding loci, such

647 as PfEMP1-encoding genes, will enable the use of the PfSPZ strains to study the role of

648 these protein families in virulence, naturally acquired immunity and vaccine-induced

649 protection (86).

650 The comprehensive genetic and genomic studies reported herein were designed

651 to provide insight into the outcome of homologous and heterologous CHMI studies, and

652 to determine whether the CHMI strains can be used as a proxy for strains present in the

653 field. Comparison of genome assemblies confirmed that NF54 and 3D7 have remained

654 genetically very similar over time, and that 3D7 is an appropriate homologous CHMI

655 strain. As expected, 7G8, NF166.C8, and NF135.C10 were genetically very distinct from

656 NF54 and 3D7, with thousands of differences across the genome including dozens in

657 known pre-erythrocytic antigens. The identification of sequence variants (both SNPs and

658 indels) within transcriptional regulators, such as the AP2 family, may assist in the study

659 of different growth phenotypes in these strains. NF166.C8 and NF135.C10 merozoites

660 enter the blood stream several days earlier than those of NF54 (15), suggesting that NF54

661 may develop more slowly in hepatocytes than do the other two strains. Therefore,

662 mutations in genes associated with liver-stage development (as was observed with

663 PfAP2-L) may be of interest to explore further. Finally, comparison of the PfSPZ strains

664 to whole genome sequencing data from clinical isolates shows that, at the whole genome

665 level, they are indeed representative of their geographical regions of origin.

666 These results can assist in the interpretation of CHMI studies in multiple ways.

667 First, in the 42 pre-erythrocytic genes of interest, the Brazilian strain 7G8, one of the first

28

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

668 strains to be used in heterologous CHMI studies, and the Guinean strain NF166.C8 share

669 a similar number of epitopes with NF54. Furthermore, both NF166.C8 and NF135.C10

670 encode more unique non-synonymous SNPs genome-wide (19% and 33%, respectively)

671 and epitopes in these genes relative to NF54 than does 7G8. These results show that the

672 practice of equating geographic distance to genetic differentiation is not always valid, and

673 that the interpretation of CHMI studies should rest upon thorough genome-wide

674 comparisons. Lastly, of all PfSPZ strains, NF135.C10 is the most genetically distinct from

675 NF54, including in point mutations, unique epitopes, and its placement within the global

676 parasite population. We do not know whether the peptide sequences of the actual epitope

677 targets of vaccine-induced protective immunity are equally distinct in NF135.C10.

678 However, if proteome-wide genetic divergence is the primary determinant of differences

679 in protection against different parasites, the extent to which NF54-based immunization

680 protects against CHMI with NF135.C10 is important in understanding the ability of PfSPZ

681 Vaccine and other whole-organism malaria vaccines to protect against diverse parasites

682 present world-wide. These conclusions are drawn from genome-wide analyses and from

683 subsets of genes for which a role in whole-sporozoite-induced protection is suspected but

684 not experimentally established. Conclusive statements regarding cross-protection will

685 require the additional knowledge of the genetic basis of whole-organism vaccine

686 protection.

687 Without more information on the epitope targets of protective immunity induced by

688 PfSPZ vaccines, it is difficult to rationally design multi-strain PfSPZ vaccines. However,

689 these data and analysis approach has the potential to be applied to rationally design multi-

690 strain sporozoite-based vaccines once knowledge of those critical epitope sequences is

29

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

691 available, and characterization of a variety of P. falciparum strains may facilitate the

692 development of region-specific or multi-strain vaccines with greater protective efficacy.

693 Support for a genomics-guided approach to guide such next-generation vaccines can be

694 found in other whole organism parasitic vaccines. Field trials testing the efficacy of first-

695 generation whole killed-parasite vaccines against Leishmania had highly variable results

696 (87). While most studies failed to show protection, indicating that killed, whole-cell

697 vaccines for leishmaniasis may not produce the necessary protective response, a trial

698 demonstrating significant protection utilized a multi-strain vaccine, with strains collected

699 from the immediate area of the trial (88), highlighting the importance of understanding the

700 distribution of genetic diversity in pathogen populations. In addition, a highly efficacious

701 non-attenuated, three-strain, whole organism vaccine exists against Theileria parva, a

702 protozoan parasite that causes East coast fever in cattle. This vaccine, named Muguga

703 Cocktail, consists of a mix of three live strains of T. parva that are administered in an

704 infection-and-treatment method, similar to the approach utilized by PfSPZ-CVac. It has

705 been shown recently that two of the strains are genetically very similar, possibly clones

706 of the same isolates (89). Despite this, the vaccine remains highly efficacious and in high

707 demand (90). In addition, the third vaccine strain in the Muguga Cocktail is quite distinct

708 from the other two, with ~5 SNPs/kb (89), or about twice the SNP density seen between

709 NF54 and other PfSPZ strains. These observations suggest that an efficacious multi-

710 strain vaccine against a highly variable parasite species does not need to be contain a

711 large number of strains, but that the inclusion of highly divergent strains may be

712 warranted. These results also speak to the promise of multi-strain vaccines against highly

713 diverse pathogens, including apicomplexans with large genomes and complex life cycles.

30

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

714 Next-generation whole genome sequencing technology has opened many

715 avenues for infectious disease research and holds great promise for informing vaccine

716 design. While most malaria vaccine development has occurred before the implementation

717 of regular use of whole genome sequencing, the tools now available allow the precise

718 characterization and informed selection of vaccine strains early in the development

719 process. These results will greatly assist in the interpretation of clinical trials using these

720 strains for CHMI.

721 References and Notes: 722 723 1. World Malaria Report 2018. Geneva: World Health Organization; 2018 p. 1–210.

724 2. Bhatt S, Weiss DJ, Cameron E, Bisanzio D, Mappin B, Dalrymple U, et al. The 725 effect of malaria control on Plasmodium falciparum in Africa between 2000 and 726 2015. Nature. 2015 Oct 8;526(7572):207–11.

727 3. Delemarre BJ, van der Kaay HJ. Tropical malaria contracted the natural way in the 728 Netherlands. Ned Tijdschr Geneeskd. 1979 Nov 17;123(46):1981–2.

729 4. Hoffman SL, Billingsley PF, James E, Richman A, Loyevsky M, Li T, et al. 730 Development of a metabolically active, non-replicating sporozoite vaccine to 731 prevent Plasmodium falciparum malaria. Hum Vaccin. 2010 Jan;6(1):97–106.

732 5. Epstein JE, Tewari K, Lyke KE, Sim BKL, Billingsley PF, Laurens MB, et al. Live 733 Attenuated Malaria Vaccine Designed to Protect Through Hepatic CD8+ T Cell 734 Immunity. Science. 2011 Oct 28;334(6055):475–80.

735 6. Seder RA, Chang L-J, Enama ME, Zephir KL, Sarwar UN, Gordon IJ, et al. 736 Protection Against Malaria by Intravenous Immunization with a Nonreplicating 737 Sporozoite Vaccine. Science. 2013 Sep 20;341(6152):1359–65.

738 7. Ishizuka AS, Lyke KE, DeZure A, Berry AA, Richie TL, Mendoza FH, et al. 739 Protection against malaria at 1 year and immune correlates following PfSPZ 740 vaccination. Nat Med. 2016 Jun;22(6):614–23.

741 8. Epstein JE, Paolino KM, Richie TL, Sedegah M, Singer A, Ruben AJ, et al. 742 Protection against Plasmodium falciparum malaria by PfSPZ Vaccine. JCI Insight 743 [Internet]. 2017 Jan 12 [cited 2017 Jan 14];2(1). Available from: 744 http://insight.jci.org/articles/view/89154

31

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

745 9. Lyke KE, Ishizuka AS, Berry AA, Chakravarty S, DeZure A, Enama ME, et al. 746 Attenuated PfSPZ Vaccine induces strain-transcending T cells and durable 747 protection against heterologous controlled human malaria infection. Proc Natl Acad 748 Sci. 2017 Mar 7;114(10):2711–6.

749 10. Sissoko MS, Healy SA, Katile A, Omaswa F, Zaidi I, Gabriel EE, et al. Safety and 750 efficacy of PfSPZ Vaccine against Plasmodium falciparum via direct venous 751 inoculation in healthy malaria-exposed adults in Mali: a randomised, double-blind 752 phase 1 trial. Lancet Infect Dis. 2017 May 1;17(5):498–509.

753 11. Mordmüller B, Surat G, Lagler H, Chakravarty S, Ishizuka AS, Lalremruata A, et al. 754 Sterile protection against human malaria by chemoattenuated PfSPZ vaccine. 755 Nature. 2017 Feb 23;542(7642):445–9.

756 12. Vaughan AM, Sack BK, Dankwa D, Minkah N, Nguyen T, Cardamone H, et al. A 757 Plasmodium Parasite with Complete Late Liver Stage Arrest Protects against 758 Preerythrocytic and Erythrocytic Stage Infection in Mice. Infect Immun. 2018;86(5).

759 13. Mendes AM, Machado M, Gonçalves-Rosa N, Reuling IJ, Foquet L, Marques C, et 760 al. A Plasmodium berghei sporozoite-based vaccination platform against human 761 malaria. NPJ Vaccines [Internet]. 2018 [cited 2019 Jun 25];3. Available from: 762 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6109154/

763 14. Teirlinck AC, Roestenberg M, Vegte-Bolmer M van de, Scholzen A, Heinrichs MJL, 764 Siebelink-Stoter R, et al. NF135.C10: A New Plasmodium falciparum Clone for 765 Controlled Human Malaria Infections. J Infect Dis. 2013 Feb 15;207(4):656–60.

766 15. McCall MBB, Wammes LJ, Langenberg MCC, Gemert G-J van, Walk J, Hermsen 767 CC, et al. Infectivity of Plasmodium falciparum sporozoites determines emerging 768 parasitemia in infected volunteers. Sci Transl Med. 2017 Jun 21;9(395):eaag2490.

769 16. Schats R, Bijker EM, van Gemert G-J, Graumans W, van de Vegte-Bolmer M, van 770 Lieshout L, et al. Heterologous Protection against Malaria after Immunization with 771 Plasmodium falciparum Sporozoites. PLoS ONE. 2015 May 1;10(5):e0124243.

772 17. Walk J, Reuling IJ, Behet MC, Meerstein-Kessel L, Graumans W, van Gemert G-J, 773 et al. Modest heterologous protection after Plasmodium falciparum sporozoite 774 immunization: a double-blind randomized controlled clinical trial. BMC Med. 2017 775 Sep 13;15:168.

776 18. Volkman SK, Hartl DL, Wirth DF, Nielsen KM, Choi M, Batalov S, et al. Excess 777 Polymorphisms in Genes for Membrane Proteins in Plasmodium falciparum. 778 Science. 2002 Oct 4;298(5591):216–8.

779 19. Thera MA, Doumbo OK, Coulibaly D, Laurens MB, Ouattara A, Kone AK, et al. A 780 Field Trial to Assess a Blood-Stage Malaria Vaccine. N Engl J Med. 2011 Sep 781 15;365(11):1004–13.

32

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

782 20. Ouattara A, Takala-Harrison S, Thera MA, Coulibaly D, Niangaly A, Saye R, et al. 783 Molecular Basis of Allele-Specific Efficacy of a Blood-Stage Malaria Vaccine: 784 Vaccine Development Implications. J Infect Dis. 2013 Feb;207(3):511–9.

785 21. Neafsey DE, Juraska M, Bedford T, Benkeser D, Valim C, Griggs A, et al. Genetic 786 Diversity and Protective Efficacy of the RTS,S/AS01 Malaria Vaccine. N Engl J 787 Med. 2015 Oct 21;0(0):null.

788 22. Takala SL, Plowe CV. Genetic diversity and malaria vaccine design, testing, and 789 efficacy: Preventing and overcoming “vaccine resistant malaria.” Parasite Immunol. 790 2009 Sep;31(9):560–73.

791 23. Barry AE, Arnott A. Strategies for Designing and Monitoring Malaria Vaccines 792 Targeting Diverse Antigens. Front Immunol [Internet]. 2014 Jul 28;5. Available 793 from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4112938/

794 24. Preston MD, Campino S, Assefa SA, Echeverry DF, Ocholla H, Amambua-Ngwa A, 795 et al. A barcode of organellar genome polymorphisms identifies the geographic 796 origin of Plasmodium falciparum strains. Nat Commun [Internet]. 2014 Jun 13 [cited 797 2015 Jun 22];5. Available from: 798 http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4082634/

799 25. Drakeley CJ, Duraisingh MT, Póvoa M, Conway DJ, Targett GAT, Baker DA. 800 Geographical distribution of a variant epitope of Pfs4845, a Plasmodium falciparum 801 transmission-blocking vaccine candidate. Mol Biochem Parasitol. 1996 Oct 802 30;81(2):253–7.

803 26. Walliker D, Quakyi IA, Wellems TE, McCutchan TF, Szarfman A, London WT, et al. 804 Genetic Analysis of the Human Malaria Parasite Plasmodium falciparum. Science. 805 1987 Jun 26;236(4809):1661–6.

806 27. Delves MJ, Straschil U, Ruecker A, Miguel-Blanco C, Marques S, Dufour AC, et al. 807 Routine in vitro culture of P. falciparum gametocytes to evaluate novel 808 transmission-blocking interventions. Nat Protoc. 2016 Sep;11(9):1668–80.

809 28. Schofield L, Villaquiran J, Ferreira A, Schellekens H, Nussenzweig R, 810 Nussenzweig V. γ Interferon, CD8+ T cells and antibodies required for immunity to 811 malaria sporozoites. Nature. 1987 Dec 23;330(6149):664–6.

812 29. Weiss WR, Sedegah M, Beaudoin RL, Miller LH, Good MF. CD8+ T cells 813 (cytotoxic/suppressors) are required for protection in mice immunized with malaria 814 sporozoites. Proc Natl Acad Sci. 1988 Jan 1;85(2):573–6.

815 30. Chattopadhyay R, Pratt D. Role of controlled human malaria infection (CHMI) in 816 malaria vaccine development: A U.S. food & drug administration (FDA) 817 perspective. Vaccine. 2017 May 15;35(21):2767–9.

33

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

818 31. Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, et al. Genome 819 sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002 Oct 820 3;419(6906):498–511.

821 32. Vembar SS, Seetin M, Lambert C, Nattestad M, Schatz MC, Baybayan P, et al. 822 Complete telomere-to-telomere de novo assembly of the Plasmodium falciparum 823 genome through long-read (>11 kb), single molecule, real-time sequencing. DNA 824 Res. 2016 Jun 26;dsw022.

825 33. Otto TD, Böhme U, Sanders M, Reid A, Bruske EI, Duffy CW, et al. Long read 826 assemblies of geographically dispersed Plasmodium falciparum isolates reveal 827 highly structured subtelomeres. Wellcome Open Res [Internet]. 2018 May 3 [cited 828 2018 Jul 16];3. Available from: 829 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5964635/

830 34. Böhme U, Otto TD, Sanders M, Newbold CI, Berriman M. Progression of the 831 canonical reference malaria parasite genome from 2002–2019. Wellcome Open 832 Res [Internet]. 2019 Mar 29 [cited 2019 Jun 1];4. Available from: 833 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6484455/

834 35. Yeda R, Ingasia LA, Cheruiyot AC, Okudo C, Chebon LJ, Cheruiyot J, et al. The 835 Genotypic and Phenotypic Stability of Plasmodium falciparum Field Isolates in 836 Continuous In Vitro Culture. PLoS ONE. 2016 Jan 11;11(1):e0143565.

837 36. Oyola SO, Ariani CV, Hamilton WL, Kekre M, Amenga-Etego LN, Ghansah A, et al. 838 Whole genome sequencing of Plasmodium falciparum from dried blood spots using 839 selective whole genome amplification. Malar J. 2016;15:597.

840 37. Agrawal S, Moser KA, Morton L, Cummings MP, Parihar A, Dwivedi A, et al. 841 Association of a Novel Mutation in the Plasmodium falciparum Chloroquine 842 Resistance Transporter With Decreased Piperaquine Sensitivity. J Infect Dis. 2017 843 Aug 15;216(4):468–76.

844 38. Trager W, Jensen JB. Human malaria parasites in continuous culture. Science. 845 1976 Aug 20;193(4254):673–5.

846 39. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM. Canu: 847 scalable and accurate long-read assembly via adaptive k-mer weighting and repeat 848 separation. Genome Res. 2017 Mar 15;gr.215087.116.

849 40. Chin C-S, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. 850 Nonhybrid, finished microbial genome assemblies from long-read SMRT 851 sequencing data. Nat Methods. 2013 Jun;10(6):563–9.

852 41. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An 853 Integrated Tool for Comprehensive Microbial Variant Detection and Genome 854 Assembly Improvement. PLoS ONE. 2014 Nov 19;9(11):e112963.

34

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

855 42. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL. 856 Alignment of whole genomes. Nucleic Acids Res. 1999 Jan 1;27(11):2369–76.

857 43. Nattestad M, Schatz MC. Assemblytics: a web analytics tool for the detection of 858 variants from an assembly. Bioinformatics. 2016 Oct 1;32(19):3021–3.

859 44. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, Haeseler A, et al. 860 Accurate detection of complex structural variations using single-molecule 861 sequencing. Nat Methods. 2018 Apr 30;1.

862 45. Nattestad M, Chin C-S, Schatz MC. Ribbon: Visualizing complex genome 863 alignments and structural variation. bioRxiv. 2016 Oct 20;082123.

864 46. Dara A, Drábek EF, Travassos MA, Moser KA, Delcher AL, Su Q, et al. New var 865 reconstruction algorithm exposes high var sequence diversity in a single 866 geographic location in Mali. Genome Med. 2017;9:30.

867 47. Rask TS, Hansen DA, Theander TG, Pedersen AG, Lavstsen T. Plasmodium 868 falciparum Erythrocyte Membrane Protein 1 Diversity in Seven Genomes – Divide 869 and Conquer. PLOS Comput Biol. 2010 Sep 16;6(9):e1000933.

870 48. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat 871 Methods. 2012 Apr;9(4):357–9.

872 49. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. 873 SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell 874 Sequencing. J Comput Biol. 2012 May;19(5):455–77.

875 50. Nielsen M, Andreatta M. NetMHCpan-3.0; improved prediction of binding to MHC 876 class I molecules integrating information from multiple receptor and peptide length 877 datasets. Genome Med. 2016 Mar 30;8:33.

878 51. González-Galarza FF, Takeshita LYC, Santos EJM, Kempson F, Maia MHT, Silva 879 ALS da, et al. Allele frequency net 2015 update: new features for HLA epitopes, 880 KIR and disease and HLA adverse drug reaction associations. Nucleic Acids Res. 881 2015 Jan 28;43(D1):D784–8.

882 52. de Pablo R, García-Pacheco J m., Vilches C, Moreno M e., Sanz L, Rementería M 883 c., et al. HLA class I and class II allele distribution in the Bubi population from the 884 island of Bioko (Equatorial Guinea). Tissue Antigens. 1997 Dec 1;50(6):593–601.

885 53. Cao K, Moormann A m., Lyke K e., Masaberg C, Sumba O p., Doumbo O k., et al. 886 Differentiation between African populations is evidenced by the diversity of alleles 887 and haplotypes of HLA class I loci. Tissue Antigens. 2004 Apr 1;63(4):293–325.

888 54. Lasonder E, Janse CJ, Gemert G-J van, Mair GR, Vermunt AMW, Douradinha BG, 889 et al. Proteomic Profiling of Plasmodium Sporozoite Maturation Identifies New

35

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

890 Proteins Essential for Parasite Development and Infectivity. PLOS Pathog. 2008 891 Oct 31;4(10):e1000195.

892 55. Lindner SE, Swearingen KE, Harupa A, Vaughan AM, Sinnis P, Moritz RL, et al. 893 Total and Putative Surface Proteomics of Malaria Parasite Salivary Gland 894 Sporozoites. Mol Cell Proteomics. 2013 May 1;12(5):1127–43.

895 56. Felgner P, Hoffman SL, Seder R. Serum antibody assay for determining protection 896 from malaria, and pre-erythrocytic subunit vaccines [Internet]. US20160216276A1, 897 2016 [cited 2019 May 20]. Available from: 898 https://patents.google.com/patent/US20160216276A1/en

899 57. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The 900 Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation 901 DNA sequencing data. Genome Res. 2010 Sep 1;20(9):1297–303.

902 58. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, et al. A 903 framework for variation discovery and genotyping using next-generation DNA 904 sequencing data. Nat Genet. 2011 May;43(5):491–8.

905 59. Van der Auwera GA, Carneiro MO, Hartl C, Poplin R, del Angel G, Levy-Moonshine 906 A, et al. From FastQ data to high confidence variant calls: the Genome Analysis 907 Toolkit best practices pipeline. Curr Protoc Bioinforma Ed Board Andreas 908 Baxevanis Al. 2013 Oct 15;11(1110):11.10.1-11.10.33.

909 60. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in 910 unrelated individuals. Genome Res. 2009 Sep 1;19(9):1655–64.

911 61. Miotto O, Almagro-Garcia J, Manske M, MacInnis B, Campino S, Rockett KA, et al. 912 Multiple populations of artemisinin-resistant Plasmodium falciparum in Cambodia. 913 Nat Genet. 2013 Jun;45(6):648–55.

914 62. Dwivedi A, Reynes C, Kuehn A, Roche DB, Khim N, Hebrard M, et al. Functional 915 analysis of Plasmodium falciparum subpopulations associated with artemisinin 916 resistance in Cambodia. Malar J. 2017 Dec 19;16:493.

917 63. Carneiro MO, Russ C, Ross MG, Gabriel SB, Nusbaum C, DePristo MA. Pacific 918 biosciences sequencing technology for genotyping and variation discovery in 919 human data. BMC Genomics. 2012 Aug 5;13:375.

920 64. Zanghì G, Vembar SS, Baumgarten S, Ding S, Guizetti J, Bryant JM, et al. A 921 Specific PfEMP1 Is Expressed in P. falciparum Sporozoites and Plays a Role in 922 Hepatocyte Infection. Cell Rep. 2018 Mar 13;22(11):2951–63.

923 65. Mu J, Myers RA, Jiang H, Liu S, Ricklefs S, Waisberg M, et al. Plasmodium 924 falciparum genome-wide scans for positive selection, recombination hot spots and 925 resistance to antimalarial drugs. Nat Genet. 2010 Mar;42(3):268–71.

36

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

926 66. Guerin-Marchand C, Druilhe P, Galey B, Londono A, Patarapotikul J, Beaudoin RL, 927 et al. A liver-stage-specific antigen of Plasmodium falciparum characterized by 928 gene cloning. Nature. 1987 Sep 10;329(6135):164–7.

929 67. Fidock DA, Gras-Masse H, Lepers JP, Brahimi K, Benmohamed L, Mellouk S, et al. 930 Plasmodium falciparum liver stage antigen-1 is well conserved and contains potent 931 B and T cell determinants. J Immunol. 1994 Jul 1;153(1):190–204.

932 68. Claessens A, Affara M, Assefa SA, Kwiatkowski DP, Conway DJ. Culture 933 adaptation of malaria parasites selects for convergent loss-of-function mutants. Sci 934 Rep. 2017 Jan 24;7:41303.

935 69. Gebru T, Lalremruata A, Kremsner PG, Mordmüller B, Held J. Life-span of in vitro 936 differentiated Plasmodium falciparum gametocytes. Malar J. 2017 Aug 11;16:330.

937 70. Kafsack BFC, Rovira-Graells N, Clark TG, Bancells C, Crowley VM, Campino SG, 938 et al. A transcriptional switch underlies commitment to sexual development in 939 malaria parasites. Nature. 2014 Mar 13;507(7491):248–52.

940 71. Iwanaga S, Kaneko I, Kato T, Yuda M. Identification of an AP2-family Protein That 941 Is Critical for Malaria Liver Stage Development. PLOS ONE. 2012 Nov 942 7;7(11):e47557.

943 72. Ikadai H, Saliba KS, Kanzok SM, McLean KJ, Tanaka TQ, Cao J, et al. Transposon 944 mutagenesis identifies genes essential for Plasmodium falciparum 945 gametocytogenesis. Proc Natl Acad Sci. 2013 Apr 30;110(18):E1676–84.

946 73. Miles A, Iqbal Z, Vauterin P, Pearson R, Campino S, Theron M, et al. Indels, 947 structural variation, and recombination drive genomic diversity in Plasmodium 948 falciparum. Genome Res. 2016 Sep 1;26(9):1288–99.

949 74. Doolan DL, Hoffman SL. The Complexity of Protective Immunity Against Liver- 950 Stage Malaria. J Immunol. 2000 Aug 1;165(3):1453–62.

951 75. Kisalu NK, Idris AH, Weidle C, Flores-Garcia Y, Flynn BJ, Sack BK, et al. A human 952 monoclonal antibody prevents malaria infection by targeting a new site of 953 vulnerability on the parasite. Nat Med. 2018 Apr;24(4):408–16.

954 76. Tan J, Sack BK, Oyen D, Zenklusen I, Piccoli L, Barbieri S, et al. A public antibody 955 lineage that potently inhibits malaria infection by dual binding to the 956 circumsporozoite protein. Nat Med. 2018 May;24(4):401–7.

957 77. Doolan DL, Saul AJ, Good MF. Geographically restricted heterogeneity of the 958 Plasmodium falciparum circumsporozoite protein: relevance for vaccine 959 development. Infect Immun. 1992 Feb;60(2):675–82.

37

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

960 78. Manske M, Miotto O, Campino S, Auburn S, Almagro-Garcia J, Maslen G, et al. 961 Analysis of Plasmodium falciparum diversity in natural infections by deep 962 sequencing. Nature. 2012 Jul 19;487(7407):375–9.

963 79. Yalcindag E, Elguero E, Arnathau C, Durand P, Akiana J, Anderson TJ, et al. 964 Multiple independent introductions of Plasmodium falciparum in South America. 965 Proc Natl Acad Sci. 2012 Jan 10;109(2):511–6.

966 80. Wongsrichanalai C, Pickard AL, Wernsdorfer WH, Meshnick SR. Epidemiology of 967 drug-resistant malaria. Lancet Infect Dis. 2002 Apr 1;2(4):209–18.

968 81. Plowe CV. The evolution of drug-resistant malaria. Trans R Soc Trop Med Hyg. 969 2009 Apr 1;103(Supplement 1):S11–4.

970 82. Parobek CM, Parr JB, Brazeau NF, Lon C, Chaorattanakawee S, Gosi P, et al. 971 Partner-Drug Resistance and Population Substructuring of Artemisinin-Resistant 972 Plasmodium falciparum in Cambodia. Genome Biol Evol. 2017 Jun 1;9(6):1673–86.

973 83. MalariaGEN Plasmodium falciparum Community (last). Genomic epidemiology of 974 artemisinin resistant malaria. eLife. 2016 Mar 4;5:e08714.

975 84. Cerqueira GC, Cheeseman IH, Schaffner SF, Nair S, McDew-White M, Phyo AP, et 976 al. Longitudinal genomic surveillance of Plasmodium falciparum malaria parasites 977 reveals complex genomic architecture of emerging artemisinin resistance. Genome 978 Biol. 2017;18:78.

979 85. Takala-Harrison S, Jacob CG, Arze C, Cummings MP, Silva JC, Dondorp AM, et 980 al. Independent Emergence of Artemisinin Resistance Mutations Among 981 Plasmodium falciparum in Southeast Asia. J Infect Dis. 2015 Mar 1;211(5):670–9.

982 86. Kraemer SM, Smith JD. A family affair: var genes, PfEMP1 binding, and malaria 983 disease. Curr Opin Microbiol. 2006 Aug 1;9(4):374–80.

984 87. Noazin S, Modabber F, Khamesipour A, Smith PG, Moulton LH, Nasseri K, et al. 985 First generation leishmaniasis vaccines: A review of field efficacy trials. Vaccine. 986 2008 Dec 9;26(52):6759–67.

987 88. Armijos RX, Weigel MM, Aviles H, Maldonado R, Racines J. Field Trial of a 988 Vaccine against New World Cutaneous Leishmaniasis in an At-Risk Child 989 Population: Safety, Immunogenicity, and Efficacy during the First 12 Months of 990 Follow-Up. J Infect Dis. 1998 May 1;177(5):1352–7.

991 89. Norling M, Bishop RP, Pelle R, Qi W, Henson S, Drábek EF, et al. The genomes of 992 three stocks comprising the most widely utilized live sporozoite Theileria parva 993 vaccine exhibit very different degrees and patterns of sequence divergence. BMC 994 Genomics. 2015;16:729.

38

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

995 90. Nene V, Kiara H, Lacasta A, Pelle R, Svitek N, Steinaa L. The biology of Theileria 996 parva and control of East Coast fever - Current status and future trends. Ticks Tick- 997 Borne Dis. 2016;7(4):549–64.

998 999 1000 Declarations

1001 Ethics approval and consent to participate: Not applicable

1002 Consent for publication: Not applicable

1003 Availability of data and material: Raw sequencing data and assemblies generated for

1004 this manuscript have been uploaded to NCBI. See Dataset 1 for BioSample

1005 identification numbers for assemblies and whole genome sequence data for all samples

1006 used in these analyses, including publicly available data.

1007 Competing interests: T.L., B.K.L.S., and S.L.H. are salaried employees of Sanaria

1008 Inc., the developer and owner of PfSPZ Vaccine, and the supplier of the PfSPZ

1009 strains sequenced in this study. In addition, S.L.H. and B.K.L.S. have a financial

1010 interest in Sanaria Inc. All other authors declare no competing financial interests.

1011 Funding: National Institutes of Health (NIH) awards U19 AI110820 and R01 AI141900,

1012 and a “Dean’s Challenge Award”, an intramural program from the University of Maryland

1013 School of Medicine, provided funds to sequence the PfSPZ strains and Pf clinical

1014 isolates, and support for K.A.M., E.F.D., A.D., M.A., A.O., K.L., L.S., L.T., M.T., S.T.H.,

1015 C.M.F., C.V.P. and J.C.S.. A.O. and C.V.P. were also supported by the Howard Hughes

1016 Medical Institute. The parasites provided by Sanaria were produced in part by funding

1017 from two SBIR grants from National Institute of Allergy and Infectious Diseases, NIH,

1018 2R44AI058375 and 5R44AI055229-09A1. Collection of clinical isolates that were

1019 sequenced for this study were funded by U19 AI089683, grants from CNPq

1020 (590106/2011-2), grants from the Armed Forces Health Surveillance Center/Global

39

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1021 Emerging Infections Surveillance and Response System (WR1576 and WR2017), the

1022 NIH International Centers of Excellence in Malaria Research program (U19 AI089681)

1023 (Brazil), and the Howard Hughes Medical Institute. S.K. and A.M.P. were supported by

1024 the Intramural Research Program of the National Human Genome Research Institute,

1025 NIH (1ZIAHG200398). P.T.R. was supported by a scholarship from the Conselho

1026 Nacional de Desenvolvimento Científico e Tecnológico (CNPq), which also provided a

1027 senior researcher scholarship to M.U.F.

1028 Authors' contributions: K.A.M. designed the study, performed analyses, and wrote the

1029 paper. J.C.S., C.V.P., and S.L.H. designed the study, contributed samples and funding,

1030 and wrote the paper. E.F.D., A.D., J.C., E.M.S., A.D., S.K., A.M.P, and A.O. contributed

1031 analyses. Z.S., M.A., T.L., L.S., L.T. preformed sample preparation and sequencing.

1032 P.T.R., K.E.L., M.D.S., K.J., C.L., D.L.S., M.U.F., M.N., M.K.L., M.T., R.W.S., S.T.H.,

1033 C.M.F, and B.K.L.S contributed samples, funding, and wrote the paper.

1034 Acknowledgements: We would like to thank the study teams that assisted in sample

1035 collections in Mali, Malawi, Cambodia, Thailand, and Myanmar. We would like to

1036 acknowledge Drs. Terrie Taylor and Don Mathanga for their assistance with collection of

1037 clinical isolates in Malawi.

1038 Authors' information (optional)

1039 Additional author information for M. D. S, K. J., C. L., D. L. S (affiliated with the Armed

1040 Forces Research Institute of Medical Sciences): Material has been reviewed by the

1041 Walter Reed Army Institute of Research. There is no objection to its presentation

1042 and/or publication. The opinions or assertions contained herein are the private views of

1043 the author, and are not to be construed as official, or as reflecting true views of the

40

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1044 Department of the Army or the Department of Defense. The investigators have adhered

1045 to the policies for protection of human subjects as prescribed in AR 70–25.

1046 Tables

1047 Table 1: The PfSPZ strains differ from 3D7 in terms of genome size and base-pair 1048 content. Characteristics of the PacBio assembly for each strain (first four columns), 1049 with the 3D7 reference genome are shown for comparison. Single-nucleotide 1050 polymorphisms (SNPs) and indels in each PfSPZ assembly as compared to 3D7 are 1051 shown for coding and non-coding regions (last five columns). # of SNPs4 Indels Cumulative Strain Nuclear N503 Non- Non- Non- Length2 Syn. Coding Contigs1 coding Syn. coding 3D7 14 23,332,831 1,687,656 - - - - - NF54 28 23,426,028 1,527,116 1,278 25 80 8,166 230 7G8 20 22,801,099 1,455,033 22,694 5,675 15,490 40,668 6,730 NF166.C8 30 23,306,751 1,610,926 27,117 6,636 17,258 40,457 6,802 NF135.C10 19 23,540,633 1,631,396 28,791 6,535 18,141 41,435 7,029 1052 1Contigs: number of pieces of continuous sequence in the final assembly. 1053 2Cumulative length: total length of the contigs. 1054 3N50: length of the contig which, along with all contigs larger than itself, contain 50% of the assembly 1055 (larger numbers indicate a more complete assembly). 1056 4Syn SNPs: synonymous SNPs; Non-Syn SNPs: non-synonymous SNPs. 1057

1058

41

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1059 Figures

1060 1061 Figure 1: PacBio Assemblies for each PfSPZ strain reconstruct entire 1062 chromosomes in one-three continuous pieces. To determine the likely position of 1063 each non-reference contig on the 3D7 reference genome, mummer’s show-tiling program 1064 was used with relaxed settings (-g 100000 -v 50 -i 50) to align contigs to 3D7 1065 chromosomes (top). 3D7 nuclear chromosomes (1-14) are shown in grey, arranged from 1066 smallest to largest, along with organelle genomes (M=mitochondrion, A=apicoplast). 1067 Contigs from each PfSPZ assembly (NF54: black, 7G8: green, NF166.C8: orange, 1068 NF135.C10: hot pink) are shown aligned to their best 3D7 match. A small number of 1069 contigs could not be unambiguously mapped to the 3D7 reference genome (unmapped). 1070

1071

1072

1073

1074

42

bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1075

1076 Figure 2: Distribution of polymorphisms in PfSPZ PacBio assemblies. Single 1077 nucleotide polymorphism (SNP) densities (log SNPs/ 10kb) are shown for each 1078 assembly; the scale [0-3] refers to the range of the log-scaled SNP density graphs - 1079 from 100 to 103. Inner tracks, from outside to inside, are NF54 (black), 7G8 (green), 1080 NF166.C8 (orange), and NF135.C10 (hot pink). The outermost tracks are the 3D7 1081 reference genome nuclear chromosomes (chrm1 to chrm 14, in blue), followed by 3D7 1082 genes on the forward and reverse strand (black tick marks).

43 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 3: Predicted CD8+ T cell epitopes from amino acid sequences of the four PfSPZ strains. Using amino acid sequences from a subset of the 5,548 protein-coding genes in the 3D7 annotation (PlasmoDBv24) that were deemed to be correctly annotated between all four PfSPZ strains, CD8+ T cell epitopes were predicted in silico using netMHCpan. A. Density plot of amino acid sequence length distributions of 4,814 genes used for in silico epitope predictions. Overall distribution of gene lengths was extremely similar between all four strains, with the exception of the three heterologous CHMI strains having a slightly higher number of amino acid sequences that weren’t quite as long as they were in the NF54 genome, possibly a reflection of difficulty in mapping annotations to genetically diverse genomes. B. Shared and unique epitope- gene combinations in the 4,814 genes. C. Shared and unique epitope-gene combinations in a subset of the 4,814 genes (n=1,974) that have evidence of being expressed in the sporozoite stage of the parasite in the mosquito salivary glands.

44 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 4: Predicted CD8+ T cell epitopes in the P. falciparum circumsporozoite protein (PfCSP). Protein domain information based on the 3D7 reference sequence of PfCSP is found in the first track, and the following tracks are epitopes predicted in the PfCSP sequences of NF54, 7G8, NF166.C8, and NF135.C10, respectively. Each box in the track is a sequence that was identified as an epitope, and the colors of each box represent the HLA type that identified the epitope.

45 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Figure 5: Global diversity of clinical isolates and PfSPZ strains. Principal coordinates analyses (PCoA) of clinical isolates (n=654) from malaria endemic regions and PfSPZ strains were conducted using biallelic non-synonymous SNPs across the entire genome (left, n=31,761) and in a panel of 42 pre-erythrocytic genes of interest (right, n=1,060). For the genome-wide dataset, coordinate 1 separated South American and African isolates from Southeast Asian and Papua New Guinean isolates (27.6% of variation explained), coordinate 2 separated African isolates from South American isolates (10.7%), and coordinate 3 separated Southeast Asian isolates from Papua New Guinea (PNG) isolates (3.0%). Similar trends were found for the first two coordinates seen for the pre-erythrocytic gene data set (27.1 and 12.6%, respectively), but coordinate 3 separated isolates from all three regions (3.8%). In both datasets, NF54 (black cross) and NF166.C8 (orange cross) cluster with West African isolates (isolates labeled in red and dark orange colors), 7G8 (bright green cross) cluster with isolates from South America (greens and browns), and NF135.C10 (hot pink cross) clusters with isolates from Southeast Asia (purples and blues).

46 bioRxiv preprint doi: https://doi.org/10.1101/684175; this version posted June 27, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

.

Figure 6: NF135.C10 is part of an admixed population of clinical isolates from Southeast Asia. Top: admixture plots for clinical isolates from Myanmar (n=16), Thailand (n=34), Cambodia (n=109), Papua New Guinea (PNG, n=34) and NF135.C10 are shown. Each sample is a column, and the height of the different colors in each column corresponds to the proportion of the genome assigned to each K population by the model. Bottom: hierarchical clustering of the Southeast Asian isolates used in the admixture analysis (n=193; colored by their assigned subpopulation) and previously characterized Cambodian isolates (n=167; black; labels correspond to clusters identified in reference 61) place NF135.C10 (star) with samples from the previously identified KHA admixed population (shown in grey dashed box). Each node ends with a sample, and the y-axis represents distance between clusters.

47