bioRxiv preprint doi: https://doi.org/10.1101/2020.03.14.992107; this version posted January 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Title page

Article title: Evolutionary Pressures and Codon Bias in Low Complexity Regions of Plasmodia Parasites

Authors: Andrea Cappannini (1, *), Sergio Forcelloni (3, *), Andrea Giansanti (1, 2)

Affiliation: 1 Sapienza, University of Rome, Department of Physics, P.le A. Moro 5, 00185 Roma, Italy. 2 Istituto Nazionale di Fisica Nucleare, INFN, Roma1 section, 00185 Roma, Italy. 3 Max Planck Institute of Biochemistry, Munich

Corresponding Author: Andrea Cappannini : [email protected] Sapienza University of Rome, Department of Physics, P.le A. Moro 5, 00185 Roma, Italy.


Abstract

One of the most debated topics in Evolutionary Biology concerns the Low Complexity Regions of P. falciparum, the causative agent of the most virulent and deadly form of human malaria. In this work, we analysed the proteomes of 22 Plasmodium species, including P. falciparum. Proteins in which SEG predicts Low Complexity Regions turn out to be longer than those predicted to be completely complex (without Low Complexity Regions). Moreover, using well-known bioinformatics tools such as the Effective Number of Codons and the Pr2, together with a new index that we have called SPI, we found that proteins that embed Low Complexity Regions are under lower selective pressure than those that do not present this type of locus. By applying the Relative Synonymous Codon Usage and other tools developed ad hoc for this study, we note, instead, that the Low Complexity Regions appear to have a non-neutral codon bias with respect to the host proteins.

Introduction

Due to its critical implications for Human Health, the study of Single Nucleotide Polymorphisms (SNPs) and Copy Number Variations (CNVs) is a fundamental research field (Zhang et al. 2009). Tandem repeat regions (TRRs) represent a third type of genetic variation (Gemayel et al. 2012), whereby proteins host regions of reduced complexity and biased amino acid composition, also called low complexity regions (LCRs) (Toll-Riera et al. 2012). Despite the diversification of the various domains of life, this type of locus is ubiquitous (Kumari et al. 2015), existing in Bacteria, Archaea and Eukarya (Wootton & Federhen, 1996). Taken as a whole, the literature indicates two main mechanisms that are most likely to explain the presence of TRRs and, consequently, their propensity to undergo indels and thus cycles of expansion and contraction. The first is Replication Slippage (RS) (Guy-Franck et al. 2008; Mosbach et al. 2019; Ellegren, 2004; Gemayel et al. 2012; Saitou, 2018). RS is a mutation process that occurs during DNA replication. It involves denaturation and displacement of the DNA strands with the consequent decoupling of complementary bases (Levinson & Gutman, 1987). In a nutshell, a loop-out of the template strand implies a contraction, whilst a loop-out of the nascent strand implies an expansion (Gemayel et al. 2012). The other mechanisms are recombination-like events (Gemayel et al. 2012; Ellegren, 2004; Verstrepen et al. 2005) such as unequal crossing-over and gene conversion. Some authors propose these events as the predominant mechanism in minisatellite regions (Guy-Franck & Paques, 2000), with replication slippage predominating in microsatellite regions (Kokoska et al. 1998). Guy-Franck and colleagues (2008) extensively surveyed the current landscape of definitions of TRRs based on the length of the repeating unit, highlighting a lack of consensus in the definition of microsatellite: some propose a repetitive unit length of up to 5 nt, others of up to 10 nt (Guy-Franck et al. 2008; Gemayel et al. 2012). These loci have a high mutation rate compared to other DNA regions (Brinkmann et al. 1998), ranging between 10^[…] and 10^[…] per cell generation (Gemayel et al. 2012). Past research has identified some factors that would contribute to the instability of these loci. Gemayel and colleagues (2012) compiled an extensive review of the papers that have dealt with this topic: Legendre et al. (2007) highlighted how the presence of multiple repetitive units is the main characteristic that contributes to the instability of these regions, followed by the length and the nucleotide
purity of the tract, i.e. a coherent succession of nucleotides not interrupted by any base without a counterpart in the repetitive sequence (Saitou, 2018). Considering homopolymeric sequences (runs of the same amino acid), Verstrepen and colleagues (2005) instead highlighted how the use of several synonymous codons drastically increases the stability of these loci. The nucleotide content also affects the instability of these regions: poly A or poly T tracts have been observed to be more stable than the corresponding poly G or poly C tracts (Gragg et al. 2002). For a more detailed examination we refer to the original work (Gemayel et al. 2012).

Low Complexity Regions are thought to be the result of tri-nucleotide slippage (Levinson & Gutman, 1987; Mularoni et al. 2006). In proteins, the phenotype associated with these regions is often contradictory. LCRs often have pathogenic implications, such as in neurodegenerative (Gatchel & Zoghbi, 2005) or developmental (Brown & Brown, 2004) pathologies. Nevertheless, LCRs are important for protein fitness. Shen et al. (2004) showed how the Arginine- and Serine-rich binding sites in the Exonic Splicing Enhancer (ESE) contribute to the assembly of pre-spliceosomes, supporting splicing and related activities. Similarly, Salichs and colleagues (2009) noted that histidine-rich sites are pivotal for sub-cellular localization. More generally, accumulating evidence highlights how these regions are preferentially inserted only in certain functional protein classes (Karlin et al. 2001; Alba & Guigo 2004; Faux et al. 2005) and how they do not interrupt functional domains (Newfeld et al. 1994).

P. falciparum is a protozoan that belongs to the Phylum Apicomplexa. It is the etiological agent of the most severe and lethal form of Human Malaria (Pizzi & Frontali, 2001). This parasite has offered a fruitful source of genetic investigations, also due to its particularly high content of LCRs, mostly characterized by asparagine (N) residues. The functional role of these stretches is not fully understood yet. However, several hypotheses have been advanced. Among others, Pizzi and Frontali (2001) proposed that they are subjected to continuous cycles of expansion and de novo generation, representing a resource of antigenic variation that ultimately lets the parasite evade the host immune response. Such a hypothesis has been supported by other investigators in the field (Karlin et al. 2001; Ferreira et al. 2003; Cortès et al. 2005). On the one hand, and in line with a general requirement for neo-functionalization, it was proposed that long N-repetitive stretches influence the local rate of translation, thus triggering ribosome pausing and ultimately behaving as tRNA sponges that assist co-translational folding (Frugier et al. 2009; Filisetti et al. 2013). On the other hand, Forsdyke (2016) discusses the results obtained with Xue (2003), whereby similar N-repetitive stretches might play a double role at the DNA level. Indeed, since the nucleotide content of the P. falciparum genome is skewed towards AT, the N-repetitive stretches are supposed to stabilize mRNAs and prevent the corresponding genes from undergoing deleterious mutations. Undoubtedly, all of the cited works have provided the scientific community with important but also conflicting explanations for the presence of LCRs in P. falciparum. Recent advances have shown how asparagine promotes the formation of amyloid structures following thermal shocks (Halfmann et al. 2011). As already explained by Muralidharan & Goldberg (2013), P. falciparum copes with extremely variable temperatures during its life cycle, e.g., passing from the relatively low, ambient temperature of the mosquito to the human
body at about 37 °C and vice versa. Moreover, Malaria is characterized by numerous cycles of fever, eventually causing the host temperature to rise above 40 °C. Therefore, the continuous thermal fluctuations can lead proteins to unfold as well as misfold, which could ultimately lead to the death of the parasite. However, although Q has biochemical characteristics similar to N (Strachan et al., 2020) and is similarly prone to form prions, N reduces proteo-toxicity (Halfmann et al. 2011). While showing that Heat Shock Proteins prevent the formation of aggregates and prion-like fibrils, Muralidharan et al. (2012) and Muralidharan and Goldberg (2012) proposed, among other remarkable hypotheses, a mechanism to explain how these regions can contribute to the formation of new folds and functions.

In the present work, we studied the proteomes of 22 Plasmodium species, operating on two fronts. On the one hand, the study of the Codon Usage Bias (CUB) of Low Complexity Regions by means of the Relative Synonymous Codon Usage (RSCU, Sharp & Li, 1987) and the application of Shannon's Entropy (Shannon, 1948) provided us with remarkable insights into the selective pressure acting on these regions. On the other hand, based on the presence of at least one Low Complexity Region, we stratified the proteins into LCPs (Low Complexity Containing Proteins) and non-LCPs (hereafter referred to as nLCPs), the latter characterized by the absence of LCRs. Although this sub-setting was obtained from SEG predictions (Wootton & Federhen, 1996) performed using default parameters, and the results could be improved by further parameter refinement (Batistuzzi et al. 2016), our downstream results displayed internal consistency. The study of protein lengths has highlighted how, in these organisms, LCPs are intrinsically longer than nLCPs, suggesting a causative relationship between the two structural characteristics. To address the issue of Darwinian Selective Pressures, we selected two common bioinformatics tools, i.e. the Effective Number of Codons (Wright, 1990), using the improved version of Sun and colleagues (2012), and a Pr2 analysis (Sueoka, 1995; Sueoka & Kawanishi, 1999). In addition to these two indices, we have devised a completely new one, hereafter referred to as SPI (Selective Pressure Index), that allows distances from Wright's Theoretical Curve to be compared. Together, these three tools allowed us to highlight how, compared to nLCPs, LCPs are characterized by a more heterogeneous CUB and are further distinguished from nLCPs by a lower Selective Pressure. Our results apply consistently to all of the proteomes analysed and provide new insights about Low Complexity Regions.

Methods

Data Sources and general methodologies

Plasmodium proteomes were retrieved from the NCBI GenBank (ftp://ftp.ncbi.nih.gov). The Brazilian strain of P. vivax was considered. P. praefalciparum, P. adleri and P. o. curtisi were retrieved from PlasmoDB (https://plasmodb.org). We considered only CDSs that start with methionine (ATG), end with a stop codon (TAG, TAA or TGA) and have a length that is a multiple of three.

Each CDS was translated into the corresponding amino acid sequence with the MATLAB function nt2aa. The remaining analyses were carried out with custom scripts and built-in MATLAB functions. For some analyses we relied on boxplot visualization. When two boxplots are compared using the MATLAB notch option, non-overlapping notches indicate that the medians of the boxes are significantly different at the 5% significance level (Mathworks).
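The CDS selection criteria stated above (ATG start, terminal stop codon, length a multiple of three) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' MATLAB code: the function names `is_valid_cds` and `filter_cds` and the toy records are hypothetical, and the sketch does not check for internal stop codons because the text does not mention such a check.

```python
from typing import Dict

STOP_CODONS = {"TAG", "TAA", "TGA"}  # stop codons listed in the text


def is_valid_cds(seq: str) -> bool:
    """Apply the three selection criteria to a nucleotide string."""
    seq = seq.upper()
    return (
        len(seq) >= 6                 # room for at least ATG plus a stop codon
        and len(seq) % 3 == 0         # length is a multiple of three
        and seq.startswith("ATG")     # starts with methionine
        and seq[-3:] in STOP_CODONS   # ends with a stop codon
    )


def filter_cds(records: Dict[str, str]) -> Dict[str, str]:
    """Keep only the records whose sequence passes is_valid_cds."""
    return {name: seq for name, seq in records.items() if is_valid_cds(seq)}


# Toy records (hypothetical identifiers, not real Plasmodium genes)
toy = {
    "ok": "ATGAATAATTAA",
    "no_start": "CTGAATAATTAA",
    "no_stop": "ATGAATAATAAT",
    "bad_length": "ATGAATAATTA",
}
kept = filter_cds(toy)
print(sorted(kept))
```

In this toy set only the record named `ok` satisfies all three conditions; each of the other records violates exactly one of them.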

We identified LCRs with the SEG algorithm, run with standard parameters (window = 15, K1 = 1.5 and K2 = 1.8), which ensure that the identified sequences correspond to strongly biased sequences while, at the same time, allowing for substantial sequence diversity (Trilla & Albà, 2012).

nLCPs and LCPs often occur in groups of different sizes (see Supplementary Materials, SM). Where distributions do not considerably diverge from a normal distribution, we relied on the Welch t-test (Ruxton, 2007; Derrick et al. 2016); otherwise the Mann-Whitney test was used. For multi-comparison tests we relied on Welch ANOVA followed by Bonferroni correction (MT1) when distributions do not deviate considerably from normality, otherwise on the Kruskal-Wallis test followed by Bonferroni correction (MT2).

Tandem Repeat Regions

To extract the DNA of the low complexity regions we used a custom MATLAB script that identifies the LCRs on the proteins and extracts the related DNA from the corresponding gene.

GC, ENC, ENC plot and Pr2 analysis

We used the same rationale proposed by Forcelloni and Giansanti (2020) for the implementation of the ENC (Wright, 1990), using the improved implementation of Sun et al. (2012), the ENC plot, the Pr2 plots (Sueoka, 1995; Sueoka & Kawanishi, 1999) and the calculation of the GC content. We therefore refer to Forcelloni & Giansanti (2020) for further details concerning these tools. In this work, ENC and Pr2 scores are calculated separately for LCPs and nLCPs for each parasite.

RSCU

An RSCU analysis (Sharp & Li, 1987) was carried out. RSCU measures codon usage for each codon in each codon family (Xia, 2007). Basically, RSCU is a normalized codon frequency that is expected to be 1 when there is no codon usage bias, greater than 1 when the codon is overused, and less than 1 when the codon is underused. RSCU is formally defined as follows:

$$\mathrm{RSCU}_{ij} = \frac{n_i \, \mathrm{CdnFreq}_{ij}}{\sum_{j=1}^{n_i} \mathrm{CdnFreq}_{ij}},$$

where $\mathrm{CdnFreq}_{ij}$ is the frequency of codon j in codon family i, $n_i$ is the degeneracy of codon family i, and $\sum_{j} \mathrm{CdnFreq}_{ij}$ is the sum of the occurrences of the codons in that codon family, which encode the same amino acid. It is worth noting that the maximum value a codon can reach depends on the codon family to which it belongs: if the degeneracy of a codon family is n, then the maximum value is n. We introduced a modification to the original version of RSCU proposed by Sharp & Li (1987), in line with the division of the 6-fold codon families into 4-fold and 2-fold families proposed by Sun and colleagues (2013) in their improved version of ENC, which we use in our ENC analysis. We purposely neglect ATG (M) and TGG (W): being encoded by only one codon, they cannot show bias. We computed, for each species, the RSCU on the overall codon content of the Low Complexity Regions.

Shannon Entropy

To compare the complexity of Tandem Repeat Regions we relied on Shannon's Entropy (Shannon, 1948), computed as follows:

$$H_j = -\sum_{i} p_{ij} \log(p_{ij}),$$

where $p_{ij} = c_{ij} / L_j$, with $c_{ij}$ the sum of the occurrences of the i-th codon in the j-th tandem repeat of length $L_j$. Entropy is used to represent in a compact way Tandem
Repeat Regions and their bias towards fewer or more codons.

SPI

The Effective Number of Codons describes the codon bias of a gene. The distance of a gene from Wright's Theoretical Curve (WTC) provides an estimate of the extent to which Mutational Bias and Selective Pressure affect the codon bias of that gene (Novembre, 2002). On the one hand, the same ENC score can be associated with different distances from the curve; on the other, the same distance in absolute value can, as GC3 varies, represent a different deviation from the expected value given by Wright's Theoretical Curve. Distances in absolute value are therefore not equivalent. For this reason, we propose the SPI (Selective Pressure Index), defined as follows:

$$\mathrm{SPI}(GC_3) = \frac{\mathrm{WTC}(GC_3) - \mathrm{ENC}(GC_3)}{\mathrm{WTC}(GC_3) - a},$$

where $\mathrm{WTC}(GC_3)$ is the value of Wright's theoretical curve in correspondence of a given $GC_3$, $\mathrm{ENC}(GC_3)$ is the ENC value of the gene in correspondence of that $GC_3$ value, and a = 20. SPI weights the shift of the coding sequence from the null model of no codon preference (the situation where a gene lies on the curve) against the situation of extreme bias, namely where just one codon is used for each amino acid, for that $GC_3$. When a gene lies on the theoretical curve, SPI is expected to be 0 (no distance from the WTC). Conversely, when a gene is extremely subject to selection, SPI is expected to be 1 (situation of extreme bias).

Cladistic and Organism Overview

We briefly describe how the parasites are phylogenetically related and why we placed them in one subgroup rather than in another. Extending the knowledge about the various radiations that led to these subgroups is beyond our scope. P. falciparum has been placed together with the other parasites belonging to the monophyletic subgenus termed Laverania (Otto et al. 2018; Liu et al. 2016), which comprises P. gaboni and P. reichenowi (which infect chimpanzees), and P. praefalciparum and P. adleri (which infect gorillas). Notably, P. falciparum is the only one that successfully infects humans (Otto et al. 2018); although it was long considered a human-specific pathogen, it also infects gorillas, raising concerns about possible reciprocal host transfer (Prugnolle et al. 2010). As far as the Asian monkey parasites are concerned, we retrieved data for P. vivax, which represents a serious threat to human health given its extremely wide geographical coverage (Howes et al. 2016). P. knowlesi was recently recognized as a human pathogen (Rich & Xu, 2011) given a major outbreak of human infections in 2004 (Singh et al. 2013). Until this outbreak, human contagions were thought to be rare (Faust & Dobson, 2015). Moreover, we considered P. cynomolgi and P. coatneyi (Galinski & Barnwell, 2012), P. inui (Galland, 2000) and P. gonderi (Arisue et al. 2019). We refer to these parasites as Simian Plasmodia. To obtain a picture of the evolutionary and quantitative diversification of LCRs we also considered murine rodent plasmodia (Vinckeia Subgenus). We retrieved P. vinckei (Carter & Walliker, 1975), P. petteri, P. chabaudi, P. berghei and P. yoelii (Garnham 1964). Evidence links the evolutionary origin of P. falciparum to avian plasmodia (Waters et al. 1991; Waters et al. 1993 a-b; McCutchan et al. 1996; Escalante & Ayala 1994). To extend our efforts we considered the two available specimens of the Haemamoeba Subgenus (Corradetti
et al. 1963): P. gallinaceum, proposed as a possible ancestor of P. falciparum (Waters et al. 1991; Escalante et al. 1998), and P. relictum, which is one of the most geographically widespread malaria parasites of birds (Valkiunas, 2005). We refer to these parasites as Haemamoeba Subgenus. Lastly, we considered P. ovale wallikeri (P. ovale), the etiological agent of tertian malaria (Collins & Jeffery, 2005), P. ovale curtisi (P. o. curtisi) (Kristan et al. 2019) and P. malariae, which causes quartan malaria (Collins & Jeffery, 2007). We refer to these parasites as Human Infectious Plasmodia (HIPs).

Results

GC Content Influence

Having the possibility to make a statistical comparison among several parasites, we examined the role of Guanine and Cytosine content in order to understand the degree to which the LCRs of P. falciparum, and more generally of the other parasites, are influenced by genomic biases. We considered the Laverania Plasmodia as a yardstick for our analysis. In Tab. 1 we summarize the statistics from the various experiments we performed during the analysis. The first column collects the Pearson correlation coefficients between the amount of Low Complexity Regions contained in proteins and their length. The following columns report the average Guanine and Cytosine content of each protozoan and the quantity of low-complexity regions possessed by each of them. All the correlations are significant (p < 0.01). We divided the statistics of each column by parasite group, so that each column was split into 5 distributions. MT2, applied to the three distributions, indicates that the Laverania, Haemamoeba and Vinckeia subgroups do not differ significantly in Guanine and Cytosine content. However, the Vinckeia have significantly lower correlations than the P. falciparum group. The same holds for the content of Low Complexity residues. Although the Haemamoeba and Laverania Plasmodia do not differ significantly in correlation coefficients, the small sample size of the group could lead to misleading results. In a nutshell, genomic bias does not seem to confer a particular selective advantage, although we do not deny its role in the choice of synonymous codons.

RSCU

We wanted to study the Codon Bias of the Low Complexity Regions to derive information about the selective pressures acting on their synonymous codons. To do this, we used the Relative Synonymous Codon Usage (RSCU, Sharp & Li, 1987) applied to the overall codon content of the LCRs. We deliberately neglected ATG (M) and TGG (W) because, being encoded by only one codon, they cannot show bias. The results are displayed in Fig. 1. Data are standardized along each column. A codon with a value above the column mean is represented in red, otherwise in green. Details of the clustering algorithm and distance are provided in the figure caption. The heatmap indicates that the Protozoa can be divided into two main lines, although there are differences within each group. The Laverania, Haemamoeba and Vinckeia Plasmodia mainly prefer codons whose wobble position is occupied by Adenine or Thymine, when compared to the remaining subgroups (HIPs and Simian Plasmodia). The latter, on the contrary, have larger RSCU values for codons whose wobble position is occupied by Cytosine or Guanine. We wanted to investigate the RSCU values within each species further, in order to detect any more specific preferences at the wobble position.
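As a point of reference for the wobble-position comparison, the RSCU of a single codon family can be sketched in Python as below. This is only an illustration of the standard Sharp & Li formulation, not the authors' MATLAB implementation; the helper names `count_codons` and `rscu_for_family` and the toy asparagine sequence are hypothetical. The last step divides each RSCU value by the family size, so that families of different degeneracy share the same 0 to 1 range.

```python
from collections import Counter
from typing import Dict, List


def count_codons(seq: str) -> Counter:
    """Count in-frame codons; a trailing partial codon is ignored."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def rscu_for_family(codon_counts: Dict[str, int], family: List[str]) -> Dict[str, float]:
    """RSCU of each codon in one synonymous family: n * count / family total."""
    n = len(family)
    total = sum(codon_counts.get(codon, 0) for codon in family)
    if total == 0:
        # The amino acid never occurs, so RSCU is undefined for this family
        return {codon: float("nan") for codon in family}
    return {codon: n * codon_counts.get(codon, 0) / total for codon in family}


# Toy example: asparagine is encoded by the two-fold family AAT / AAC
ASN_FAMILY = ["AAT", "AAC"]
counts = count_codons("AATAATAATAAC")        # three AAT, one AAC
rscu = rscu_for_family(counts, ASN_FAMILY)   # AAT -> 1.5, AAC -> 0.5

# Normalize by family size so that the maximum possible value is 1
normalized = {codon: value / len(ASN_FAMILY) for codon, value in rscu.items()}
print(rscu, normalized)
```

In this toy family the A/T-ending codon AAT is overused (RSCU above 1) and the C-ending codon AAC is underused (RSCU below 1), which is the kind of contrast examined at the wobble position.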

To investigate these wobble-position preferences, we compared the RSCU values of codons ending with a purine (A or G) and codons ending with a pyrimidine (T or C). The Relative Synonymous Codon Usage is a normalized index whose maximum value depends on the degeneracy of the codon family. Therefore, we normalized the RSCU values with respect to the number of codons of the family they belong to (Mann-Whitney test). As expected, the Laverania, Vinckeia and Haemamoeba Plasmodia follow the same pattern, showing a clear preference for A over G and T over C in the wobble position (p < 0.01), as do the HIPs (p < 0.01). We find the same pressures in P. gonderi (p < 0.01). Interestingly, we do not find the same stark split in many of the Simian Plasmodia, such as P. vivax, P. cynomolgi, P. fragile and P. inui, whose wobble position appears to be contested in a tug-of-war between purines and pyrimidines, with no significant differences (p > 0.05). Evolutionary pressures are slightly different in P. coatneyi and P. knowlesi. The former prefers G over A and T over C (p < 0.05). The latter presents wobble positions better suited to its GC content, preferring G to A and C to T (p < 0.05). The RSCU table is provided in the Supplementary Materials (RSCU.xlsx). Overall, we obtained information about the synonymous codons used in these organisms. However, RSCU is a normalized index that neglects the quantitative contribution of a codon family which, in contrast, could be selectively avoided.

Shannon Entropy and Complexity of Tandem Repeat Regions

We investigated the codon composition of the LCRs of each plasmodium using Shannon Entropy to understand their structural arrangement and derive a rationale for inferring the presence of different selective pressures. Fig. 2 (A) illustrates the distributions of the 5 parasitic groups. Best-fit parameters and models are given in the figure caption. The elbow of the regressions shows that, on average and for the same length, the LCRs of the Laverania Plasmodia are less complex than those of the other parasitic groups, especially HIPs and Simian spp. To better understand the phenomenon we studied the single distributions, as shown in Fig. 2 (B). Preliminary observation of the violin plots shows how Plasmodium species arrange their TRRs differently: as entropy increases, a greater number of codons is used along the same TRR. We applied MT1. The Bonferroni correction indicates that the Laverania Plasmodia use fewer codons (synonymous / non-synonymous) along the extension of their LCRs, as opposed to other species whose Low Complexity Regions appear more complex. The only parasite for which the comparison with the Laverania Plasmodia is not significant is P. berghei. Fig. 2 (C) shows the correlation between the average complexity of TRRs and the average Guanine and Cytosine content of each organism. In general, average TRR complexity increases with the GC content, even if the fluctuations suggest the presence of other forces in determining the nucleotide composition of these regions.

Codon Bias

We analyzed the codon bias of the Low Complexity Regions from a quantitative point of view, since the RSCU cannot return quantitative information regarding the use of codons (Fig. 3). The plotted quantity represents the ratio between the total count of a codon and the total number of codons present within the Low Complexity Regions of a parasite. All the species,
506 although in different proportions, show an increase in 543 which appear to have the longest LCPs among all

507 the codons used less commonly than those with higher 544 parasites and P.berghei which instead appears to have

508 percentages, unlike the Laverania Plasmodia which 545 the shortest, no differences emerge such as to justify

509 have a skewed distribution towards a few species of 546 the overabundance of LCRs typical of the Laverania

510 codons. 547 Plasmodia. Overall, LCPs emerge longer de facto than

548 nLCPs which weaves, in general, a causative link 511 Protein Length 549 between protein length and the presence of Low

Given the positive correlations between LCRs and protein length observed in the first paragraph, we analyzed the protein length of each Plasmodium by stratifying proteomes into two distributions, i.e., LCPs and nLCPs, represented in red and green respectively. We compared the two distributions in each organism (Welch t-test) (Fig. 4 (a)). LCPs emerge to be significantly longer than nLCPs in each parasite (p << 0.01). We then considered the linear extension of the LCRs. Taken individually, LCRs do not exceed 250 amino acids in length (see Supplementary Materials). However, it is not uncommon to find more than one Low Complexity Region within the same protein. Therefore, for the avoidance of doubt, we studied the proportion between LCR and protein lengths to establish whether, on average, the greater length of LCPs was merely an artifact due to the presence of these stretches (Fig. 4 (b)). LCRs turn out to occupy a fairly small space, rarely reaching 6% of the entire length of the polypeptides. We therefore deprived the LCPs of their Low Complexity Regions and repeated the comparison between LCPs and nLCPs (Fig. 4 (c)) (Welch t-test). LCPs emerge as intrinsically longer than nLCPs (p << 0.01). Even if the graph shows a lack of statistically significant difference between the length distributions of LCPs of the various species, we applied MT1 to these to understand whether there could be differences such as to explain the overabundance of LCRs typical of the Laverania Plasmodia. Apart from P. gonderi and P. malariae [...] Complexity Regions.

ENC

We relied on the Effective Number of Codons (ENC) (Wright, 1990), using the improved implementation proposed by Sun and colleagues (2013). ENC is a widely used index that, unlike the Codon Adaptation Index (CAI) (Sharp & Li, 1987), does not require a reference set. We calculated ENC scores separately for LCPs and nLCPs, represented in red and green, respectively. Laverania, Vinckeia and Haemamoeba Subgenera (Fig. 5 (A) (C) (D)) show similar distributions placed on the left side of the ENC plane, highlighting a low GC content in the wobble codon position. Interestingly, nLCPs of P. yoelii, despite mainly grouping on the left side of the ENC plot, distribute throughout the plane following Wright's Theoretical Curve. HIPs (Fig. 5 (B)) have a content halfway between the former Subgenera and Simian Plasmodia, whose ENC plots, in turn, are in line with other works (Gajbhiye et al. 2017; Yadav & Swati, 2012). Remarkably, P. vivax and P. cynomolgi have a portion of their nLCPs positioned on the left side of the ENC plane (Fig. 6), which further emphasises their close phylogenetic relationship (Sanger Institute). All but Simian Plasmodia show ENC values that distribute linearly with GC3. We collected the correlation coefficients (Pearson) of ENC vs GC3 in Tab. 2, together with the regression slopes of both nLCPs and LCPs. Given the shape of its distributions, we also added P. gonderi.
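
Wright's Theoretical Curve, used as the reference in the ENC plots, can be illustrated with a short sketch. It assumes the standard closed form from Wright (1990), ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC fraction at the third codon position (GC3). The function names are hypothetical, and the sketch reproduces neither the improved ENC of Sun and colleagues nor the SPI.

```python
def wright_expected_enc(gc3: float) -> float:
    """Expected ENC when codon usage is shaped by composition alone (Wright, 1990).

    gc3 is the GC fraction at synonymous third codon positions, in [0, 1].
    """
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def deviation_from_curve(enc_observed: float, gc3: float) -> float:
    """Vertical distance of a gene below the theoretical curve.

    Hypothetical helper: a positive value means the gene uses fewer codons
    than its GC3 alone would predict.
    """
    return wright_expected_enc(gc3) - enc_observed
```

A gene lying on the curve has deviation 0; the further it falls below the curve, the less its codon usage can be explained by nucleotide composition alone.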

The correlations are all significant (p<0.01). We compared the slope distributions of nLCPs and LCPs (Mann Whitney). LCP regressions are on average represented by steeper slopes (p<0.01). The Mann Whitney test indicates a greater influence of [...] pressure on LCPs. In Tab. 3 we collected the [...] of the regression curves we calculated for Simian Plasmodia. Models and parameters, calculated with 95% confidence bounds, are provided in the caption of Fig. 6. Generally, the regression curves of LCPs tend to lie above those of nLCPs. To add statistics to our analysis, we compared the ENC scores of LCPs and nLCPs parasite-wise (Fig. 7) (Welch t-test). Most of the comparisons show that the LCPs have a more redundant codon bias compared to nLCPs (p<<0.01). The test fails only for P. chabaudi, P. yoelii and P. berghei (p>0.05), even if their medians differ with 95% confidence. In a nutshell, LCPs have a more redundant CUB, which suggests a stronger action of mutational bias than of Darwinian Selective Pressure (positive / negative).

SPI

Analysis of the Effective Number of Codons has revealed some very interesting details related to the codon bias of LCPs and nLCPs which, on average, seem to differ in the extent to which selective pressure and mutational bias affect the choice of their codons. The problem with the ENC is that the distances in absolute value from the Wright Theoretical Curve are not equivalent to each other, and therefore two equal ENC values can represent two different contributions of selective pressure and mutational bias. In this regard we applied the SPI (Fig. 8), repeating what we had already done in Fig. 7 (Welch t-test). The SPI analysis corroborates what was primarily observed by ENC. Indeed, LCPs, on average, emerge to be under a lower selective pressure than nLCPs (p<<0.01), diverging less from the Wright Theoretical Curve in each parasite. Medians of each pair differ with 95% confidence. We evaluated the hypothesis that a lower selective pressure could favour a greater pervasiveness of LCRs in the Laverania Subgenus. We therefore applied MT1 to the LCP distributions. The test returns contradictory values. In fact, although the Simian group shows the highest SPI values compared to all the other parasites considered here, some parasites of the Vinckeia and Haemamoeba groups show values similar to or lower than the P. falciparum group. Thus, seeing that LCPs of AT-rich groups undergo similar patterns of selective pressure, the hypothesis that greater mutational bias increases LCRs lapses. For a better understanding we refer to the interactive MATLAB plot left in the Supplementary Materials. Duret and Mouchiroud (1999) noted a negative correlation between protein length and codon bias in C. elegans, D. melanogaster, and A. thaliana. We therefore decided to retrace what they did using the SPI (Fig. 9). In the proposed plane, each data point (Laverania Plasmodia) represents a protein whose length (abscissa) is measured in nucleotides. SPI values are placed on the ordinate. The regression model (fit MATLAB function) and parameters, calculated with 95% confidence interval, are provided in the image caption. Again, we point out that the LCPs have been deprived of their LCRs. The graph highlights how the contribution of mutational bias tends to increase with the length of the protein for both nLCPs and LCPs. Selective Pressure decreases as protein length increases. Similar trends are also reported for the other parasitic groups, whose graphs and models are provided in Supplementary Materials. Recalling the correlations between LCRs and protein length observed in the first paragraph, these trends contain another piece of hidden information, namely that the lower the
selective pressure, the more LCRs are inserted. This weaves, more clearly, a relationship between mutational bias and Low Complexity Regions.

Pr2

To strengthen our confidence regarding the Darwinian Pressures shaping the CUB of Plasmodium species, a Parity Rule 2 (Pr2) analysis was performed (Fig. 10). Once more, we calculated Pr2 scores separately for LCPs and nLCPs, represented in red and green, respectively. Pr2 plots are provided with the centroids of nLCP and LCP distributions. Laverania, Vinckeia and Haemamoeba Subgenera show distributions located near the centre of the Pr2 plots, even if their tails tend to diverge towards the external edges of the Pr2 planes, confirming the actions of both selective pressure and mutational bias. Unlike the Vinckeia Subgenus, whose parasites have both their centroids placed in the second quadrant of the Pr2 plot, the Laverania and Haemamoeba Subgenera have their LCP centroids in the first quadrant, showing, to a certain degree, a preference for A and G in the wobble codon position of their 4-fold codon families. Unlike the Haemamoeba Subgenus, the nLCP centroids of Laverania parasites appear in the second quadrant of the Pr2 plane, with a consequent preference for C and G, in contrast to the preference for A and G reported for bird Plasmodia. Considerations similar to those for Haemamoeba Plasmodia can be drawn for HIPs. Simian Plasmodia show more rounded distributions, representative of their even GC content. Interestingly, recalling the nLCPs of P. berghei that distributed throughout the ENC plane and the AT-rich islands of P. vivax and P. cynomolgi, these three clusters disappear in the respective Pr2 plots, underlining that, despite the differences evidenced through the ENC analysis, the 4-fold codon families of these protein coding genes undergo selective trends similar to those of the other nLCPs, vanishing in the green clouds. The Pr2 plot is characterized by a vectorial nature that allows one to discriminate among the various pressures driving protein coding genes into one quadrant of the Pr2 plane rather than another. Thus, the diversified selective pressures pushing on protein coding genes can, mathematically, cancel each other out, making the centroids under-representative. Therefore, we operated as Forcelloni & Giansanti (2020), calculating the distributions of gene deviations from the Pr2 centre as:

\[ r = \sqrt{\left(\frac{A_3}{A_3 + T_3} - 0.5\right)^2 + \left(\frac{G_3}{G_3 + C_3} - 0.5\right)^2} \]

r better indicates the extent of Pr2 violations. In particular, if r = 0 the genes witness a perfect balance between natural selection and mutational bias. If r > 0, natural selection and mutational bias provide different contributions, resulting in a greater Pr2 violation (Forcelloni & Giansanti, 2020). We compared the r distributions of nLCPs and LCPs parasite-wise (Welch t-test) (Fig. 11). As in the other comparisons (ENC and SPI), LCPs tend to stay closer to the parity centre than nLCPs do (p<<0.01 on average, P. fragile p<0.01). We were further intrigued by the relationship between Pr2 violation and the length of protein coding genes. To unveil this information we used a 3D space, where the (X,Y) plane is the Pr2 plane and the height is gene length. Given the difficulties of layout and the redundancies of the graphs, we show here only Laverania Plasmodia (Fig. 12). The rest of the parasites are provided in SM. We observe that longer proteins tend to cluster near the centroids, which in turn do not deviate appreciably from the parity centre, a characteristic that is in accordance with what was predicted through SPI analysis. So, LCRs are more abundant where the Pr2 deviation is smaller. Overall, Pr2 analysis sustains what we stressed through SPI
analysis, highlighting through 4-fold codons the relationship between mutational bias and LCRs.

Discussion

Malaria is one of the most severe public health problems worldwide, being the leading cause of death and disease in many developing countries, with 405,000 estimated deaths in 2018 (CDC). Comparative genomics is a powerful tool to unravel evolutionary changes among organisms, helping to identify the biological characteristics that give each life form its unique attributes (Nature Education). In this study, we have derived a set of information regarding the selective pressures acting on 22 Plasmodium species in order to advance a hypothesis regarding the nature of LCRs in P. falciparum.

GC content

In line with the literature (Zilversmit et al. 2010; DePristo et al. 2006; Frugier et al. 2009; Filisetti et al. 2013; Pizzi & Frontali, 2001; Xue & Forsdyke, 2003), the P. falciparum proteome presented low levels of Guanine and Cytosine. In this regard, the analysis points to only a marginal relationship between genomic bias and the abundance of Low Complexity Regions. Our statistical comparisons do not support the hypothesis that the preponderance of LCRs in the Laverania family is driven by genomic bias. In P. falciparum, Hamilton and colleagues (2016) have shown in vitro a pronounced excess of G:C to A:T transitions, finally attributing the genomic skew of these parasites to mutational bias. Assuming that similar transitions also occur in the other AT-rich parasites, the simple genomic bias does not explain the extreme diversity of these organisms both in quantity of LCRs and in quantity of LCPs, in which the Laverania family seems to be particularly abundant (see SM). In this sense, we agree with Chaudhry et al. (2018) in regarding the role of genomic biases as marginal. On the other hand, we do not deny that GC content has a strong influence on the choice of synonymous codons.

RSCU and LCRs

We applied RSCU (Sharp & Li, 1987) to the DNA of LCRs. We identify similar preferences in the most used codons of the AT-rich species, while Simian Plasmodia display a shuffled repertoire of RSCU values that is in line with their even GC content. Clustering analysis (Ward, 1963) produces relevant observations that are consistent with the literature. It divides species into two main evolutionary lines, emphasizing a mutual difference in their G / C or A / T wobble positions. An evolutionary blueprint of the Protozoa is indeed reflected, conserving the sister relationship between P. cynomolgi and P. vivax (Tachibana et al. 2012; Hayakawa et al. 2008). P. gonderi is placed in the same lineage as Simian Plasmodia, stressing, once more, an African origin for these parasites (Arisue et al. 2019) and further emerging to undergo similar pressures in the choice of synonymous codons for their LCRs. Remarkably, the branching of the Laverania family is consistent with data from various other works (Otto et al. 2018; Silva et al. 2011; Rich et al. 2009), highlighting a common ancestry with Vinckeia (Ramiro et al. 2012) and Haemamoeba (Waters et al. 1991; Escalante et al. 1998), given the striking similarity of the CUB of their LCRs.

On the other hand, the heatmap returned a comparison of the RSCU values among all species, blurring the choice within the single ones. The analysis of the normalized RSCU values illustrates, for the species belonging to the Laverania, Vinckeia and Haemamoeba Subgenera, confirming what was seen
through the heat map, a clear pressure for wobble positions in A or T. Similar conclusions are drawn for P. gonderi and for HIPs, despite their having been placed in the second evolutionary group by the clustering algorithm. The situation of the species belonging to the rest of the Simian family is different. In fact, the action of drift (Bulmer, 1991) is more pronounced where, most of the time, we do not find a stark preference between wobble positions in A / G or T / C. So the LCRs of Simian Plasmodia, compared to the others, appear to be under a lower selective pressure, these pressures being apparently disputed in a tug of war between multiple synonymous codons.

Complexity

We used Shannon's Entropy to compare the complexity of the TRRs belonging to these micro-organisms. We have shown that the LCRs of Laverania Plasmodia are composed of fewer types of codons than those of other parasites; we observed that, for the same length, the Laverania Plasmodia have less complex LCRs than the other species analysed here. Furthermore, the correlation between average complexity and GC content suggests that genomic bias cannot be the only contributor to the codon bias of LCRs. Among the various factors contributing to the instability (the proclivity to expand or contract) of a TRR there are the purity (Ngai & Saitou, 2016; Saitou, 2018; Gemayel et al. 2012; Legendre et al. 2007) of the nucleotide tract and its length (Gemayel et al. 2012; Legendre et al. 2007); the nature of the codon bias upstream of a repetitive sequence of amino acids (Verstrepen et al. 2005); and the nucleotide composition (Gragg et al. 2002). As stated, on average, the LCRs of Laverania Plasmodia appear to be shaped by a lower number of codon species than those of the other species, which on average exhibit higher complexity: this suggests a greater instability. Moreover (see Supplementary Materials), the predominance in the Laverania Subgenus belongs to the AAT codon, which covers an interquartile range between 35% and 80% of the length of the LCRs. In the other species rich in Adenine and Thymine, this dominance is contested by the GAA (E), GAT (D) and AAA (K) codons. Asparagine has proven to be dispensable in the P. falciparum proteasome lid subunit 6 (Rpn6) (Muralidharan et al. 2010). However, re-proposing the question already posed in a more general context by Gemayel and colleagues (2012), whether a repetitive unit is chosen for its mutability, or whether its mutability is rather rewarded for obtaining some effect on fitness, as far as asparagine is concerned we would expect a similar pervasiveness also in the other AT-rich parasites if it were for the first reason. Furthermore, judging by the values of the SPI and the Pr2 violations, the LCPs of the AT-rich parasites are under similar evolutionary pressure. Therefore, attributing the massive presence of asparagine in the Laverania Plasmodia to a lack of selective constraints against its diffusion does not appear correct since, if this were the case, we would expect, again, a much more abundant presence of N also in the Haemamoeba and Vinckeia Plasmodia. Hence, the LCRs appear to be selected at the nucleotide level. As previously hypothesised by Dalby (2009), we infer asparagine to be under positive selection.

Codon Bias

RSCU has a broad spectrum of maximum values which, however, may not reflect a predominant use of a family of codons which, despite high RSCU values, may be little used. For this reason, we wanted to quantitatively study the codon composition of LCRs. The observation of the bar charts allows us to consider
some important aspects of the composition of the LCRs. The first question concerns the species of codons most used, where the preponderance of asparagine (mostly encoded by AAT) is particularly accentuated in the Laverania family. In fact, the codons most present in general in all species encode E, S, D and R, which represent the most common substitutions for N and K (NCBI-N, NCBI-K). NCBI tables stress the similarity of the chemical-physical characteristics of these amino acids. Although a more in-depth study is necessary to better visualize the composition of LC runs, we observe a certain tendency in all species to preserve the CUB of these loci, since amino acid changes appear to be primarily conservative, i.e. the properties of the polypeptide do not drastically change (Strachan et al. 2020). Notably, LCPs of the Simian Plasmodia lie in the central part of the ENC plots, which denotes a higher GC3 content. In spite of this, it is not uncommon to observe that the linear extension of the LCRs of Simian spp. is composed of codons rich in Adenine, as is the case of GAA (E), which rivals its synonym GAG (see Supplementary Materials), and in general of other codons going against the GC pressure which appears to press on their proteomes.

Protein Length

We observed that LCPs are de facto longer than nLCPs. This implies that the LCRs are preferentially inserted into the longest proteins of each parasite. Pizzi and Frontali reported how some of the P. falciparum proteins emerged to be longer than the orthologues found in other organisms (Pizzi & Frontali, 2001) by virtue of the presence of LCRs, as retraced by Xue and Forsdyke (2003), who report how these are placed between domains to which a specific function can be ascribed. Therefore, we hypothesize that longer proteins are selected to host LCRs because there is less chance of disrupting a functional domain, by squeezing them into interdomains. This would also explain the positive correlations between protein length and LCR abundance found in each parasite (Tab. 1). It is not clear whether recombination or replication slippage is the predominant mechanism. The lack of consensus in distinguishing between micro- and minisatellite regions (Guy-Franck et al. 2008) makes it difficult to propose a generalized hypothesis discriminating between the two mechanisms, where in general replication slippage is suggested for microsatellites (Guy-Franck & Paques, 2000) and recombination-like phenomena are suggested for minisatellite regions (Kokoska et al. 1998) and less complex LCRs (Ellegren, 2004). Notably, in P. falciparum, Zilversmit et al. (2010) suggest unequal crossing over events for GC-rich LCRs, which can likely be extended to the other Laverania spp. Overall, more evidence is needed to establish with greater reliability that LCRs space functional domains without disrupting them; this is left for future research.

Darwinian Pressures

Codon Bias and Gene Expression

To cope with Darwinian selective pressures, we relied on the Effective Number of Codons (Wright, 1990), using the improved version of Sun et al (2012), SPI, and Pr2 (Sueoka, 1995; Sueoka & Kawanishi, 1999). These analyses consistently apply to each proteome, stressing different equilibria between selective pressure and mutational bias regarding both LCPs and nLCPs. Overall, the set of these tools indicates that the CUB of LCPs is more strongly influenced by mutational bias than by selective pressure and
that the contribution of the latter tends to decrease with the length of the protein. Optimal codon bias is often attributed to high levels of gene expression (Bulmer, 1991; Sharp & Li, 1986). Furthermore, it is assumed that a shorter protein requires less time to fold (Williamson, 2017) and that it must require a smaller number of molecular chaperones (Lipman et al 2002) for their surveillance function to be performed (Daniyan et al. 2019), with less space reserved for their folding, ultimately avoiding aberrant interactions with other cellular components (Williamson, 2017; Hartl & Hartl, 2002). This would suggest better translation efficiency of nLCPs and higher expression. Duret & Mouchiroud (1999) showed this in C. elegans, where as selective pressure (protein length) decreases (increases) gene expression increases (decreases) as well. However, transcriptomic examinations have shown that gene expression in P. falciparum is substantially different between field isolates and in vitro cultured samples, where this variability is associated with Copy Number Variations (Mackinnon et al. 2009). Other studies indicated that gene expression in P. falciparum varies according to the applied pharmacological stressor (Hu et al. 2009). Moreover, in vitro proteomics data show how the amount of proteins expressed in P. falciparum follows a very precise time course (Foth et al. 2011). A similar trend was also observed in P. vivax (Bozdech et al. 2008). This suggests that the gene expression of these parasites may vary depending on external factors and on the stage of the life cycle in which they are found. Therefore, the association of the two gene categories (nLCPs and LCPs) with a lower or higher gene expression is controversial and not so easily attributable.

SPI and Pr2 violations: Positive and Negative Selective Pressure

The SPI is intrinsically linked to the nature of the Effective Number of Codons (Wright, 1990), and it is suggested that it be used coupled with the improved version of Sun and colleagues (2012) or any other improved version of the ENC such as that of Fuglsang (2006). MT1 revealed the SPI to be in line with the purpose for which it was conceived. Compared to Laverania Plasmodia, Simian spp. show a more heterogeneous codon bias (their ENC values are larger, see Fig. 7). LCPs of Simian Plasmodia show the highest SPI values, underlining that a higher number of codons used does not necessarily identify a lower selective pressure. We agree with Gajbhiye and colleagues (2017) in recognizing a greater selective pressure on the proteins of P. vivax compared to those of P. falciparum, to which the other members of its family are also added. However, the SPI is not able to distinguish between positive and negative pressures, but is only able to establish what the divergence from Wright's Theoretical Curve is, studying how closely the CUB of a gene approximates an extreme bias situation (20 codons for 20 amino acids). Similarly, Pr2 (Sueoka, 1995; Sueoka & Kawanishi, 1999) allows one to understand what the shift from Parity Rule 2 is, returning, like SPI, a departure from a situation of equilibrium. The genomes of malaria parasites suffer, in respect of many genes, an apparent lack of orthologues (Gardner et al. 2002; Hu et al. 2009), which complicates the application of methods such as, e.g., Nei and Gojobori's (1986). Addressing these problems is therefore left to future efforts.

Selective Pressure on LCPs and LCRs abundance

The negative correlation between selective pressure and protein length, particularly in the Laverania Plasmodia, hides information. In fact, looking at the correlation between protein length and LCR
abundance (Tab. 1), it can be observed that proteins with a minor deviation from Wright's Theoretical Curve and with a minor violation of Pr2 contain more Low Complexity Regions. As the complexity of LCRs increases, a functional role is suggested (Zilversmit et al. 2010). However, for instance, contrary to the tendency of long E runs to generate distortions in a polypeptide chain (Karlin et al. 2001), in vitro assays highlight how the areas rich in glutamate (E) in P. falciparum help drive the immune response away from the functional domains of the proteins, thus evading the host's immune system (Hou et al. 2020). Therefore, hypothesising a functional role even for a pure or almost pure amino acid run is possible, and classifying pure LCRs as neutral insertions might not be correct in all circumstances. A hypothesis of interest is that of functional amyloid, which states that many organisms have evolved to take advantage of the potential tendency of some residues to form this type of fibrils (Fowler et al. 2007). A shared opinion is that intracellular parasites possess simplified genomes to adapt to the host (Daniyan et al. 2019). These simplifications are believed to lead to the expression of mutated proteins prone to aggregation (Daniyan et al. 2019). Despite the tendency of asparagine to form amyloid fibrils (Halfmann et al. 2011), it is reported that there is no in vivo evidence of such aggregates in P. falciparum (Muralidharan et al. 2010), which suggests that the parasite has learned to take advantage of, or at least to manage, this amino acid. Fairly recent studies (Halfmann et al. 2011) have clearly highlighted how LCRs rich in asparagine, in yeast, reduce the toxicity of proteins. Given the negative correlation between selective pressure and LCR abundance, it would be interesting to look for a possible concomitance of detrimental polymorphisms and the presence of asparagine, which would suggest a buffer role for these amino acid regions. Overall, these correlations are puzzling and open an exceptionally large number of avenues for new hypotheses and research.

Criticisms, Possible Refinements and Future Addresses

An important point that needs to be made concerns SEG parameters. In fact, standard parameters allow one to identify regions strongly polarized towards a certain species of amino acids whilst also allowing one to find LCRs with a more heterogeneous repertoire of amino acids (Trilla & Albà, 2012). On the other hand, enlarging the window with which SEG is set reduces the number of LCRs that are identified, whereas smaller windows (such as W=6) allow one to observe a larger number of LCRs (Batistuzzi et al. 2016). Comparing the statistics provided by our work with others (Chaudhry et al. 2018), on a qualitative level, the identified amino acids are not starkly different from those of Chaudhry et al. (2018), and the proportions between nLCPs and LCPs are consistent with those found by them (see SM). However, critical sense suggests validating the subjectivity with which SEG is triggered, finding a trade-off between the classical approach with standard parameters and a more refined one, such as that proposed by Batistuzzi and colleagues (2016), by which to identify any artificial classification in the LCP and nLCP stratification. Likewise, it is important to point out that GC content often complicates genome assemblies (Chen et al. 2013). Even though what is reported in the Supplementary Materials confers a certain robustness on our observations, having found similar results for SPI and protein length in other genome assemblies (same species, different data), the risk is not absent. Shannon's Entropy, as used in this work, made us understand important
characteristics regarding the composition of LCRs. However, making this implementation more general and more widely usable is left to future work. Finally, a functional enrichment in GO terms could indicate the phenotype associated with the proteins where the LCRs are inserted, providing more clues from which to infer the directionality of Darwinian Selective Pressure.

Conclusion

Overall, in this study we tackled the proteome analysis of 22 plasmodia, providing important insights into the pervasiveness of Asparagine Low Complexity Regions. Extending the present research to other Apicomplexa would lead to a deepened knowledge of the biology of these parasites, reaching a better understanding of their evolutionary success.

Contribution

C.A. and G.A. conceived the study. C.A. conducted the analyses. C.A. and G.A. wrote the main manuscript. C.A. A.G. and F.S. read and approved the final manuscript.

References

• Arisue, N., Hashimoto, T., Kawai, S. et al. Apicoplast phylogeny reveals the position of Plasmodium vivax basal to the Asian primate malaria parasite clade. Sci Rep 9, 7274 (2019). https://doi.org/10.1038/s41598-019-43831-1
10.1073/pnas.0807404105. Epub 2008 Oct 13. PMID: 18852452; PMCID: PMC2571024.
• Brinkmann B, Klintschar M, Neuhuber F, Hühne J, Rolf B (June 1998). "Mutation rate in human microsatellites: influence of the structure and length of the tandem repeat". American Journal of Human Genetics. 62 (6): 1408-15. doi:10.1086/301869. PMC 1377148. PMID 9585597
• Brown LY, Brown SA. Alanine tracts: the expanding story of human illness and trinucleotide repeats, Trends Genet., 2004, vol. 20 (pg. 51-58)
• Bulmer M. The selection-mutation-drift theory of synonymous codon usage. Genetics. 1991;129(3):897-907
• Carter, R. and Walliker, D. 1975: New observations on the malaria parasites of rodents of the Central African Republic - Plasmodium vinckei petteri subsp. nov. and Plasmodium chabaudi Landau, 1965. Ann. Trop. Med. Parasitol., 69: 187-196
• Chaudhry, S.R., Lwin, N., Phelan, D. et al. Comparative analysis of low complexity regions in Plasmodia. Sci Rep 8, 335 (2018). https://doi.org/10.1038/s41598-017-18695-y
• Chen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS One. 2013;8(4):e62856. Published 2013 Apr 29. doi:10.1371/journal.pone.0062856
• Collins WE, Jeffery GM. Plasmodium malariae: parasite and disease. Clin Microbiol Rev. 2007;20(4):579-592. doi:10.1128/CMR.00027-07
• Collins WE, Jeffery GM. Plasmodium ovale: parasite and disease. Clin Microbiol Rev. 2005;18(3):570-581. doi:10.1128/CMR.18.3.570-581.2005
• Corradetti A.; Garnham P.C.C.; Laird M. (1963). "New classification of the avian malaria parasites". Parassitologia. 5: 1–4.
• Cortés, A., Mellombo, M., Masciantonio, R., Murphy, V.J., Reeder, J.C. and Anders, R.F. (2005) Allele specificity of naturally acquired antibody responses against Plasmodium falciparum apical membrane antigen 1. Infect. Immun. 73, 422–430
• Dalby AR. A comparative proteomic analysis of the simple amino acid repeat distributions in Plasmodia reveals lineage specific amino acid selection. PLoS One. 2009 Jul

1119  Batistuzzi FU, Schneider KA, Spencer MK, Fisher D, 1170 14;4(7):e6231. doi: 10.1371/journal.pone.0006231. PMID:

1120 Chaudhry S, Escalante AA. Profiles of low complexity regions 1171 19597555; PMCID: PMC2705789.

1121 in Apicomplexa. BMC Evol Biol. 2016; 16:47. Published 2016 1172  Daniyan MO, Przyborski JM, Shonhai A. Partners in Mischief:

1122 Feb 29. doi:10.1186/s12862-016-0625-0 1173 Functional Networks of Heat Shock Proteins of Plasmodium

1123  Bozdech Z, Mok S, Hu G, Imwong M, Jaidee A, Russell B, 1174 falciparum and Their Influence on Parasite Virulence.

1124 Ginsburg H, Nosten F, Day NP, White NJ, Carlton JM, Preiser 1175 Biomolecules. 2019;9(7):295. Published 2019 Jul 23.

1125 PR. The transcriptome of Plasmodium vivax reveals divergence 1176 doi:10.3390/biom9070295

1126 and diversity of transcriptional regulation in malaria parasites. 1177  DePristo, M. A., Zilversmit, M. M., & Hartl, D. L. (2006). On

1127 Proc Natl Acad Sci U S A. 2008 Oct 21;105(42):16290-5. doi: 1178 the abundance, amino acid composition, and evolutionary bioRxiv preprint doi: https://doi.org/10.1101/2020.03.14.992107; this version posted January 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1179 dynamics of low-complexity regions in proteins. Gene, 378, 1230 Complexity Regions behave as tRNA sponges to help co-

1180 19–30. doi:10.1016/j.gene.2006.03.023 1231 translational folding of plasmodial proteins. FEBS Letters,

1181  Derrick, B; Toher, D; White, P (2016). "Why Welchs test is 1232 584(2), 448–454. doi:10.1016/j.febslet.2009.11.004

1182 Type I error robust" (PDF). The Quantitative Methods for 1233  Gajbhiye, Shivani & Patra, P.K. & Yadav, Dr. Manoj. (2017).

1183 Psychology. 12 (1): 30–38. doi:10.20982/tqmp.12.1.p030 1234 New insights into the factors affecting synonymous codon

1184  Duret, L., & Mouchiroud, D. (1999). Expression pattern and, 1235 usage in human infecting Plasmodium species. Acta Tropica.

1185 surprisingly, gene length shape codon usage in Caenorhabditis, 1236 176. 10.1016/j.actatropica.2017.07.025.

1186 Drosophila, and Arabidopsis. Proceedings of the National 1237  Galinski M.R., Barnwell J., 2012, Nonhuman primate models

1187 Academy of Sciences, 96(8), 4482– 1238 for human malaria research. In:Abee, C.R. (Ed.), Nonhuman

1188 4487. doi:10.1073/pnas.96.8.4482 1239 Primates In Biomedical Research: Diseases, Academic Press

1189  Ellegren H. Microsatellites: simple sequences with complex 1240 Elsevier, pp 299-323

1190 evolution, Nat Rev Genet., 2004, vol. 5 (pg. 435-445) 1241  Galland G.G., Role of the Squirrel Monkey in Parasitic Disease

1191  Escalante AA, Ayala FJ. Phylogeny of the malarial genus 1242 Research, ILAR Journal, Volume 41, Issue 1, 2000, Pages 37–

1192 Plasmodium derived from rRNA gene sequences. Proc Natl 1243 43, https://doi.org/10.1093/ilar.41.1.37

1193 Acad Sci U S A. 1994;91(24):11373-11377. 1244  Garnham PC. The Subgenera of Plasmodium In Mammals. Ann 1194 doi:10.1073/pnas.91.24.11373 1245 Soc Belges Med Trop Parasitol Mycol. 1964;44:267 271. 1195  Escalante, Ananias & Freeland, Denise & Collins, William & ‐ 1246  Gatchel JR, Zoghbi HY. Diseases of unstable repeat expansion: 1196 Lal, Altaf. (1998). The Evolution of Primate Malaria Parasites 1247 mechanisms and common principles, Nat Rev Genet., 2005, 1197 Based on the Gene Encoding Cytochrome b from the Linear 1248 vol. 6 (pg. 743-755) 1198 Mitochondrial Genome. Proceedings of the National Academy 1249  Gemayel R, Cho J, Boeynaems S, Verstrepen KJ. Beyond junk- 1199 of Sciences of the United States of America. 95. 8124-9. 1250 variable tandem repeats as facilitators of rapid evolution of 1200 10.1073/pnas.95.14.8124. 1251 regulatory and coding sequences. Genes (Basel). 2012 Jul 1201  Faust, Christina & Dobson, Andrew. (2015). Primate malarias: 1252 26;3(3):461-80. doi: 10.3390/genes3030461. PMID: 1202 Diversity, distribution and insights for zoonotic Plasmodium. 1253 24704980; PMCID: PMC3899988. 1203 One Health. 1. 10.1016/j.onehlt.2015.10.001. 1254  Gragg H., Harfe B.D., Jinks-Robertson S. Base composition of 1204  Ferreira, M.U., Ribeiro, W.L., Tonon, A.P., Kawamoto, F. and 1255 mononucleotide runs affects DNA polymerase slippage and 1205 Rich, S.M. (2003) Sequence diversity and evolution of the 1256 removal of frameshift intermediates by mismatch repair in 1206 malaria vaccine candidate merozoite surface protein-1 (MSP-1) 1257 Saccharomyces cerevisiae. Mol. Cell. Biol. 2002;22:8756– 1207 of Plasmodium falciparum. Gene 30, 65–75 1258 8762. 1208  Filisetti, D., Th´eobald-Dietrich, A., Mahmoudi, N., Rudinger- 1259  Green H, Wang N. Codon reiteration and the evolution of 1209 Thirion, J., Candolfi, E., Frugier, M. (2013). Aminoacylation of 1260 proteins. Proc Natl Acad Sci U S A. 1994;91(10):4298–4302. 
1210 Plasmodium falciparum tRNA Asn and Insights in the 1261 doi:10.1073/pnas.91.10.4298 1211 Synthesis of Asparagine Repeats. Journal of Biological 1262  Guy-Franck R., Kerrest A, Dujon B. Comparative genomics 1212 Chemistry, 288(51), 36361–36371. doi:10.1074/jbc.m113 1263 and molecular dynamics of DNA repeats in . 1213 .522896 1264 Microbiol Mol Biol Rev. 2008;72(4):686–727. 1214  Forcelloni, S., Giansanti, A. Evolutionary Forces and Codon 1265 doi:10.1128/MMBR.00011-08 15 1215 Bias in Different Flavors of Intrinsic Disorder in the Human 1266  Halfmann R, Alberti S, Krishnan R, Lyle N, O’Donnell CW, et 1216 Proteome. J Mol Evol (2019) doi:10.1007/s00239-019-09921- 1267 al. (2011) Opposing effects of glutamine and asparagine govern 1217 4 Fowler 1268 prion formation by intrinsically disordered proteins. Mol Cell 1218  Forsdyke D., Evolutionary Bioinformatics 2016, Springer Ed. 1269 43: 72–84. 1219  Foth BJ, Zhang N, Chaal BK, Sze SK, Preiser PR, Bozdech Z. 1270  Hamilton WL, Claessens A, Otto TD, Kekre M, Fairhurst RM, 1220 Quantitative time-course profiling of parasite and host cell 1271 Rayner JC, Kwiatkowski D. Extreme mutation bias and high 1221 proteins in the human malaria parasite Plasmodium 1272 AT content in Plasmodium falciparum. Nucleic Acids Res. 1222 falciparum. Mol Cell Proteomics. 2011;10(8):M110.006411. 1273 2017 Feb 28;45(4):1889-1901. doi: 10.1093/nar/gkw1259. 1223 doi:10.1074/mcp.M110.006411 1274 PMID: 27994033; PMCID: PMC5389722. 1224  Fowler, D. M., Koulov, A. V., Balch, W. E., & Kelly, J. W. 1275  Hartl FU, Hayer-Hartl M. Molecular chaperones in the cytosol: 1225 (2007). Functional amyloid – from bacteria to humans. Trends 1276 from nascent chain to folded protein. Science. 2002; 295:1852– 1226 in Biochemical Sciences, 32(5), 217– 1277 1858. doi: 10.1126/science.1068408. 1227 224. doi:10.1016/j.tibs.2007.03.003 1278  Hayakawa, T., Culleton, R., Otani, H., Horii, T., & Tanabe, K. 1228  Frugier, M., Bour, T., Ayach, M., Santos, M. A. 
S., Rudinger- 1279 (2008). Big Bang in the Evolution of Extant Malaria Parasites. 1229 Thirion, J., Th´eobald-Dietrich, A., Pizzi, E. (2009). Low bioRxiv preprint doi: https://doi.org/10.1101/2020.03.14.992107; this version posted January 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1280 Molecular Biology and Evolution, 25(10), 2233–2239. 1331 substitutions., Molecular Biology and Evolution, Volume 3,

1281 doi:10.1093/molbev/msn171 1332 Issue 5, September 1986, Pages 418–

1282  Howes RE, Battle KE, Mendis KN, et al. Global Epidemiology 1333 426, https://doi.org/10.1093/oxfordjournals.molbev.a040410

1283 of Plasmodium vivax. Am J Trop Med Hyg. 2016;95(6 1334  M.J. Gardner, N. Hall, E. Fung, O. White, M. Berriman, R.W.

1284 Suppl):15‐34. doi:10.4269/ajtmh.16-0141 1335 Hyman, et al., Genome sequence of the human malaria parasite

1285  Hu G, Cabrera A, Kono M, Mok S, Chaal BK, Haase S, 1336 Plasmodium falciparum, Nature 419 (2002) 498–511,

1286 Engelberg K, Cheemadan S, Spielmann T, Preiser PR, 1337 http://dx.doi.org/10.1038/nature01097.

1287 Gilberger TW, Bozdech Z. Transcriptional profiling of growth 1338  M.M. Alba, R. Guigo, Comparative analysis of amino acid

1288 perturbations of the human malaria parasite Plasmodium 1339 repeats in rodents and humans, Genome Res. 14 (2004) 549–

1289 falciparum. Nat Biotechnol. 2010 Jan;28(1):91-8. doi: 1340 554

1290 10.1038/nbt.1597. Epub 2009 Dec 27. PMID: 20037583. 1341  Mackinnon MJ, Li J, Mok S, Kortok MM, Marsh K, et al.

1291  James R.K.; Ray, C. George, eds. (2004). Sherris medical 1342 (2009) Comparative Transcriptional and Genomic Analysis

1292 microbiology: an introduction to infectious diseases (4th ed.). 1343 of Plasmodium falciparum Field Isolates. PLOS Pathogens

1293 McGraw-Hill Professional Med/Tech. ISBN 978-0-8385-8529- 1344 5(10): e1000644. https://doi.org/10.1371/journal.ppat.1000644

1294 0 1345  McCutchan TF, Kissinger JC, Touray MG, Rogers MJ, Li J,

1295  John Federhen, Scott. (1993). Statistics of Local Complexity in 1346 Sullivan M, Braga EM, Krettli AU, Miller LH. Comparison of

1296 Amino Acid Sequences and Sequence Databases. Computers 1347 circumsporozoite proteins from avian and mammalian

1297 Chemistry. 17. 149- 163. 10.1016/0097-8485(93)85006-X. 1348 malarias: biological and phylogenetic implications. Proc Natl

1298 Xiao H, Jeang KT. Glutamine-rich domains activate 1349 Acad Sci USA. 1996;93:11889–11894.

1299 transcription in yeast Saccharomyces cerevisiae, J Biol Chem. , 1350  Mitsui H., Arisue N, Sakihama N, Inagaki Y, Horii T,

1300 1998, vol. 273 (pg. 22873-22876) 1351 Hasegawa M, Tanabe K, Hashimoto T., Phylogeny of Asian

1301  Kokoska, R.J.; Stefanovic, L.; Tran, H.T.; Resnick, M.A.; 1352 primate malaria parasites inferred from apicoplast genome-

1302 Gordenin, D.A.; Petes, T.D. Destabilization of yeast micro- and 1353 encoded genes with special emphasis on the positions of

1303 minisatellite DNA sequences by mutations affecting a nuclease 1354 Plasmodium vivax and P. fragile, Gene. 2010 Jan 15;450(1-

1304 involved in Okazaki fragment processing (rad27) and DNA 1355 2):32-8. doi: 10.1016/j.gene.2009.10.001

1305 polymerase delta (pol3-t). Mol. Cell. Biol. 1998, 18, 1356  Mosbach V, Poggi L, Richard GF. Trinucleotide repeat

1306 2779±2788 1357 instability during double-strand break repair: from mechanisms

1307  Kristan M, Thorburn SG, Hafalla JC, Sutherland CJ, Oguike 1358 to gene therapy. Curr Genet. 2019;65(1):17-28.

1308 MC. Mosquito and human hepatocyte with 1359 doi:10.1007/s00294-018-0865-1.

1309 Plasmodium ovale curtisi and Plasmodium ovale 1360  Mularoni L, Veitia RA, Albà MM. Highly constrained proteins

1310 wallikeri. Trans R Soc Trop Med Hyg. 2019;113(10):617‐622. 1361 contain an unexpectedly large number of amino acid tandem

1311 doi:10.1093/trstmh/trz048 1362 repeats. Genomics. 2007 Mar; 89(3):316-25. doi:

1312  Kumari, B., Kumar, R., Kumar, M. (2015). Low complexity 1363 10.1016/j.ygeno.2006.11.011. Epub 2006 Dec 28. PMID:

1313 and disordered regions of proteins have different structural and 1364 17196365.

1314 amino acid preferences. Molecular BioSystems, 11(2), 585– 1365  Muralidharan V, Goldberg DE. Asparagine repeats in

1315 594. doi:10.1039/c4mb00425f 1366 Plasmodium falciparum proteins: good for nothing? PLoS

1316  Levinson G, Gutman G. A, Slipped-strand mispairing: a major 1367 Pathog. 2013;9(8):e1003488. doi:10

1317 mechanism for DNA sequence evolution. (1987). Molecular 1368 .1371/journal.ppat.1003488

1318 Biology and 1369  Muralidharan V, Oksman A, Iwamoto M, Wandless TJ,

1319 Evolution. doi:10.1093/oxfordjournals.molbev.a040442 1370 Goldberg DE. Asparagine repeat function in a Plasmodium

1320  Lipman DJ, Souvorov A, Koonin EV, Panchenko AR, Tatusova 1371 falciparum protein assessed via a regulatable fluorescent

1321 TA. The relationship of protein conservation and sequence 1372 affinity tag. Proc Natl Acad Sci U S A. 2011 Mar

1322 length. BMC Evol Biol. 2002; 2:20. doi: 10.1186/1471-2148- 1373 15;108(11):4411-6. doi: 10.1073/pnas.1018449108. Epub 2011

1323 2-2 1374 Feb 28. PMID: 21368162; PMCID: PMC3060247.

1324  Liu W, Sundararaman SA, Loy DE, et al. Multigenomic 1375  Muralidharan V, Oksman A, Pal P, Lindquist S, Goldberg DE.

1325 Delineation of Plasmodium Species of the Laverania Subgenus 1376 Plasmodium falciparum heat shock protein 110 stabilizes the

1326 Infecting Wild-Living Chimpanzees and Gorillas. Genome Biol 1377 asparagine repeat-rich parasite proteome during malarial

1327 Evol. 2016;8(6):1929 1939. Published 2016 Jul 2. 1378 fevers. Nat Commun. 2012; 3: 1310. doi: ‐ 1328 doi:10.1093/gbe/evw128 1379 10.1038/ncomms2306

1329  M Nei, T Gojobori, Simple methods for estimating the numbers 1380  N.G. Faux, S.P. Bottomley, A.M. Lesk, J.A. Irving, J.R.

1330 of synonymous and nonsynonymous nucleotide 1381 Morrison, M.G. de la Banda, J.C. Whisstock, Functional bioRxiv preprint doi: https://doi.org/10.1101/2020.03.14.992107; this version posted January 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

1382 insights from the distribution and role of homopeptide repeat- 1433  Salichs E, Ledda A, Mularoni L, Alba MM, de la Luna S. 2009.

1383 containing proteins, Genome Res. 15 (2005) 537–551. 1434 Genome-wide analysis of histidine repeats reveals their role in

1384  Ngai, M.Y.; Saitou, N. (2016). The effect of perfection status 1435 the localization of human proteins to the nuclear speckles

1385 on mutation rates of microsatellites in primates. 1436 compartment. PLoS Genet. 5:e1000397

1386 Anthropological Science, 124(2), 85– 1437  Shannon, Claude E. (July 1948). "A Mathematical Theory of

1387 92. doi:10.1537/ase.160124 1438 Communication". Bell System Technical Journal. 27 (3): 379-

1388  Novembre JA (2002) Accounting for background nucleotide 1439 423. doi: 10.1002 / j.1538-7305.1948.tb01338.x. hdl:

1389 composition when measuring codon usage bias. Mol Biol Evol 1440 10338.dmlcz / 101429.

1390 19:1390–1394 1441  Sharp PM, Li WH. The codon Adaptation Index--a measure of

1391  Otto TD, Gilabert A, Crellen T, et al. Genomes of all known 1442 directional synonymous codon usage bias, and its potential

1392 members of a Plasmodium subgenus reveal paths to virulent 1443 applications. Nucleic Acids

1393 human malaria. Nat Microbiol. 2018;3(6):687–697. 1444 Res. 1987;15(3):1281D1295. doi:10.1093/nar/15.3.1281

1394 doi:10.1038/s41564-018-0162-2 1445  Sharp PM, Li WH., 1986 An evolutionary perspective on

1395  Pizzi E., Frontali C. (2001). Low-Complexity Regions in 1446 synonymous codon usage in unicellular organisms. J. Mol.

1396 Plasmodium falciparum Proteins. Genome Research, 11(2), 1447  Shen H, Kan JL, Green MR. 2004. Arginine-serine-rich

1397 218–229. doi:10.1101/gr.gr-1522r 1448 domains bound at splicing enhancers contact the branchpoint to

1398  Prugnolle F, Durand P, Neel C, et al. African great apes are 1449 promote prespliceosome assembly. Mol Cell. 13:367–376.

1399 natural hosts of multiple related malaria species, including 1450  Silva JC, Egan A, Friedman R, Munro JB, Carlton JM, Hughes

1400 Plasmodium falciparum. Proc Natl Acad Sci U S A. 1451 AL. Genome sequences reveal divergence times of malaria

1401 2010;107(4):1458 1463. doi:10.1073/pnas.0914440107 1452 parasite lineages. Parasitology. 2011;138(13):1737-1749. ‐ 1402  Radó-Trilla, N., Albà, M. Dissecting the role of low-complexity 1453 doi:10.1017/S0031182010001575

1403 regions in the evolution of vertebrate proteins. BMC Evol 1454  Singh B, Daneshvar C. Human infections and detection of

1404 Biol 12, 155 (2012). https://doi.org/10.1186/1471-2148-12- 1455 Plasmodium knowlesi. Clin Microbiol Rev. 2013;26(2):165 ‐ 1405 155 1456 184. doi:10.1128/CMR.00079-12

1406  Ramiro, R.S., Reece, S.E. & Obbard, D.J. Molecular evolution 1457  Strachan T., Goodship J., Chinnery P, Genetics and Genomics

1407 and phylogenetics of rodent malaria parasites. BMC Evol Biol 1458 in Medicine, Garland Science

1408 12, 219 (2012). https://doi.org/10.1186/1471-2148-12-219 1459  Sueoka N, Kawanishi Y. DNA G+C content of the third codon

1409  Rich SM, Leendertz FH, Xu G, et al. The origin of malignant 1460 position and codon usage biases of human genes. Gene. 2000

1410 malaria. Proc Natl Acad Sci U S A. 2009;106(35):14902- 1461 Dec 30;261(1):53-62. doi: 10.1016/s0378-1119(00)00480-7.

1411 14907. doi:10.1073/pnas.0907740106 1462 PMID: 11164037.

1412  Rich SM, Xu G. Resolving the phylogeny of malaria parasites. 1463  Sueoka, N. Intrastrand parity rules of DNA base composition

1413 Proc Natl Acad Sci U S A. 2011;108(32):1297312974. 1464 and usage biases of synonymous codons. J Mol Evol 40, 318–

1414 doi:10.1073/pnas.1110141108 1465 325 (1995). https://doi.org/10.1007/BF00163236

1415  Richard, G.F.; Paques, F. Mini- and microsatellite expansions: 1466  Sun X, Yang Q, Xia X. An improved implementation of

1416 The recombination connection. EMBO Rep. 2000, 1, 122±126. 1467 effective number of codons (nc). Mol Biol Evol.

1417  Ruxton, G. D. (2006). "The unequal variance t-test is an 1468 2013;30(1):191 196. doi:10.1093/molbev/mss201 ‐ 1418 underused alternative to Student's t-test and the Mann–Whitney 1469  Tachibana, S.-I., Sullivan, S. A., Kawai, S., Nakamura, S., Kim,

1419 U test". Behavioral Ecology. 17 (4): 688–690. 1470 H. R., Goto, N., Tanabe, K. (2012). Plasmodium cynomolgi

1420 doi:10.1093/beheco/ark016 1471 genome sequences provide insight into Plasmodium vivax and

1421  Ryan, Kenneth James; Ray, C. George, eds. (2004). Sherris 1472 the monkey malaria clade. Nature Genetics, 44(9), 1051–1055.

1422 medical microbiology: an introduction to infectious diseases 1473 doi:10.1038/ng.2375

1423 (4th ed.). McGraw-Hill Professional Med/Tech. ISBN 978-0- 1474  Valkiunas G., Avian Malaria Parasites and Other

1424 8385-8529-0 1475 Haemosporidia, CRC Press, 2000 NW Corporate Blvd., Boca

1425  S. Karlin, L. Brocchieri, A. Bergman, J. Mrazek, A.J. Gentles, 1476 Raton, Florida, 33431 USA. 2005. 932 pp. ISBN 0- 415-30097-

1426 Amino acid runs in eukaryotic proteomes and disease 1477 5.

1427 associations, Proc. Natl. Acad. Sci. USA 99 (2002) 333–338 1478  Verstrepen K.J., Jansen A., Lewitter F., Fink G.R. Intragenic

1428  S.J. Newfeld, H. Tachida, B. Yedvobnick, Drive-selection 1479 tandem repeats generate functional variability. Nat.

1429 equilibrium: homopolymer evolution in the Drosophila gene 1480 Genet. 2005;37:986–990

1430 mastermind, J. Mol. Evol. 38 (1994) 637–641. 1481  Vogel G (November 2013). "The forgotten malaria". Science. nd 1431  Saitou N., Introduction to Evolutionary Genomics, Springer 2 1482 342(6159): 684–7. doi:10.1126/science.342.6159.684

1432 Ed., pg 45-48 (2018) bioRxiv preprint doi: https://doi.org/10.1101/2020.03.14.992107; this version posted January 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

Ward, J. H., Jr. (1963), "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, 58, 236–244.

Waters AP, Higgins DG, McCutchan TF. Evolutionary relatedness of some primate models of Plasmodium. Mol Biol Evol. 1993b; 10: 914–923.

Waters AP, Higgins DG, McCutchan TF. The Phylogeny of malaria: a useful study. Parasitol Today. 1993a;9:246–250

Waters, A. P., D. G. Higgins, and T. F. McCutchan. 1991. Plasmodium falciparum appears to have arisen as a result of lateral transfer between avian and human hosts. Proceedings of the National Academy of Sciences USA 88: 3140-3144

Williamson M., How Proteins Work, Garland Science

Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996; 266:554-571. doi:10.1016/s0076-6879(96)66035-2

Wright F. The 'effective number of codons' used in a gene. Gene. 1990;87(1):23-29. doi:10.1016/0378-1119(90)90491-9

Xue, H.Y. and Forsdyke, D.R. (2003) Low-complexity segments in Plasmodium falciparum proteins are primarily nucleic acid level adaptations. Mol. Biochem. Parasitol. 128, 21–32.

Xuhua Xia, Bioinformatics and the Cell. Modern Computational Approaches in Genomics, Proteomics and Transcriptomics Springer Ed.

Yadav MK, Swati D. Comparative genome analysis of six malarial parasites using codon usage bias based tools. Bioinformation. 2012;8(24):1230-1239. doi:10.6026/97320630081230

Zhang, F., Gu, W., Hurles, M. E., & Lupski, J. R. (2009). Copy Number Variation in Human Health, Disease, and Evolution. Annual Review of Genomics and Human Genetics, 10(1), 451–481. doi: 10.1146/annurev.genom.9.081307.164217

Zilversmit MM, Volkman SK, DePristo MA, Wirth DF, Awadalla P, Hartl DL. Low-complexity regions in Plasmodium falciparum: missing links in the evolution of an extreme genome. Mol Biol Evol. 2010;27(9):2198-2209. doi:10.1093/molbev/msq108

Fig. 1 Heatmap of the RSCU values of the Low Complexity Regions. Codons are placed on the columns. Plasmodia on the rows. Data is standardized along the columns. Ward's Linkage was used. Row and column pairwise distance was calculated with Standardized Euclidean Distance. Data were standardized along rows. The MATLAB clustergram function was used.
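As a hedged illustration of the quantity plotted in Fig. 1, the sketch below computes RSCU for a single synonymous codon family in Python (the figure itself was produced with MATLAB). It assumes the usual definition, namely the observed count of a codon divided by the mean count over its synonymous family. The function name `rscu`, the toy sequence and the restriction to the two asparagine codons are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def rscu(cds, family):
    """Relative Synonymous Codon Usage for one synonymous codon family.

    Assumes the usual definition: observed count of a codon divided by the
    mean count over all codons of the same family.
    """
    # Split the in-frame coding sequence into codons (trailing bases ignored).
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in family)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in family}
    mean_count = total / len(family)
    return {codon: counts[codon] / mean_count for codon in family}


# Hypothetical toy LCR: four asparagine codons, three AAT and one AAC.
asn_rscu = rscu("AATAATAACAAT", ("AAT", "AAC"))
print(asn_rscu)
```

Values above 1 indicate a codon used more often than expected under uniform synonymous usage, values below 1 a codon used less often. A full analysis would repeat this over every synonymous family before standardizing and clustering.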


Fig. 2 (A) Regression of Entropy vs LCR length. The length of LCRs is measured in nucleotides. Each group was fitted with the model f(x) = a*exp(b*x) + c*exp(d*x); model parameters are provided with their 95% confidence bounds.
Laverania: a = 1.918 (1.885, 1.951), b = -0.0003328 (-0.0004543, -0.0002113), c = -5.435 (-5.793, -5.078), d = -0.05285 (-0.05514, -0.05056)
Simian: a = 2.41 (2.368, 2.452), b = -0.0002985 (-0.0004451, -0.0001519), c = -39.41 (-47.79, -31.02), d = -0.1012 (-0.1077, -0.09478)
Vinckeia: a = -1.006e+04 (-4.838e+12, 4.838e+12), b = 0.002788 (-52.56, 52.57), c = 1.006e+04 (-4.838e+12, 4.838e+12), d = 0.002788 (-52.55, 52.56)
Haemamoeba: a = -2.832 (-3.196, -2.469), b = -0.04565 (-0.05072, -0.04057), c = 2.018 (1.956, 2.08), d = -0.0003467 (-0.0004884, -0.000205)
HIPs: a = -9.49 (-10.81, -8.165), b = -0.06278 (-0.0674, -0.05817), c = 2.357 (2.291, 2.423), d = -0.0006497 (-0.0008514, -0.0004479)
(B) Illustration of the average complexity of each Low Complexity Region. The mean of each distribution is represented by a data point of the same colour as the outline of the violin plot. The trend of the means and medians is represented by the yellow and red lines respectively. The Violin Plot function was taken from GitHub-Matlab. (C) Trend of the sample averages of the distributions in Fig. 2 (B) with respect to the average Guanine and Cytosine content of each parasite.
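To make the fitted model in Fig. 2 (A) easier to read, the following minimal Python sketch evaluates the two-term exponential f(x) = a*exp(b*x) + c*exp(d*x) with the Laverania point estimates quoted in the caption. The function name `two_term_exp` and the chosen LCR lengths are illustrative assumptions; the confidence bounds are not used here.

```python
import math


def two_term_exp(x, a, b, c, d):
    """Two-term exponential model f(x) = a*exp(b*x) + c*exp(d*x)."""
    return a * math.exp(b * x) + c * math.exp(d * x)


# Point estimates for the Laverania fit as reported in the caption.
LAVERANIA = dict(a=1.918, b=-0.0003328, c=-5.435, d=-0.05285)

# Predicted entropy at a few LCR lengths (in nucleotides).
for length in (30, 100, 300, 1000):
    print(length, round(two_term_exp(length, **LAVERANIA), 3))
```

With these values the fast term (c, d) vanishes for long LCRs, so the curve rises quickly and then declines slowly at the rate set by b. The Vinckeia bounds reported in the caption span roughly ±4.8e12, which suggests that fit is poorly constrained and should be read with caution.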


Fig. 3 Percentage representation of the CUB in the various parasitic subgroups. The amount of each codon was divided by the total amount of residues present in the LCRs of each parasite. (A) Laverania Subgenus (B) Simian Plasmodia (C) Haemamoeba Subgenus (D) Subgenus Vinckeia (E) HIPs. Each peak represents half of the bar it refers to.

Fig. 4 (A) Relationship between the length of the LCRs and the length of the host protein. (B) Pairwise comparison of LCPs and nLCPs in each parasite. (C) Pairwise comparison of LCPs, deprived of LCRs, and nLCPs. The LCPs are represented in red. The nLCPs are represented in green.

Fig. 5 Illustration of ENC plots for the Laverania Subgenus: (1) P. reichenowi (2) P. gaboni (3) P. adleri (4) P. praefalciparum (5) P. falciparum; Human Infectious Plasmodia: (1) P. malariae (2) P. ovale (3) P. o. curtisi; Vinckeia Subgenus: (1) P. berghei (2) P. yoelii (3) P. petteri (4) P. chabaudi (5) P. vinckei; Haemamoeba Subgenus: (1) P. gallinaceum (2) P. relictum. Globally, ENC distributions cluster in the left region of the ENC plane. LCPs are represented in red, nLCPs in green.


Fig. 6 Illustration of the ENC plots of Simian Plasmodia. We used the model f(x) = p1*x^2 + p2*x + p3. Best fit parameters (BFPs) are provided with their 95% confidence bounds. LCPs are shown in red and nLCPs in green; the best fit for nLCPs is in orange and the best fit for LCPs in light green.
(A) P. vivax: nLCPs' BFPs: p1 = -160.3 (-164.3, -156.3), p2 = 146.5 (143.3, 149.7), p3 = 19.7 (19.13, 20.27); LCPs' BFPs: p1 = -197.6 (-211.5, -183.7), p2 = 188.5 (175.5, 201.4), p3 = 10.18 (7.227, 13.14)
(B) P. knowlesi: nLCPs' BFPs: p1 = -334.7 (-352.9, -316.5), p2 = 289.9 (274.4, 305.4), p3 = -7.969 (-11.24, -4.701); LCPs' BFPs: p1 = -298.4 (-351, -245.8), p2 = 272.7 (232.2, 313.2), p3 = -5.609 (-13.32, 2.101)
(C) P. cynomolgi: nLCPs' BFPs: p1 = -207 (-214.2, -199.8), p2 = 185.1 (179.6, 190.7), p3 = 12.28 (11.23, 13.32); LCPs' BFPs: p1 = -259 (-291.5, -226.5), p2 = 245.9 (217.5, 274.3), p3 = -1.82 (-8.019, 4.379)
(D) P. inui: nLCPs' BFPs: p1 = -268.7 (-284.1, -253.3), p2 = 253.2 (239.1, 267.3), p3 = -5.331 (-8.551, -2.111); LCPs' BFPs: p1 = -320.5 (-363.3, -277.8), p2 = 310.3 (271.1, 349.6), p3 = -18.31 (-27.29, -9.327)
(E) P. fragile: nLCPs' BFPs: p1 = -188.2 (-197.4, -179), p2 = 165.9 (158.1, 173.6), p3 = 17.42 (15.75, 19.09); LCPs' BFPs: p1 = -416.8 (-476.9, -356.6), p2 = 379.3 (327.3, 431.2), p3 = -29.75 (-40.93, -18.57)
(F) P. coatneyi: nLCPs' BFPs: p1 = -280.9 (-293.3, -268.6), p2 = 254.8 (244, 265.6), p3 = -3.065 (-5.406, -0.724); LCPs' BFPs: p1 = -246.7 (-289.4, -203.9), p2 = 234 (198.5, 269.5), p3 = 0.6524 (-6.597, 7.902)
(G) P. gonderi
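The quadratic fits of Fig. 6 can be compared with the neutral expectation of an ENC plot. The sketch below, in Python, evaluates the caption's P. vivax nLCP fit next to Wright's (1990) expected curve, ENC = 2 + s + 29 / (s^2 + (1 - s)^2). It assumes that x in the caption's model is the GC3 fraction in [0, 1], which the caption does not state explicitly; the function names are illustrative.

```python
def wright_expected_enc(s):
    """Wright's (1990) expected ENC under no selection, s = GC3 fraction."""
    return 2.0 + s + 29.0 / (s ** 2 + (1.0 - s) ** 2)


def quadratic(x, p1, p2, p3):
    """Caption model f(x) = p1*x^2 + p2*x + p3."""
    return p1 * x ** 2 + p2 * x + p3


# Point estimates for the P. vivax nLCP fit as reported in the caption.
VIVAX_NLCP = (-160.3, 146.5, 19.7)

# Assumption: x is the GC3 fraction; compare fit and neutral expectation.
for gc3 in (0.2, 0.35, 0.5):
    fitted = quadratic(gc3, *VIVAX_NLCP)
    expected = wright_expected_enc(gc3)
    print(f"GC3={gc3:.2f} fitted ENC={fitted:.2f} Wright curve={expected:.2f}")
```

Under this reading, a fitted curve lying below Wright's curve at a given GC3 means that the observed codon usage is more biased than mutational composition alone would predict.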


Fig. 7 Pairwise (Welch t-test) comparison of the ENC distributions of LCPs and nLCPs in each parasite. Consistent with what was done in the previous graphs, the LCPs are represented in red. Similarly, nLCPs are represented in green. The medians of each boxplot pair differ with 95% statistical significance. (Mathworks).

Fig. 8 Pairwise (Welch t-test) comparison of the SPI distributions of LCPs and nLCPs in each parasite. LCPs are represented in red, nLCPs in green. The medians of each boxplot pair differ with 95% statistical significance. (Mathworks).
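The Welch t-test used for Figs. 7 and 8 can be sketched in a few lines of Python. The example below computes only the t statistic and the Welch-Satterthwaite degrees of freedom using the standard library; the function name `welch_t` and the two small samples are hypothetical and do not come from the paper's data.

```python
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-sample squared standard errors; variance() is the unbiased estimator.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / (se2_a + se2_b) ** 0.5
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical ENC values for two small groups of proteins (not real data).
lcp_enc = [31.0, 32.0, 33.0, 34.0, 35.0]
nlcp_enc = [32.0, 34.0, 36.0, 38.0, 40.0]
t_stat, dof = welch_t(lcp_enc, nlcp_enc)
print(t_stat, dof)
```

Note that Welch's test compares the means of two groups without assuming equal variances. Turning t and the degrees of freedom into a p-value requires the Student t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.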

Fig. 9 Illustration of the SPI vs length analysis performed with Laverania parasites: (A) P.falciparum, (B) P.reichnowi, (C) P.praefalciparum, (D) P.alderi, (E) P.gaboni. LCPs are shown in red and nLCPs in green. We fitted a two-term exponential model, a*exp(b*x) + c*exp(d*x). For each panel, the best fit parameters (BFPs) a, b, c and d of LCPs and nLCPs are provided with their 95% confidence bounds.


Fig. 10 Pr2 plots. (A, B, C, D, E) Laverania Plasmodia; (F, G, H, I, L, M, N) Simian Plasmodia; (O, P) Haemamoeba Plasmodia; (Q, R, S, T, U) Vinckeia Plasmodia; (V, Y, Z) Human Infecting Plasmodia. The distance of a protein coding sequence (CDS) from the Pr2 center gives the extent to which Parity Rule 2 is violated. The farther a CDS lies from the center, the more of its codon usage bias (CUB) can be attributed to selective pressure.
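The distance from the Pr2 center can be illustrated with a minimal sketch. It assumes the usual Pr2-plot coordinates, G3/(G3+C3) on the x axis and A3/(A3+T3) on the y axis, computed at third codon positions, and a Euclidean distance from the center (0.5, 0.5). The function names, the use of all codons rather than a restricted codon set, and the toy sequence are assumptions made for illustration and are not taken from the analysis behind the figure.

```python
import math


def pr2_coordinates(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) from the third codon positions."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    third = cds[2:usable:3]           # every third base: codon position 3
    counts = {base: third.count(base) for base in "ACGT"}
    gc3 = counts["G"] + counts["C"]
    at3 = counts["A"] + counts["T"]
    if gc3 == 0 or at3 == 0:
        raise ValueError("sequence lacks G/C or A/T at third codon positions")
    return counts["G"] / gc3, counts["A"] / at3


def pr2_violation(cds):
    """Euclidean distance of a CDS from the Pr2 center (0.5, 0.5)."""
    x, y = pr2_coordinates(cds)
    return math.hypot(x - 0.5, y - 0.5)


# Toy CDS (hypothetical): third positions are G, A, T, C, G, A.
toy_cds = "ATGGCAGCTGCCGCGTAA"
print(pr2_coordinates(toy_cds))
print(round(pr2_violation(toy_cds), 4))
```

A sequence with G3 = C3 and A3 = T3 sits exactly at the center and returns a distance of zero.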


Fig. 11 Pairwise comparison between Pr2 Violation distributions of LCPs and nLCPs represented in red and green, respectively. Medians in each pair differ with 95% confidence (Mathworks).

Fig. 12 Correlation between Pr2 data and protein length. The Pr2 plot is represented in the (X,Y) plane, while protein length (nucleotides) is plotted on the vertical (Z) axis.


Tab.1 The first column collects, for each of the 22 organisms, the Pearson correlation coefficient between the length of the proteins and the amount of LCR present in them. The second column collects the average guanine and cytosine (GC) content of each parasite. The third column collects the number of low complexity residues present within each organism as identified by SEG.

Organism          Pearson correlation coefficient   GC content   LC content (amino-acid count)

P.falciparum      r = 0.70                          0.23         130020
P.praefalciparum  r = 0.71                          0.23         125491
P.reichnowi       r = 0.71                          0.23         133878
P.alderi          r = 0.67                          0.22         105572
P.gaboni          r = 0.70                          0.22         102893
P.vivax           r = 0.35                          0.48         28753
P.knowlesi        r = 0.26                          0.40         19693
P.coatney         r = 0.32                          0.44         15273
P.cynomolgi       r = 0.42                          0.42         26101
P.inui            r = 0.26                          0.44         16109
P.fragile         r = 0.22                          0.44         19378
P.gonderi         r = 0.39                          0.26         31857
P.yoelii          r = 0.33                          0.24         29765
P.berghei         r = 0.38                          0.24         28753
P.petteri         r = 0.21                          0.27         11737
P.vinckei         r = 0.16                          0.26         11324
P.chabaudi        r = 0.26                          0.28         11265
P.gallinaceum     r = 0.46                          0.21         49707
P.relictum        r = 0.37                          0.21         39707
P.malariae        r = 0.45                          0.30         31876
P.ovale           r = 0.43                          0.36         43101
P.o.curtisi       r = 0.30                          0.32         32005
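The Pearson coefficient between protein length and LCR content can be computed as in the following minimal sketch. The function name `pearson_r` and the protein lengths and LCR residue counts are hypothetical, invented only to show the calculation; they are not data from this table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (dev_x * dev_y)


# Hypothetical per-protein values: length (residues) and LCR residue count.
protein_lengths = [250, 480, 900, 1500, 3100]
lcr_residues = [0, 12, 40, 35, 210]
r = pearson_r(protein_lengths, lcr_residues)
print(f"r = {r:.2f}")
```

Applied to each proteome, one such coefficient per organism would populate the first column of a table like Tab.1.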


Tab.2 Correlation coefficients between content of each protein and their ENC scores.

Organism nLCP LCP nLCPs-slope LCPs-slope

P.falciparum r = 0.56 r = 0.65 43.9 55.4

P.praefalciparum r = 0.57 r = 0.62 46.8 51.5

P.reichnowi r = 0.62 r = 0.62 49.2 52.9

P.alderi r = 0.61 r = 0.60 47.4 48.1

P.gaboni r = 0.59 r = 0.53 44.2 41.7

P.yoelii r = 0.54 r = 0.52 23.4 44.5

P.berghei r = 0.57 r = 0.62 43.3 51.5

P.petteri r = 0.54 r = 0.47 41.4 41.5

P.vinckei r = 0.57 r = 0.54 41.9 43.8

P.chabaudi r = 0.60 r = 0.51 43.9 39.3

P.gallinaceum r = 0.42 r = 0.43 30.8 37.8

P.relictum r = 0.41 r = 0.41 37.7 44.5

P.malariae r = 0.68 r = 0.72 50.4 63.5

Tab. 3 ENC vs values. Simian Plasmodia

Organism nLCP LCP

P.vivax = 0.67 = 0.50

P.knowlesi = 0.23 = 0.47

P.inui = 0.20 = 0.35

P.cynomolgi = 0.56 = 0.34

P.coatney = 0.32 = 0.42
