bioRxiv preprint doi: https://doi.org/10.1101/2020.03.14.992107; this version posted January 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
Title page
Article title: Evolutionary Pressures and Codon Bias in Low Complexity Regions of Plasmodia Parasites
Authors: Andrea Cappannini (1,*), Sergio Forcelloni (3,*), Andrea Giansanti (1,2)
Affiliation: (1) Sapienza University of Rome, Department of Physics, P.le A. Moro 5, 00185 Roma, Italy. (2) Istituto Nazionale di Fisica Nucleare, INFN, Roma1 section, 00185 Roma, Italy. (3) Max Planck Institute of Biochemistry, Munich.
Corresponding Author: Andrea Cappannini : [email protected] Sapienza University of Rome, Department of Physics, P.le A. Moro 5, 00185 Roma, Italy.
Abstract

One of the most debated topics in Evolutionary Biology concerns the Low Complexity Regions of P. falciparum, the causative agent of the most virulent and deadly form of human malaria. In this work, we analysed the proteomes of 22 Plasmodium species, including P. falciparum. Proteins that SEG predicts to contain Low Complexity Regions turn out to be longer than those predicted to be completely complex (without Low Complexity Regions). Moreover, using well-known bioinformatics tools such as the Effective Number of Codons and the Pr2 analysis, together with a new index that we have called SPI, we observed that proteins that embed Low Complexity Regions are under lower selective pressure than those that do not present this type of locus. By applying the Relative Synonymous Codon Usage and other tools developed ad hoc for this study, we note, instead, that the Low Complexity Regions appear to have a non-neutral codon bias with respect to the host proteins.

Introduction

Due to the critical implications for Human Health, the study of Single Nucleotide Polymorphisms (SNPs) and Copy Number Variations (CNVs) is a fundamental research field (Zhang et al. 2009). Tandem repeat regions (TRRs) represent a third type of genetic variation (Gemayel et al. 2012), whereby proteins host regions of reduced complexity and biased amino acid composition, also called low complexity regions (LCRs) (Toll-Riera et al. 2012). Despite the diversification of the various domains of life, this type of locus is ubiquitous (Kumari et al. 2015), existing in Bacteria, Archaea and Eukarya (Wootton & Federhen, 1996). Taken as a whole, the literature indicates two main mechanisms that are most likely to explain the presence of TRRs and, consequently, their propensity to undergo indels and thereby cycles of expansion and contraction. The first is Replication Slippage (RS) (Guy-Franck et al. 2008; Mosbach et al. 2019; Ellegren, 2004; Gemayel et al. 2012; Saitou, 2018). RS is a mutation process that occurs during DNA replication. It involves denaturation and displacement of DNA strands with the consequent decoupling of complementary bases (Levinson & Gutman, 1987). In a nutshell, a loop-out of the template strand implies a contraction, whilst a loop-out of the nascent strand implies an expansion (Gemayel et al. 2012). The other mechanisms are recombination-like events (Gemayel et al. 2012; Ellegren, 2004; Verstrepen et al. 2005) such as unequal crossing-over and gene conversion. Some propose these as the predominant mechanism in minisatellite regions (Guy-Franck & Paques, 2000), with replication slippage predominating in microsatellite regions (Kokoska et al. 1998). Guy-Franck and colleagues (2008) extensively surveyed the current landscape of definitions of TRRs based on the length of the repeating unit, highlighting a lack of consensus in the definition of microsatellite, where some propose a repetitive unit length of up to 5 nt and others of up to 10 nt (Guy-Franck et al. 2008; Gemayel et al. 2012). These loci have a high mutational rate compared to other DNA regions (Brinkmann et al. 1998), ranging between 10^(…) and 10^(…) per cell generation (Gemayel et al. 2012). Past research has identified some factors that would contribute to the instability of these loci. Gemayel and colleagues (2012) compiled an extensive review of the papers that have dealt with this topic: Legendre et al. (2007) highlighted how the presence of multiple repetitive units is the main characteristic that contributes to the instability of these regions, followed by the length and the nucleotide
purity of the tract, i.e. a coherent succession of nucleotides not interrupted by any base without counterparts in the repetitive sequence (Saitou, 2018). Considering homopolymeric sequences (runs of the same amino acid), Verstrepen and colleagues (2005) highlighted instead how the use of several synonymous codons drastically increases the stability of these loci. The nucleotide content also affects the instability of these regions: poly A or poly T tracts have been observed to be more stable than the corresponding poly G or poly C tracts (Gragg et al. 2002). For a more detailed examination we refer to the original work (Gemayel et al. 2012).

Low Complexity Regions are thought to be the result of tri-nucleotide slippage (Levinson & Gutman, 1987; Mularoni et al. 2006). In proteins, the phenotype associated with these regions is often contradictory. LCRs often have pathogenic implications, such as in neurodegenerative (Gatchel & Zoghbi, 2005) or developmental (Brown & Brown, 2004) pathologies. Nevertheless, LCRs are important for protein fitness. Shen et al. (2004) showed how the Arginine and Serine rich binding sites in the Exonic Splicing Enhancer (ESE) contribute to the assembly of pre-spliceosomes, supporting splicing and related activities. Similarly, Salichs and colleagues (2009) noted that histidine-rich sites are pivotal for sub-cellular localization. More generally, accumulating evidence highlights how these regions are preferentially inserted only in certain functional protein classes (Karlin et al. 2001; Alba & Guigo 2004; Faux et al. 2005) and how they do not interrupt functional domains (Newfeld et al. 1994).

P. falciparum is a protozoan that belongs to the Phylum Apicomplexa. It is the etiological agent of the most severe and lethal form of Human Malaria (Pizzi & Frontali, 2001). This parasite has offered a fruitful source of genetic investigations, also due to its particularly high content of LCRs, mostly characterized by asparagine (N) residues. The functional role of these stretches is not fully understood yet. However, several hypotheses have been advanced. Among others, Pizzi and Frontali (2001) proposed that they are subject to continuous cycles of expansion and de novo generation, representing a source of antigenic variation that ultimately allows the parasite to evade the host immune response. This hypothesis has been supported by other investigators in the field (Karlin et al. 2001; Ferreira et al. 2003; Cortès et al. 2005). On the one hand, and in line with a general requirement for neo-functionalization, it was proposed that long N-repetitive stretches influence the local rate of translation, thus triggering ribosome pausing and ultimately behaving as tRNA sponges that assist co-translational folding (Frugier et al. 2009; Filisetti et al. 2013). On the other hand, Forsdyke (2016) discusses the results obtained with Xue (2003), whereby similar N-repetitive stretches might play a double role at the DNA level. Indeed, since the nucleotide content of the P. falciparum genome is skewed towards AT, the N-repetitive stretches are supposed to stabilize mRNAs and prevent the corresponding genes from undergoing deleterious mutations. Undoubtedly, all of the cited works have provided the scientific community with important but also conflicting explanations for the presence of LCRs in P. falciparum. Recent advances have shown how asparagine promotes the formation of amyloid structures following thermal shocks (Halfmann et al. 2011). As already explained by Muralidharan & Goldberg (2013), P. falciparum copes with extremely variable temperatures during its life cycle, e.g., passing from the relatively low, ambient temperature of the mosquito to the human
body at about 37 °C and vice versa. Moreover, malaria is characterized by numerous cycles of fever, eventually causing the host temperature to exceed 40 °C. Therefore, the continuous thermal fluctuations can cause proteins to unfold or misfold, which could ultimately lead to the death of the parasite. However, although Q has biochemical characteristics similar to N (Strachan et al., 2020) and is similarly prone to form prions, N reduces proteotoxicity (Halfmann et al. 2011). While showing that Heat Shock Proteins prevent the formation of aggregates and prion-like fibrils, Muralidharan et al. (2012) and Muralidharan and Goldberg (2013) proposed, among other remarkable hypotheses, a mechanism to explain how these regions can contribute to the formation of new folds and functions.

In the present work, we studied the proteomes of 22 Plasmodia species, operating on two fronts. On the one hand, the study of the Codon Usage Bias (CUB) of Low Complexity Regions by means of the Relative Synonymous Codon Usage (RSCU, Sharp & Li, 1987) and the application of Shannon's Entropy (Shannon, 1948) provided remarkable insights into the selective pressure acting on these regions. On the other hand, based on the presence of at least one Low Complexity Region, we stratified the proteins into LCPs (Low Complexity Containing Proteins) and non-LCPs (hereafter referred to as nLCPs), the latter characterized by the absence of LCRs. Although this sub-setting was obtained by SEG predictions (Wootton & Federhen, 1996) performed using default parameters, and the results could be improved by further parameter refinement (Batistuzzi et al. 2016), our downstream results displayed internal consistency. The study of protein lengths showed that LCPs are intrinsically longer than nLCPs, pointing to a relationship between the two structural characteristics. To address the issue of Darwinian Selective Pressures, we selected two common bioinformatics tools, i.e. the Effective Number of Codons (Wright, 1990), using the improved version of Sun and colleagues (2013), and a Pr2 analysis (Sueoka, 1995; Sueoka & Kawanishi, 1999). In addition to these two indices, we have devised a completely new one, hereafter referred to as SPI (Selective Pressure Index), that allows distances from Wright's Theoretical Curve to be compared. Together, these three tools allowed us to highlight how, compared to nLCPs, LCPs are characterized by a more heterogeneous CUB and are further distinguished from nLCPs by a lower Selective Pressure. Our results apply consistently to all of the proteomes analysed and provide new insights about Low Complexity Regions.

Methods

Data Sources and general methodologies

Plasmodium proteomes were retrieved from NCBI GenBank (ftp://ftp.ncbi.nih.gov). The Brazilian strain of P. vivax was considered. P. praefalciparum, P. adleri and P. o. curtisi were retrieved from PlasmoDB (https://plasmodb.org). We considered only CDSs that start with methionine (ATG), end with a stop codon (TAG, TAA or TGA) and have a length that is a multiple of three.

Each CDS was translated into the corresponding amino acid sequence by the nt2aa MATLAB function. The remaining analyses were carried out with custom and built-in MATLAB scripts. For some analyses we relied on boxplot visualization. When two boxplots are compared using MATLAB's notch option, non-overlapping notches indicate that the medians of the boxes are significantly different at the 5% significance level (MathWorks).
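The CDS selection rule stated in Data Sources (start with ATG, end with TAG, TAA or TGA, length a multiple of three) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code: the function names `is_valid_cds` and `filter_cds` and the dictionary input format are hypothetical, and the sketch assumes nothing beyond the three stated criteria (for example, it does not check for internal stop codons, which the text does not mention).

```python
# Hypothetical helper names; only the three criteria stated in the text are applied.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_valid_cds(seq: str) -> bool:
    """Return True if seq starts with ATG, ends with a stop codon,
    and has a length that is a multiple of three."""
    seq = seq.upper()
    return (
        len(seq) % 3 == 0
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


def filter_cds(records: dict) -> dict:
    """Keep only the (name, sequence) pairs that pass is_valid_cds."""
    return {name: seq for name, seq in records.items() if is_valid_cds(seq)}


if __name__ == "__main__":
    toy = {
        "ok": "ATGAAATAA",        # passes all three criteria
        "frameshift": "ATGAAATA",  # length not a multiple of three
        "no_start": "CTGAAATAA",   # does not start with ATG
        "no_stop": "ATGAAAAAA",    # does not end with a stop codon
    }
    print(sorted(filter_cds(toy)))
```

Sequences are upper-cased before checking so that soft-masked (lower-case) FASTA records are treated the same way as upper-case ones; this is a design choice of the sketch, not something stated in the text.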
We identified LCRs with the SEG algorithm, run with standard parameters (window = 15, K1 = 1.5 and K2 = 1.8), which ensure that the identified sequences correspond to strongly biased sequences while, at the same time, allowing for substantial sequence diversity (Trilla & Albà, 2012).

nLCPs and LCPs often form groups of different sizes (see Supplementary Materials). Where distributions do not diverge considerably from a normal distribution, we relied on the Welch t-test (Ruxton, 2007; Derrick et al. 2016); otherwise the Mann-Whitney test was used. For multi-comparison tests we relied on Welch ANOVA followed by Bonferroni correction (MT1) when distributions do not deviate considerably from normality, otherwise on the Kruskal-Wallis test followed by Bonferroni correction (MT2).

Tandem Repeat Regions

To extract the DNA of the low complexity regions we used a customized MATLAB script that identifies the LCRs on the proteins and extracts the related DNA from the corresponding gene.

GC, ENC, ENC plot and Pr2 analysis

We used the same rationale proposed by Forcelloni and Giansanti (2020) for the implementation of the ENC (Wright, 1990), using the improved implementation of Sun et al. (2013), the ENC plot, the Pr2 plots (Sueoka, 1995; Sueoka & Kawanishi, 1999) and the calculation of the GC content. Therefore, we refer to Forcelloni & Giansanti (2020) for further details concerning these tools. In this work, ENC and Pr2 scores are calculated separately for LCPs and nLCPs for each parasite.

RSCU

An RSCU analysis (Sharp & Li, 1987) was carried out. RSCU measures codon usage for each codon in each codon family (Xia, 2007). Basically, RSCU is a normalized codon frequency that is expected to be 1 when there is no codon usage bias, greater than 1 when that codon is overused, and less than 1 when that codon is underused. RSCU is formally defined as follows:

$$RSCU_{ij} = \frac{CdnFreq_{ij}}{\frac{1}{n_i}\sum_{j=1}^{n_i} CdnFreq_{ij}},$$

where $CdnFreq_{ij}$ refers to the frequency of codon j in codon family i; $n_i$ is the degeneracy of the codon family i; and $\sum_{j} CdnFreq_{ij}$ is the sum of the occurrences of the codons in that codon family, i.e. the codons that encode the same amino acid. Note that the maximum value that can be expected for a codon depends on the codon family to which it belongs: if the degeneracy of a codon family is n, then the maximum expected value is n. We introduced a modification to the original version of RSCU proposed by Sharp & Li (1987), in line with the division of the 6-fold codon families into 4-fold and 2-fold families proposed by Sun and colleagues (2013) in their improved version of ENC, which we use in our ENC analysis. We deliberately neglect ATG (M) and TGG (W): since each of these amino acids is encoded by only one codon, there cannot be bias. We computed, for each species, the RSCU on the overall codon content of Low Complexity Regions.

Shannon Entropy

To compare the complexity of Tandem Repeat Regions we relied on Shannon's Entropy (Shannon, 1948), computed as follows:

$$H_j = -\sum_i p_{ij} \log(p_{ij}),$$

where $p_{ij} = \frac{n_{ij}}{L_j}$, with $n_{ij}$ the sum of the occurrences of the i-th codon in the j-th tandem repeat of length $L_j$. Entropy is used to represent in a compact way Tandem
Repeat Regions and their bias towards a few or more codons.

SPI

The Effective Number of Codons describes the codon bias of a gene. The distance from Wright's Theoretical Curve provides an estimate of the extent to which Mutational Bias and Selective Pressure affect the codon bias of that gene (Novembre, 2002). On the one hand, the same ENC score can be associated with different distances from the curve; on the other, the same distance in absolute value can, as GC3 varies, represent a different deviation from the expected value given by Wright's Theoretical Curve. Distances in absolute value are therefore not equivalent. For this reason, we propose the SPI (Selective Pressure Index), defined as follows:

$$SPI(GC_3) = \frac{WTC(GC_3) - ENC(GC_3)}{WTC(GC_3) - a},$$

where $WTC(GC_3)$ is the value of Wright's theoretical curve at a given $GC_3$, $ENC(GC_3)$ is the ENC value of the gene at that $GC_3$ value, and a = 20. SPI weights the shift of the coding sequence from the null model of no codon preference (the situation where a gene lies on the curve) against the situation of extreme bias, namely where just one codon is used for each amino acid, for that $GC_3$. When a gene lies on the theoretical curve, SPI is expected to be 0 (no distance from the WTC). Conversely, when a gene is extremely subject to selection, SPI is expected to be 1 (situation of extreme bias).

Cladistic and Organism Overview

We briefly describe how the parasites are phylogenetically related and why we placed them in one subgroup rather than in another. We point out that extending the knowledge about the various radiations that gave rise to these subgroups is beyond our scope. P. falciparum has been placed together with the other parasites belonging to the monophyletic subgenus termed Laverania (Otto et al. 2018; Liu et al. 2016), which comprises P. gaboni and P. reichenowi (that infect chimpanzees), and P. praefalciparum and P. adleri (that infect gorillas). Notably, P. falciparum is the only one that successfully infects humans (Otto et al. 2018); although it was long considered a human-specific pathogen, it also infects gorillas, raising concerns about possible reciprocal host transfer (Prugnolle et al. 2010). As far as the Asian monkey parasites are concerned, we retrieved data for P. vivax, which represents a serious threat for human health given its extremely wide geographical coverage (Howes et al. 2016). P. knowlesi was recently recognized as a human pathogen (Rich & Xu, 2011) given a major outbreak of human infection in 2004 (Singh et al. 2013). Until this outbreak, human infections were thought to be rare (Faust & Dobson, 2015). Moreover, we considered P. cynomolgi and P. coatneyi (Galinski & Barnwell, 2012), P. inui (Galland, 2000) and P. gonderi (Arisue et al. 2019). We refer to these parasites as Simian Plasmodia. To have a blueprint of the evolutionary and quantitative diversification of LCRs, we also considered murine rodent plasmodia (Vinckeia Subgenus). We retrieved P. vinckei (Carter & Walliker, 1975), P. petteri, P. chabaudi, P. berghei and P. yoelii (Garnham 1964). Some evidence links the evolutionary origin of P. falciparum to avian plasmodia (Waters et al. 1991; Waters et al. 1993 a-b; McCutchan et al. 1996; Escalante & Ayala 1994). To extend our efforts we considered the two available specimens of the Haemamoeba Subgenus (Corradetti
et al. 1963): P. gallinaceum, proposed as a possible ancestor of P. falciparum (Waters et al. 1991; Escalante et al. 1998), and P. relictum, which is one of the most geographically widespread malaria parasites of birds (Valkiunas, 2005). We refer to these parasites as Haemamoeba Subgenus. Lastly, we considered P. ovale wallikeri (P. ovale), the etiological agent of tertian malaria (Collins & Jeffery, 2005), P. ovale curtisi (P. o. curtisi) (Kristan et al. 2019) and P. malariae, which causes quartan malaria (Collins & Jeffery, 2007). We refer to these parasites as Human Infectious Plasmodia (HIPs).

Results

GC-Content Influence

Having the possibility to make a statistical comparison among several parasites, we reflected on the role of Guanine and Cytosine content, in order to understand the degree to which the LCRs of P. falciparum, and more generally of the other parasites, are influenced by genomic biases. We considered the Laverania Plasmodia as a yardstick for our analysis. In Tab. 1 we summarize the statistics from the various experiments we performed during the analysis. The first column collects the Pearson correlation coefficients between the amount of Low Complexity Regions contained in proteins and their length. The following columns report the average Guanine and Cytosine content of each protozoan and the quantity of low-complexity regions possessed by each of them. All the correlations are significant (p < 0.01). We have divided the statistics of each column by parasite group. Thus, each column was divided into 5 distributions. MT2, applied to the three distributions, indicates that the Laverania, Haemamoeba and Vinckeia subgroups do not differ significantly in Guanine and Cytosine content. However, the Vinckeia have significantly lower correlations than the P. falciparum group. We report the same as regards the content of Low Complexity residues. Although the Haemamoeba and Laverania Plasmodia do not differ significantly in correlation coefficients, the small sample size of the Haemamoeba group could lead to misleading results. In a nutshell, genomic bias does not seem to confer a particular selective advantage, although we do not deny its role in the choice of synonymous codons.

RSCU

We wanted to study the Codon Bias of the Low Complexity Regions to derive information about the selective pressures acting on their synonymous codons. To do this, we used the Relative Synonymous Codon Usage (RSCU, Sharp & Li, 1987) applied to the overall Codon Content of LCRs. We have deliberately neglected ATG (M) and TGG (W) because, being the only codon for their amino acid, they cannot show bias. The results are displayed in Fig. 1. Data are standardized along each column. If a codon has a value above the column mean it is represented in red, otherwise in green. Specifics regarding the clustering algorithm and distance are provided in the image caption. The heatmap indicates that the Protozoa can be divided into two main lines, although there are differences within each group. The Laverania, Haemamoeba and Vinckeia Plasmodia mainly prefer codons whose wobble position is occupied by Adenine or Thymine, when compared to the remaining subgroups (HIPs and Simian Plasmodia). The latter, on the contrary, have larger RSCU values for codons whose wobble position is occupied by Cytosine or Guanine, compared to the former species. We wanted to investigate the RSCU values within each species further, in order to detect any more specific preferences at the wobble position.
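For readers who want to reproduce RSCU values of the kind discussed in this section, the definition given in Methods can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' MATLAB implementation. It assumes the standard genetic code, excludes ATG (M), TGG (W) and stop codons, and treats the six-fold families (Leu, Ser, Arg) as separate 4-fold and 2-fold families keyed by the first two bases, which is one way to realize the split attributed to Sun and colleagues. The names `codon_family` and `rscu` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption of this sketch).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_family(codon: str) -> tuple:
    """Family key: six-fold amino acids (L, S, R) are split into a 4-fold
    and a 2-fold family according to the first two bases of the codon."""
    aa = CODON_TABLE[codon]
    return (aa, codon[:2]) if aa in "LSR" else (aa, "")


def rscu(sequences) -> dict:
    """RSCU per codon: observed count divided by the mean count of the
    codons in the same family, pooled over all input sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:
                counts[codon] += 1

    # Group codons by family, skipping stops and single-codon amino acids.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa in "*MW":
            continue
        families.setdefault(codon_family(codon), []).append(codon)

    values = {}
    for members in families.values():
        total = sum(counts[c] for c in members)
        if total == 0:
            continue  # family absent from the input: RSCU undefined
        n = len(members)  # degeneracy of the family
        for c in members:
            values[c] = counts[c] * n / total
    return values


if __name__ == "__main__":
    # Toy low-complexity stretch: asparagine codons AAT (x3) and AAC (x1).
    print(rscu(["AATAATAATAAC"]))
```

In this toy input the two-fold Asn family has counts 3 and 1, so AAT gets 3 × 2 / 4 = 1.5 and AAC gets 1 × 2 / 4 = 0.5; the values within a family always sum to its degeneracy, which is why the maximum attainable RSCU depends on the family.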
To do this, we compared the RSCU values of codons ending with a purine (A or G) and codons ending with a pyrimidine (T or C). The Relative Synonymous Codon Usage is a normalized index whose maximum value depends on the degeneracy of the codon family. Therefore, we normalized the RSCU values with respect to the number of codons of the family they belong to (Mann-Whitney). As expected, the Laverania, Vinckeia and Haemamoeba Plasmodia follow the same pattern, showing a clear preference for A over G and T over C in the wobble position (p < 0.01), as do the HIPs (p < 0.01). We find the same pressures in P. gonderi (p < 0.01). Interestingly, we do not find the same stark split in many of the Simian Plasmodia, such as P. vivax, P. cynomolgi, P. fragile and P. inui, whose wobble position appears to be contested in a tug-of-war between purines and pyrimidines, with no significant differences (p > 0.05). Evolutionary pressures are slightly different in P. coatneyi and P. knowlesi. The first has a preference for G over A and T over C (p < 0.05). The second presents wobble positions better suited to its GC content, preferring G to A and C to T (p < 0.05). The RSCU table is provided in the Supplementary materials (RSCU.xlsx). Overall, we obtained information about the synonymous codons used in these organisms. However, RSCU is a normalized index which neglects the quantitative contribution of a codon family which, in contrast, could be selectively avoided.

Shannon Entropy and Complexity of Tandem Repeats Regions

We investigated the codon composition of the LCRs of each plasmodium using Shannon Entropy to understand their structural arrangement and derive a rationale for inferring the presence of different selective pressures. Fig. 2(A) illustrates the distributions of the 5 parasitic groups. Best fit parameters and models are given in the image caption. The elbow of the regressions shows that, on average and for the same length, the LCRs of the Laverania Plasmodia are less complex than those of the other parasitic groups, especially HIPs and Simian spp. To have a better understanding of the phenomenon we studied the single distributions, as shown in Fig. 2(B). Preliminary observation of the violin plots shows that Plasmodium species arrange their TRRs differently: as entropy increases, a greater number of codons is used along the same TRR. We applied MT1. The Bonferroni Correction indicates that the Laverania Plasmodia use fewer codons (synonymous or non-synonymous) along the extension of their LCRs, as opposed to other species whose Low Complexity Regions appear more complex. The only parasite for which the comparison with the Laverania Plasmodia is not significant is P. berghei. Fig. 2(C) shows the correlation between the average complexity of TRRs and the average Guanine and Cytosine content of each organism. In general, average TRR complexity increases with the GC content, even if the fluctuations suggest the presence of other forces in determining the nucleotide composition of these regions.

Codon Bias

We analyzed the codon bias of the Low Complexity Regions from a quantitative point of view, since the RSCU cannot return quantitative information regarding the use of codons (Fig. 3). The plotted quantity represents the ratio between the total quantity of a codon and the total number of codons present within the Low Complexity Regions of a parasite. All the species,
506 although in different proportions, show an increase in 543 which appear to have the longest LCPs among all
507 the codons used less commonly than those with higher 544 parasites and P.berghei which instead appears to have
508 percentages, unlike the Laverania Plasmodia which 545 the shortest, no differences emerge such as to justify
509 have a skewed distribution towards a few species of 546 the overabundance of LCRs typical of the Laverania
510 codons. 547 Plasmodia. Overall, LCPs emerge longer de facto than
548 nLCPs which weaves, in general, a causative link 511 Protein Length 549 between protein length and the presence of Low
512 Given the positive correlations between LCRs and 550 Complexity Regions. st 513 protein length we observed in 1 paragraph, we 551 ENC 514 analyzed the protein length of each Plasmodium by
515 stratifying proteomes into two distributions, i.e., LCPs 552 We relied on the Effective Number of Codons
and nLCPs. We compared the two distributions in each organism (Welch t-test) (Fig. 4 (a)), represented in red and green, respectively. LCPs are significantly longer than nLCPs in each parasite (p << 0.01). We then considered the linear extension of the LCRs. Taken individually, LCRs do not exceed 250 amino acids in length (see Supplementary Materials). However, it is not uncommon to find more than one Low Complexity Region within the same protein. Therefore, to rule out the possibility that the greater length of LCPs is merely an artifact of the presence of these stretches, we studied the ratio between LCR length and protein length (Fig. 4 (b)). LCRs occupy a fairly small space, rarely reaching 6% of the entire linear extension of the polypeptides. We therefore removed the Low Complexity Regions from the LCPs and repeated the comparison between LCPs and nLCPs (Fig. 4 (c)) (Welch t-test). LCPs are intrinsically longer than nLCPs (p << 0.01). Even though the graph shows no statistically significant difference between the length distributions of the LCPs of the various species, we applied MT1 to them to understand whether there could be differences able to explain the overabundance of LCRs typical of the Laverania Plasmodia. Apart from P. gonderi and P. malariae

(Wright, 1990). We utilized the improved implementation proposed by Sun and colleagues (2013). ENC is a widely used index that, unlike the Codon Adaptation Index (CAI) (Sharp & Li, 1987), does not require a reference set. We calculated ENC scores separately for LCPs and nLCPs, represented in red and green, respectively. The Laverania, Vinckeia and Haemamoeba Subgenera (Fig. 5 (A) (C) (D)) show similar distributions placed on the left side of the ENC plane, highlighting a low GC content in the wobble codon position. Interestingly, the nLCPs of P. yoelii, despite mainly grouping on the left side of the ENC plot, are distributed throughout the plane following Wright's Theoretical Curve. HIPs (Fig. 5 (B)) have a content halfway between the former Subgenera and the Simian Plasmodia, whose ENC plots, in turn, are in line with other works (Gajbhiye et al. 2017; Yadav & Swati, 2012). Remarkably, P. vivax and P. cynomolgi have a portion of their nLCPs positioned on the left side of the ENC plane (Fig. 6), which further emphasises their close phylogenetic relationship (Sanger Institute). All groups but the Simian Plasmodia show ENC values that vary linearly with GC3. We collected the Pearson correlation coefficients of ENC vs GC3 in Tab. 2, together with the regression slopes of both nLCPs and LCPs. Given the shape of its distributions, we also added P. gonderi.
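The ENC plots discussed above are read against Wright's Theoretical Curve. As a minimal sketch, assuming the standard form of that curve from Wright (1990) and taking GC3 to be the G+C fraction at synonymous third codon positions, the ENC expected when codon usage is shaped by base composition alone can be computed as follows. The function names are illustrative and are not taken from the authors' MATLAB code; `enc_gap` is a plain vertical distance and is not the SPI.

```python
def wright_expected_enc(gc3: float) -> float:
    """Expected ENC for a gene whose codon usage depends only on GC3 (Wright, 1990)."""
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


def enc_gap(observed_enc: float, gc3: float) -> float:
    """Vertical distance between the curve and an observed ENC value.

    Hypothetical helper for illustration only: positive values mean the
    gene lies below the curve.
    """
    return wright_expected_enc(gc3) - observed_enc


if __name__ == "__main__":
    # An AT-rich gene (GC3 = 0.2) with an observed ENC of 38.
    print(round(wright_expected_enc(0.2), 2))
    print(round(enc_gap(38.0, 0.2), 2))
```

Genes that fall well below the curve are usually read as carrying codon bias beyond what base composition alone would predict, which is the reading applied to the ENC planes in this section.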
All the correlations are significant (p < 0.01). We compared the slope distributions of nLCPs and LCPs (Mann-Whitney test). LCP regressions are, on average, represented by steeper slopes (p < 0.01). The Mann-Whitney test indicates a greater influence of [...] pressure on LCPs. In Tab. 3 we collected the [...] of the regression curves we calculated for the Simian Plasmodia. Models and parameters, calculated with 95% confidence bounds, are provided in the caption of Fig. 6. Generally, the regression curves of LCPs tend to lie above those of nLCPs. To support this analysis statistically, we compared the ENC scores of LCPs and nLCPs parasite-wise (Fig. 7) (Welch t-test). Most of the comparisons show that LCPs have a more redundant codon bias than nLCPs (p << 0.01). The test fails only for P. chabaudi, P. yoelii and P. berghei (p > 0.05), even though their medians differ with 95% confidence. In a nutshell, LCPs have a more redundant CUB, which suggests a stronger action of mutational bias than of Darwinian selective pressure (positive or negative).

SPI

The analysis of the Effective Number of Codons revealed interesting details about the codon bias of LCPs and nLCPs which, on average, seem to differ in the extent to which selective pressure and mutational bias affect their codon choice. The problem with the ENC is that distances, in absolute value, from Wright's Theoretical Curve are not equivalent to each other; therefore, two equal ENC values can represent different contributions of selective pressure and mutational bias. For this reason we applied the SPI (Fig. 8), repeating the comparison already made in Fig. 7 (Welch t-test). The SPI analysis corroborates what was first observed with the ENC. Indeed, LCPs, on average, are under a lower selective pressure than nLCPs (p << 0.01), diverging less from Wright's Theoretical Curve in each parasite. The medians of each pair differ with 95% confidence. We evaluated the hypothesis that a lower selective pressure could favour a greater pervasiveness of LCRs in the Laverania Subgenus. We therefore applied MT1 to the LCP distributions. The test returns contradictory values. In fact, although the Simian group shows the highest SPI values among all the parasites considered here, some parasites of the Vinckeia and Haemamoeba groups show values similar to or lower than those of the P. falciparum group. Thus, since the LCPs of AT-rich groups undergo similar patterns of selective pressure, the hypothesis that a greater mutational bias increases LCRs does not hold. For a better understanding, we refer to the interactive MATLAB plot provided in the Supplementary Materials. Duret and Mouchiroud (1999) noted a negative correlation between protein length and codon bias in C. elegans, D. melanogaster and A. thaliana. We therefore decided to repeat their analysis using the SPI (Fig. 9). In the proposed plane, each data point (Laverania Plasmodia) represents a protein, with length (abscissa) measured in nucleotides and SPI values on the ordinate. The regression model (MATLAB fit function) and its parameters, calculated with 95% confidence intervals, are provided in the figure caption. Again, we point out that the LCPs have been deprived of their LCRs. The graph highlights how the contribution of mutational bias tends to increase with protein length for both nLCPs and LCPs: selective pressure decreases as protein length increases. Similar trends are also found for the other parasite groups, whose graphs and models are provided in the Supplementary Materials. Recalling the correlations between LCRs and protein length observed in the first paragraph, these trends contain further hidden information, namely that the lower the
selective pressure, the more LCRs are inserted. This establishes, more clearly, a relationship between mutational bias and Low Complexity Regions.

Pr2

To strengthen our confidence regarding the Darwinian pressures shaping the CUB of Plasmodium species, a Parity Rule 2 (Pr2) analysis was performed (Fig. 10). Once more, we calculated Pr2 scores separately for LCPs and nLCPs, represented in red and green, respectively. Pr2 plots are provided with the centroids of the nLCP and LCP distributions. The Laverania, Vinckeia and Haemamoeba Subgenera show distributions located near the centre of the Pr2 plots, even though their tails tend to diverge towards the outer edges of the Pr2 planes, confirming the action of both selective pressure and mutational bias. Unlike the Vinckeia Subgenus, whose parasites have both their centroids in the second quadrant of the Pr2 plot, the Laverania and Haemamoeba Subgenera have their LCP centroids in the first quadrant, showing, to a certain degree, a preference for A and G in the wobble codon position of their 4-fold codon families. Unlike in the Haemamoeba Subgenus, the nLCP centroids of the Laverania parasites appear in the second quadrant of the Pr2 plane, with a consequent preference for C and G, in contrast to the preference for A and G reported for bird Plasmodia. Considerations similar to those for the Haemamoeba Plasmodia can be drawn for HIPs. Simian Plasmodia show more rounded distributions, representative of their even GC content. Interestingly, recalling the nLCPs of P. berghei that were distributed throughout the ENC plane and the AT-rich islands of P. vivax and P. cynomolgi, these three clusters disappear in the respective Pr2 plots, underlining that, despite the differences evidenced through the ENC analysis, the 4-fold codon families of these protein-coding genes undergo selective trends similar to those of the other nLCPs, vanishing in the green clouds. The Pr2 plot has a vectorial nature that allows one to discriminate among the various pressures driving protein-coding genes into one quadrant of the Pr2 plane rather than another. Thus, the diversified selective pressures acting on protein-coding genes can, mathematically, cancel each other out, making the centroids under-representative. Therefore, following Forcelloni & Giansanti (2020), we calculated the distribution of gene deviations from the Pr2 centre as:

$$r = \sqrt{\left(\frac{G_3}{G_3 + C_3} - 0.5\right)^2 + \left(\frac{A_3}{A_3 + T_3} - 0.5\right)^2}$$

where A_3, T_3, G_3 and C_3 denote the nucleotide counts at the third position of the 4-fold codon families. The distance r better indicates the extent of Pr2 violations. In particular, if r = 0 the genes show a perfect balance between natural selection and mutational bias. If r > 0, natural selection and mutational bias provide different contributions, and larger values of r correspond to a greater Pr2 violation (Forcelloni & Giansanti, 2020). We compared the r distributions of nLCPs and LCPs parasite-wise (Welch t-test) (Fig. 11). As in the other comparisons (ENC and SPI), LCPs tend to stay closer to the parity centre than nLCPs do (p << 0.01 on average; P. fragile p < 0.01). We were further intrigued by the relationship between Pr2 violation and the length of protein-coding genes. To reveal it, we used a 3D space where the (X, Y) plane is the Pr2 plane and the vertical axis is gene length. Given the layout difficulties and the redundancy of the graphs, we show here only the Laverania Plasmodia (Fig. 12). The remaining parasites are provided in the SM. We observe that longer proteins tend to cluster near the centroids, which in turn do not deviate substantially from the parity centre, a characteristic in accordance with what was predicted through the SPI analysis. So, LCRs are more abundant where the Pr2 deviation is smaller. Overall, the Pr2 analysis supports what we stressed through the SPI analysis.
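The Pr2 coordinates and the deviation r used above can be illustrated with a short sketch. It assumes, following the usual Pr2 convention, that a gene is placed at (G3/(G3+C3), A3/(A3+T3)), computed on the third position of the 4-fold codon families of the standard genetic code, and that r is the Euclidean distance of that point from the centre (0.5, 0.5). The function names and the prefix table are ours, introduced only for illustration, and do not reproduce the authors' pipeline.

```python
import math

# Assumption: two-letter prefixes of the eight 4-fold degenerate codon
# families in the standard genetic code (Ala, Gly, Pro, Thr, Val, and the
# 4-fold blocks of Leu, Ser and Arg).
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GT", "CT", "TC", "CG"}


def pr2_coordinates(cds: str) -> tuple:
    """Return (G3/(G3+C3), A3/(A3+T3)) over the 4-fold codons of a coding sequence."""
    cds = cds.upper().replace("U", "T")
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in counts:
            counts[codon[2]] += 1
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    if gc == 0 or at == 0:
        raise ValueError("not enough 4-fold codons to compute Pr2 coordinates")
    return counts["G"] / gc, counts["A"] / at


def pr2_distance(cds: str) -> float:
    """Euclidean distance r of a gene from the Pr2 centre (0.5, 0.5)."""
    x, y = pr2_coordinates(cds)
    return math.hypot(x - 0.5, y - 0.5)


if __name__ == "__main__":
    balanced = "GCAGCGGGTGGC"   # third positions A, G, T, C
    skewed = "GCAGCAGCGGCG"     # third positions A, A, G, G
    print(pr2_distance(balanced), pr2_distance(skewed))
```

Working with r rather than with centroids avoids the cancellation between genes lying in opposite quadrants, which is the reason given in the text for comparing r distributions.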
The Pr2 analysis highlights, through 4-fold codons, the relationship between mutational bias and LCRs.

Discussion

Malaria is one of the most severe public health problems worldwide, being the leading cause of death and disease in many developing countries, with an estimated 405,000 deaths in 2018 (CDC). Comparative genomics is a powerful tool to unravel evolutionary changes among organisms, helping to identify the biological characteristics that give each life form its unique attributes (Nature Education). In this study, we derived a set of information regarding the selective pressures acting on 22 Plasmodium species in order to advance a hypothesis regarding the nature of LCRs in P. falciparum.

GC content

In line with the literature (Zilversmit et al. 2010; DePristo et al. 2006; Frugier et al. 2009; Filisetti et al. 2013; Pizzi & Frontali, 2001; Xue & Forsdyke, 2003), the P. falciparum proteome presented low levels of Guanine and Cytosine. In this regard, the analysis points to a marginal relationship between genomic bias and the abundance of Low Complexity Regions. Our statistical comparisons do not support the hypothesis that the preponderance of LCRs in the Laverania family is driven by genomic bias. In P. falciparum, Hamilton and colleagues (2016) showed in vitro a pronounced excess of G:C to A:T transitions, ultimately attributing the genomic skew of these parasites to mutational bias. Assuming that similar transitions also occur in the other AT-rich parasites, genomic bias alone does not explain the extreme diversity of these organisms in the quantity of both LCRs and LCPs, in which the Laverania family seems to be particularly abundant (see SM). In this sense, we agree with Chaudhry et al. (2018) in regarding the role of genomic biases as marginal. On the other hand, we do not deny that GC content has a strong influence on the choice of synonymous codons.

RSCU and LCRs

We applied RSCU (Sharp & Li, 1987) to the DNA of LCRs. We identify similar preferences in the most used codons of the AT-rich species, while Simian Plasmodia display a shuffled repertoire of RSCU values that is in line with their even GC content. Clustering analysis (Ward, 1963) produces relevant observations that are consistent with the literature. It divides the species into two main evolutionary lines, emphasizing a mutual difference in their G/C or A/T wobble positions. An evolutionary blueprint of the Protozoa is indeed reflected, conserving the sister relationship between P. cynomolgi and P. vivax (Tachibana et al. 2012; Hayakawa et al. 2008). P. gonderi is placed in the same lineage as the Simian Plasmodia, stressing once more an African origin for these parasites (Arisue et al. 2019), which also appear to undergo similar pressures in the choice of synonymous codons for their LCRs. Remarkably, the branching of the Laverania family is consistent with data from various other works (Otto et al. 2018; Silva et al. 2011; Rich et al. 2009), highlighting a common ancestry with Vinckeia (Ramiro et al. 2012) and Haemamoeba (Waters et al. 1991; Escalante et al. 1998), given the striking similarity of the CUB of their LCRs.

On the other hand, the heatmap returned a comparison between the RSCU values among all species, blurring the choice within the single ones. The analysis of the normalized RSCU values illustrates for the species belonging to the Laverania, Vinckeia and Haemamoeba Subgenera, confirming what was seen
through the heat map, a clear pressure for wobble positions in A or T. Similar conclusions are drawn for P. gonderi and for HIPs, despite their having been placed in the second evolutionary group by the clustering algorithm. The situation of the species belonging to the rest of the Simian family is different. In fact, the action of drift (Bulmer, 1991) is more pronounced, as most of the time we do not find a stark preference between wobble positions in A/G or T/C. So the LCRs of Simian Plasmodia, compared to the others, appear to be under a lower selective pressure, these pressures being apparently disputed in a tug of war between multiple synonymous codons.

Complexity

We used Shannon's Entropy to compare the complexity of the TRRs belonging to these micro-organisms. We have shown that the LCRs of Laverania Plasmodia are composed of fewer types of codons than those of other parasites: for the same length, the Laverania Plasmodia have less complex LCRs than the other species analysed here. Furthermore, the correlation between average complexity and GC content suggests that genomic bias cannot be the only contributor to the codon bias of LCRs. Among the various factors contributing to the instability (the proclivity to expand or contract) of a TRR are the purity of the nucleotide tract (Ngai & Saitou, 2016; Saitou, 2018; Gemayel et al. 2012; Legendre et al. 2007) and its length (Gemayel et al. 2012; Legendre et al. 2007), the nature of the codon bias upstream of a repetitive sequence of amino acids (Verstrepen et al. 2005), and the nucleotide composition (Gragg et al. 2002). As stated, on average, the LCRs of Laverania Plasmodia appear to be shaped by a lower number of codon species than those of the other species, which on average exhibit higher complexity: this suggests a greater instability. Moreover (see Supplementary Materials), the predominance in the Laverania Subgenus belongs to the AAT codon, which covers an interquartile range between 35% and 80% of the length of the LCRs. In the other species rich in Adenine and Thymine, this dominance is contested by the GAA (E), GAT (D) and AAA (K) codons. Asparagine has proven to be dispensable in P. falciparum proteasome lid subunit 6 (Rpn6) (Muralidharan et al. 2010). However, returning to the question already posed in a more general context by Gemayel and colleagues (2012), namely whether a repetitive unit is chosen for its mutability or whether its mutability is rather rewarded for its effect on fitness, if asparagine were chosen for the first reason we would expect a similar pervasiveness in the other AT-rich parasites as well. Furthermore, judging from the SPI values and the Pr2 violations, the LCPs of the AT-rich parasites are under similar evolutionary pressure. Therefore, attributing the massive presence of asparagine in the Laverania Plasmodia to a lack of selective constraints against its diffusion does not appear correct since, if this were the case, we would expect, again, a much more abundant presence of N in the Haemamoeba and Vinckeia Plasmodia as well. Hence, the LCRs appear to be selected at the nucleotide level. As previously hypothesised by Dalby (2009), we infer asparagine to be under positive selection.

Codon Bias

RSCU has a broad spectrum of maximum values which, however, may not reflect the predominant use of a codon family, since a family may be little used despite high RSCU values. For this reason, we wanted to study the codon composition of LCRs quantitatively. The observation of the bar charts allows us to consider
some important aspects of the composition of the LCRs. The first concerns the most used codon species, where the preponderance of asparagine (mostly encoded by AAT) is particularly accentuated in the Laverania family. In fact, the codons most present in all species in general encode E, S, D and R, which represent the most common substitutions for N and K (NCBI-N, NCBI-K). The NCBI tables stress the similarity of the physico-chemical characteristics of these amino acids. Although a more in-depth study is necessary to better visualize the composition of LC runs, we observe a certain tendency in all species to preserve the CUB of these loci, since amino acid changes appear to be primarily conservative, i.e. the properties of the polypeptide do not change drastically (Strachan et al. 2020). Notably, the LCPs of the Simian Plasmodia lie in the central part of the ENC plots, which denotes a higher GC3 content. In spite of this, it is not uncommon to observe that the linear extension of the LCRs of Simian spp. is composed of codons rich in Adenine, as is the case for GAA (E), which rivals its synonym GAG (see Supplementary Materials), and in general for other codons that go against the GC pressure which appears to act on their proteomes.

Protein Length

We observed that LCPs are in fact longer than nLCPs. This implies that LCRs are preferentially inserted into the longest proteins of each parasite. Pizzi and Frontali reported that some P. falciparum proteins are longer than their orthologues in other organisms (Pizzi & Frontali, 2001) by virtue of the presence of LCRs, as also found by Xue and Forsdyke (2003), who report that these are placed between domains to which a specific function can be ascribed. Therefore, we hypothesize that longer proteins are selected to host LCRs because there is less chance of disrupting a functional domain, by squeezing them into interdomain regions. This would also explain the positive correlations between protein length and LCR abundance found in each parasite (Tab. 1). It is not clear whether recombination or replication slippage is the predominant mechanism. The lack of consensus in distinguishing between micro- and minisatellite regions (Guy-Franck et al. 2008) makes it difficult to propose a generalized hypothesis discriminating between the two mechanisms; in general, replication slippage is suggested for microsatellites (Guy-Franck & Paques, 2000) and recombination-like phenomena are suggested for minisatellite regions (Kokoska et al. 1998) and less complex LCRs (Ellegren, 2004). Notably, in P. falciparum, Zilversmit et al. (2010) suggest unequal crossing-over events for GC-rich LCRs, which can likely be extended to the other Laverania spp. Overall, more evidence is needed to establish with greater reliability that LCRs space functional domains without disrupting them; this is left for future research.

Darwinian Pressures

Codon Bias and Gene Expression

To investigate Darwinian selective pressures, we relied on the Effective Number of Codons (Wright, 1990), using the improved version of Sun et al. (2012), the SPI, and Pr2 (Sueoka, 1995; Sueoka & Kawanishi, 1999). These analyses consistently apply to each proteome, stressing different equilibria between selective pressure and mutational bias in both LCPs and nLCPs. Overall, this set of tools indicates that the CUB of LCPs is more strongly influenced by mutational bias than by selective pressure and
that the contribution of the latter tends to decrease with the length of the protein. Optimal codon bias is often attributed to high levels of gene expression (Bulmer, 1991; Sharp & Li, 1986). Furthermore, it is assumed that a shorter protein requires less time to fold (Williamson, 2017) and fewer molecular chaperones (Lipman et al. 2002) to perform their surveillance function (Daniyan et al. 2019), with less space reserved for its folding, ultimately avoiding aberrant interactions with other cellular components (Williamson, 2017; Hartl & Hartl, 2002). This would suggest a better translation efficiency and a higher expression of nLCPs. Duret & Mouchiroud (1999) showed this in C. elegans, where, as protein length increases, selective pressure decreases and gene expression decreases as well. However, transcriptomic examinations have shown that gene expression in P. falciparum is substantially different between field isolates and in vitro cultured samples, and that this variability is associated with Copy Number Variations (Mackinnon et al. 2009). Other studies indicated that gene expression in P. falciparum varies according to the applied pharmacological stressor (Hu et al. 2009). Moreover, in vitro proteomics data show that the amount of proteins expressed in P. falciparum follows a very precise time course (Foth et al. 2011). A similar trend was also observed in P. vivax (Bozdech et al. 2008). This suggests that the gene expression of these parasites may vary depending on external factors and on the stage of the life cycle in which they are found. Therefore, the association of the two gene categories (nLCPs and LCPs) with a lower or higher gene expression is controversial and not easily established.

SPI and Pr2 violations: Positive and Negative Selective Pressure

The SPI is intrinsically linked to the nature of the Effective Number of Codons (Wright, 1990), and we suggest using it together with the improved version of Sun and colleagues (2012) or any other improved version of the ENC, such as that of Fuglsang (2006). MT1 showed the SPI to be in line with the purpose for which it was conceived. Compared to the Laverania Plasmodia, Simian spp. show a more heterogeneous codon bias (their ENC values are larger, see Fig. 7). The LCPs of Simian Plasmodia show the highest SPI values, underlining that a higher number of codons used does not necessarily indicate a lower selective pressure. We agree with Gajbhiye and colleagues (2017) in finding a greater selective pressure on the proteins of P. vivax than on those of P. falciparum, a finding that also holds for the other members of its family. However, the SPI is not able to distinguish between positive and negative pressures; it can only establish the divergence from Wright's Theoretical Curve, studying how closely the CUB of a gene approximates a situation of extreme bias (20 codons for 20 amino acids). Similarly, Pr2 (Sueoka, 1995; Sueoka & Kawanishi, 1999) measures the shift from Parity Rule 2, returning, like the SPI, a departure from a situation of equilibrium. The genomes of malaria parasites suffer, for many genes, from an apparent lack of orthologues (Gardner et al. 2002; Hu et al. 2009), which complicates the application of methods such as that of Nei and Gojobori (1986). Addressing these problems is therefore left to future efforts.

Selective Pressure on LCPs and LCRs abundance

The negative correlation between selective pressure and protein length, particularly in the Laverania Plasmodia, hides information. In fact, looking at the correlation between protein length and LCR
abundance (Tab. 1), it can be observed that proteins with a smaller deviation from Wright's Theoretical Curve and a smaller violation of Pr2 contain more Low Complexity Regions. As the complexity of LCRs increases, a functional role is suggested (Zilversmit et al. 2010). However, for instance, contrary to the tendency of long E runs to generate distortions in a polypeptide chain (Karlin et al. 2001), in vitro assays highlight how glutamate-rich (E) areas in P. falciparum help drive the immune response away from the functional domains of the proteins, thus evading the host's immune system (Hou et al. 2020). Therefore, a functional role is possible even for pure or almost pure amino acid runs, and classifying pure LCRs as neutral insertions may not be correct in all circumstances. A hypothesis of interest is that of functional amyloid, which states that many organisms have evolved to take advantage of the tendency of some residues to form this type of fibrils (Fowler et al. 2007). A shared opinion is that intracellular parasites possess simplified genomes to adapt to the host (Daniyan et al. 2019). These simplifications are believed to lead to the expression of mutated proteins prone to aggregation (Daniyan et al. 2019). Despite the tendency of asparagine to form amyloid fibrils (Halfmann et al. 2011), there is reportedly no in vivo evidence of such aggregates in P. falciparum (Muralidharan et al. 2010), which suggests that the parasite has learned to take advantage of, or at least to manage, this amino acid. Fairly recent studies (Halfmann et al. 2011) have clearly highlighted how asparagine-rich LCRs reduce the toxicity of proteins in yeast. Given the negative correlation between selective pressure and LCR abundance, it would be interesting to look for a possible co-occurrence of detrimental polymorphisms and asparagine, which would suggest a buffer role for these amino acid regions. Overall, these correlations are puzzling and open a large number of avenues for new hypotheses and research.

Criticisms, Possible Refinements and Future Addresses

An important point concerns the SEG parameters. In fact, the standard parameters allow the identification of regions strongly polarized towards a certain species of amino acid while also allowing LCRs with a more heterogeneous repertoire of amino acids to be found (Trilla & Albà, 2012). On the other hand, enlarging the SEG window reduces the number of LCRs identified, whereas smaller windows (such as W = 6) allow a larger number of LCRs to be observed (Batistuzzi et al. 2016). Comparing the statistics of our work with others (Chaudhry et al. 2018) on a qualitative level, the identified amino acids are not starkly different from those of Chaudhry et al. (2018), and the proportions between nLCPs and LCPs are consistent with those they found (see SM). However, critical sense suggests assessing the subjectivity with which SEG is configured, finding a trade-off between the classical approach with standard parameters and a more refined one, such as that proposed by Batistuzzi and colleagues (2016), through which to identify any artificial classification in the stratification of LCPs and nLCPs. Likewise, it is important to point out that GC content often complicates genome assemblies (Chen et al. 2013). Even though what is reported in the Supplementary Materials, where similar results for SPI and protein length are found in other genome assemblies (same species, different data), confers a certain robustness on our observations, the risk is not absent. Shannon's Entropy, as used in this work, made us understand important
1128 10.1073/pnas.0807404105. Epub 2008 Oct 13. PMID: 1091 characteristics regarding the composition of LCRs.
1129 18852452; PMCID: PMC2571024. 1092 However, making this implementation more general 1130 Brinkmann B, Klintschar M, Neuhuber F, Hühne J, Rolf B
1093 and more widely usable is left to future work. Finally, 1131 (June 1998). "Mutation rate in human microsatellites: influence
1132 of the structure and length of the tandem repeat". American 1094 a functional enrichment in GO terms could indicate the 1133 Journal of Human Genetics. 62 (6): 1408-15. doi: 10.1086 / 1095 phenotype associated with the proteins where the 1134 301869. PMC 1377148. PMID 9585597
1096 LCRs are inserted, providing with more clues the 1135 Brown LY, Brown SA. Alanine tracts: the expanding story of
1136 human illness and trinucleotide repeats, Trends Genet., 2004, 1097 rationale to infer the directionality Darwinian 1137 vol. 20 (pg. 51-58) 1098 Selective Pressure. 1138 Bulmer M. The selection-mutation-drift theory of synonymous
1139 codon usage. Genetics. 1991;129(3):897-907 1099 Conclusion 1140 Carter, R. and Walliker, D. 1975: New observations on the
1141 malaria parasites of rodents of the Central African Republic 1100 Overall, in this study we tackled the proteome analysis 1142 - Plasmodium vinckei petteri subsp. nov. and Plasmodium
1101 of 22 plasmodia providing important insights into the 1143 chabaudi Landau, 1965. Ann. Trop. Med. Parasitol., 69: 187-
1144 196 1102 pervasiveness of Asparagine Low Complexity 1145 Chaudhry, S.R., Lwin, N., Phelan, D. et al. Comparative 1103 Regions. Extending the present research to other 1146 analysis of low complexity regions in Plasmodia. Sci
1104 Apicomplexa would lead to a deepened knowledge of 1147 Rep 8, 335 (2018). https://doi.org/10.1038/s41598-017-18695-
1148 y 1105 the biology of these parasites, reaching a better 1149 Chen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of 1106 understanding of their evolutionary success. 1150 GC bias in next-generation-sequencing data on de novo genome
1151 assembly. PLoS One. 2013;8(4):e62856. Published 2013 Apr 1107 Contribution 1152 29. doi:10.1371/journal.pone.0062856
1153 Collins WE, Jeffery GM. Plasmodium malariae: parasite and 1108 C.A. and G.A. conceived the study. C.A. conducted 1154 disease. Clin Microbiol Rev. 2007;20(4):579 592. ‐ 1109 the analyses. C.A. and G.A. wrote the main 1155 doi:10.1128/CMR.00027-07
1156 Collins WE, Jeffery GM. Plasmodium ovale: parasite and 1110 manuscript. C.A. A.G. and F.S. read and approved the 1157 disease. Clin Microbiol Rev. 2005;18(3):570 581. ‐ 1111 final manuscript. 1158 doi:10.1128/CMR.18.3.570-581.2005
Fig. 1 Heatmap of the RSCU values of the Low Complexity Regions. Codons are placed on the columns. Plasmodia on the rows. Data is standardized along the columns. Ward's Linkage was used. Row and column pairwise distance was calculated with Standardized Euclidean Distance. Data were standardized along rows. The MATLAB clustergram function was used.
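For readers who want to reproduce the quantity shown in the heatmap, RSCU is conventionally defined as the observed count of a codon divided by the mean count of all codons synonymous with it. The following is a minimal Python sketch of that definition, not the MATLAB pipeline used for the figure. The function name `rscu`, the truncated codon table and the toy counts are assumptions made for illustration only.

```python
# Hypothetical, truncated synonymous-codon table (asparagine and lysine only).
# A real analysis would list every amino acid of the genetic code.
SYNONYMOUS = {
    "Asn": ["AAT", "AAC"],
    "Lys": ["AAA", "AAG"],
}

def rscu(codon_counts):
    """Return RSCU per codon: observed count / mean count in its synonymous family."""
    values = {}
    for codons in SYNONYMOUS.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the sample: RSCU is undefined
        mean_count = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / mean_count
    return values

# Toy counts (invented): AAT used three times, AAC once.
toy = rscu({"AAT": 3, "AAC": 1})
```

A value above 1 means the codon is used more often than expected under uniform synonymous usage, and a value below 1 means it is used less often.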
Fig. 2 (A) Regression of Entropy vs LCRs length. Length of LCRs is measured in nucleotides. Each group is fitted with the model f(x) = a*exp(b*x) + c*exp(d*x). Model parameters are provided with their 95% confidence bounds:
Laverania: a = 1.918 (1.885, 1.951), b = -0.0003328 (-0.0004543, -0.0002113), c = -5.435 (-5.793, -5.078), d = -0.05285 (-0.05514, -0.05056)
Simian: a = 2.41 (2.368, 2.452), b = -0.0002985 (-0.0004451, -0.0001519), c = -39.41 (-47.79, -31.02), d = -0.1012 (-0.1077, -0.09478)
Vinckeia: a = -1.006e+04 (-4.838e+12, 4.838e+12), b = 0.002788 (-52.56, 52.57), c = 1.006e+04 (-4.838e+12, 4.838e+12), d = 0.002788 (-52.55, 52.56)
Haemamoeba: a = -2.832 (-3.196, -2.469), b = -0.04565 (-0.05072, -0.04057), c = 2.018 (1.956, 2.08), d = -0.0003467 (-0.0004884, -0.000205)
HIPs: a = -9.49 (-10.81, -8.165), b = -0.06278 (-0.0674, -0.05817), c = 2.357 (2.291, 2.423), d = -0.0006497 (-0.0008514, -0.0004479)
(B) Illustration of the average complexity of each Low Complexity. The mean of each distribution is represented by a data point of the same colour as the outline of the violin plot. The trends of the means and medians are represented by the yellow and red lines, respectively. The Violin Plot function was taken from GitHub-Matlab.
(C) Trend of the sample averages of the distributions of Fig. 2 (B) with respect to the average Guanine and Cytosine content of each parasite.
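As an illustration of how the two-term exponential model of Fig. 2 (A) can be evaluated, the sketch below plugs the point estimates from the caption into f(x) = a*exp(b*x) + c*exp(d*x). It is a minimal Python example, not the original fitting code. The names `FITS`, `exp2` and `predicted_entropy` are assumptions. The Vinckeia fit is left out on purpose: its reported bounds span roughly ±4.8e12 and its two terms nearly cancel, so its individual coefficients do not appear to be well determined.

```python
import math

# Point estimates (a, b, c, d) copied from the caption of Fig. 2 (A).
FITS = {
    "Laverania": (1.918, -0.0003328, -5.435, -0.05285),
    "Simian": (2.41, -0.0002985, -39.41, -0.1012),
    "Haemamoeba": (-2.832, -0.04565, 2.018, -0.0003467),
    "HIPs": (-9.49, -0.06278, 2.357, -0.0006497),
}

def exp2(x, a, b, c, d):
    """Two-term exponential model a*exp(b*x) + c*exp(d*x)."""
    return a * math.exp(b * x) + c * math.exp(d * x)

def predicted_entropy(group, length_nt):
    """Entropy predicted by the fitted curve for an LCR of length_nt nucleotides."""
    return exp2(length_nt, *FITS[group])

# Example: fitted entropy of a 300-nucleotide LCR in the Laverania group.
laverania_300 = predicted_entropy("Laverania", 300)
```

In each of the four fits one term decays quickly and the other slowly, so for long LCRs the curve is dominated by the slowly decaying term.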
Fig. 3 Percentage representation of the CUB in the various parasitic subgroups. The amount of each codon was divided by the total amount of residues present in the LCRs of each parasite. (A) Laverania Subgenus (B) Simian Plasmodia (C) Haemamoeba Subgenus (D) Subgenus Vinckeia (E) HIPs. Each peak represents half of the bar it refers to.
Fig. 4 (A) Pairwise comparison of LCPs and nLCPs in each parasite. (B) Relationship between the length of the LCRs and the length of the host protein. (C) Pairwise comparison of LCPs, deprived of LCRs, and nLCPs. LCPs are represented in red. nLCPs are represented in green.
Fig. 5 Illustration of the ENC plots. Laverania Subgenus: (1) P. reichenowi (2) P. gaboni (3) P. adleri (4) P. praefalciparum (5) P. falciparum; Human Infectious Plasmodia: (1) P. malariae (2) P. ovale (3) P. o. curtisi; Vinckeia Subgenus: (1) P. berghei (2) P. yoelii (3) P. petteri (4) P. chabaudi (5) P. vinckei; Haemamoeba Subgenus: (1) P. gallinaceum (2) P. relictum. Globally, ENC distributions cluster in the left region of the ENC plane. LCPs are represented in red. nLCPs are represented in green.
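ENC plots are usually read against Wright's reference curve, which gives the ENC expected when codon usage is shaped by base composition alone: ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC content at third codon positions (GC3). The short Python sketch below evaluates that curve. It assumes that the abscissa of the plots is GC3, which the caption does not state, and the function name `expected_enc` is hypothetical.

```python
def expected_enc(gc3):
    """Wright's expected ENC for a gene whose codon bias is due to GC3 alone."""
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)

# Expected ENC for an AT-rich gene (GC3 = 0.2) and for a balanced one (GC3 = 0.5).
enc_at_rich = expected_enc(0.2)
enc_balanced = expected_enc(0.5)
```

Genes lying well below this curve show more codon bias than their base composition alone would predict.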
Fig. 6 Illustration of the ENC plots of Simian Plasmodia. We used the model f(x) = p1*x^2 + p2*x + p3. Best fit parameters (BFPs) are provided with their 95% confidence bounds. LCPs are in red, nLCPs in green. The best fit for nLCPs is in orange, the best fit for LCPs in light green.
(A) P. vivax: nLCPs’ BFPs: p1 = -160.3 (-164.3, -156.3), p2 = 146.5 (143.3, 149.7), p3 = 19.7 (19.13, 20.27); LCPs’ BFPs: p1 = -197.6 (-211.5, -183.7), p2 = 188.5 (175.5, 201.4), p3 = 10.18 (7.227, 13.14)
(B) P. knowlesi: nLCPs’ BFPs: p1 = -334.7 (-352.9, -316.5), p2 = 289.9 (274.4, 305.4), p3 = -7.969 (-11.24, -4.701); LCPs’ BFPs: p1 = -298.4 (-351, -245.8), p2 = 272.7 (232.2, 313.2), p3 = -5.609 (-13.32, 2.101)
(C) P. cynomolgi: nLCPs’ BFPs: p1 = -207 (-214.2, -199.8), p2 = 185.1 (179.6, 190.7), p3 = 12.28 (11.23, 13.32); LCPs’ BFPs: p1 = -259 (-291.5, -226.5), p2 = 245.9 (217.5, 274.3), p3 = -1.82 (-8.019, 4.379)
(D) P. inui: nLCPs’ BFPs: p1 = -268.7 (-284.1, -253.3), p2 = 253.2 (239.1, 267.3), p3 = -5.331 (-8.551, -2.111); LCPs’ BFPs: p1 = -320.5 (-363.3, -277.8), p2 = 310.3 (271.1, 349.6), p3 = -18.31 (-27.29, -9.327)
(E) P. fragile: nLCPs’ BFPs: p1 = -188.2 (-197.4, -179), p2 = 165.9 (158.1, 173.6), p3 = 17.42 (15.75, 19.09); LCPs’ BFPs: p1 = -416.8 (-476.9, -356.6), p2 = 379.3 (327.3, 431.2), p3 = -29.75 (-40.93, -18.57)
(F) P. coatneyi: nLCPs’ BFPs: p1 = -280.9 (-293.3, -268.6), p2 = 254.8 (244, 265.6), p3 = -3.065 (-5.406, -0.724); LCPs’ BFPs: p1 = -246.7 (-289.4, -203.9), p2 = 234 (198.5, 269.5), p3 = 0.6524 (-6.597, 7.902)
(G) P. gonderi
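To make the quadratic fits of Fig. 6 easier to compare, one can evaluate f(x) = p1*x^2 + p2*x + p3 and locate the vertex of each parabola, that is, the abscissa at which the fitted ENC is highest. The sketch below does this in Python for the P. vivax point estimates only. The names `VIVAX`, `quadratic` and `vertex` are assumptions, and the confidence bounds are ignored.

```python
# Point estimates (p1, p2, p3) for P. vivax, copied from Fig. 6 (A).
VIVAX = {
    "nLCPs": (-160.3, 146.5, 19.7),
    "LCPs": (-197.6, 188.5, 10.18),
}

def quadratic(x, p1, p2, p3):
    """Fitted model f(x) = p1*x^2 + p2*x + p3."""
    return p1 * x ** 2 + p2 * x + p3

def vertex(p1, p2, p3):
    """Return (x*, f(x*)), where x* = -p2 / (2*p1) is the extremum of the parabola."""
    x_star = -p2 / (2.0 * p1)
    return x_star, quadratic(x_star, p1, p2, p3)

x_n, enc_n = vertex(*VIVAX["nLCPs"])  # vertex of the nLCP fit
x_l, enc_l = vertex(*VIVAX["LCPs"])   # vertex of the LCP fit
```

Because p1 is negative in every reported fit, each vertex is a maximum of the fitted curve.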
Fig. 7 Pairwise (Welch t-test) comparison of the ENC distributions of LCPs and nLCPs in each parasite. As in the previous figures, LCPs are represented in red and nLCPs in green. The medians of each boxplot pair differ with 95% statistical significance (Mathworks).
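The Welch t-test named in the caption compares the means of two samples without assuming equal variances. The sketch below computes the Welch statistic and the Welch-Satterthwaite degrees of freedom with the Python standard library only. It is an illustration rather than the MATLAB routine used for the figure. The function name `welch_t` and the toy samples are assumptions, and turning the statistic into a p-value would still require the t distribution, which is not included here.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each mean (sample variance uses n - 1).
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df

# Toy samples (invented), standing in for ENC values of LCPs and nLCPs.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

With very unequal variances the degrees of freedom fall below na + nb - 2, which makes the test more conservative than Student's t-test.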
Fig. 8 Pairwise (Welch t-test) comparison of the SPI distributions of LCPs and nLCPs in each parasite. LCPs are represented in red, nLCPs in green. The medians of each boxplot pair differ with 95% statistical significance (Mathworks).
Fig. 9 Illustration of the SPI vs length analysis performed with Laverania parasites: (A) P.falciparum, (B) P.reichnowi, (C) P.praefalciparum, (D) P.alderi, (E) P.gaboni. LCPs are represented in red, nLCPs in green. We utilized an exponential model of the form a*exp(b*x) + c*exp(d*x). Each panel reports the best fit parameters (BFPs) a, b, c and d for LCPs and nLCPs. All the coefficients are provided with their 95% confidence bounds.
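A two-term exponential fit with 95% confidence bounds, as reported in this caption, can be sketched as below. This is an illustrative sketch under stated assumptions: it uses SciPy's `curve_fit` rather than the Mathworks tools the figures credit, the data are synthetic, and the "true" parameters and starting guess are hypothetical values chosen only to have the same order of magnitude as those in the figure. The bounds are the usual asymptotic approximation (estimate ± t × standard error).

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit


def two_term_exp(x, a, b, c, d):
    """Model a*exp(b*x) + c*exp(d*x)."""
    return a * np.exp(b * x) + c * np.exp(d * x)


# Synthetic SPI-vs-length data (assumption, NOT data from the paper).
rng = np.random.default_rng(1)
length = np.linspace(100, 10000, 400)
true_params = (0.5, -0.002, 0.1, -0.0001)  # hypothetical
spi = two_term_exp(length, *true_params) + rng.normal(0, 0.005, length.size)

# A starting guess near the expected scale helps the optimizer converge.
p0 = (0.5, -0.001, 0.1, -0.0001)
popt, pcov = curve_fit(two_term_exp, length, spi, p0=p0, maxfev=10000)

# Approximate 95% confidence bounds from the parameter covariance matrix.
dof = length.size - len(popt)
t_crit = stats.t.ppf(0.975, dof)
stderr = np.sqrt(np.diag(pcov))
lower = popt - t_crit * stderr
upper = popt + t_crit * stderr

for name, est, lo, hi in zip("abcd", popt, lower, upper):
    print(f"{name} = {est:.4g} ({lo:.4g}, {hi:.4g})")
```

The printed layout mirrors the "estimate (lower, upper)" convention used for the BFPs in the figure.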
Fig. 10 Pr2 plots. (A, B, C, D, E) Laverania Plasmodia; (F, G, H, I, L, M, N) Simian Plasmodia; (O, P) Haemamoeba Plasmodia; (Q, R, S, T, U) Vinckeia Plasmodia; (V, Y, Z) Human Infecting Plasmodia. The distance of a protein coding sequence (CDS) from the Pr2 center measures the extent to which Parity Rule 2 is violated. The further a CDS lies from the center, the more its contribution to the CUB can be attributed to Selective Pressure.
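How a CDS is placed on a Pr2 plot, and how far it lies from the center, can be sketched as follows. The sketch makes two assumptions not confirmed by this caption: the coordinates are computed from the third position of every codon (some analyses restrict this to fourfold degenerate codons), and the violation is measured as the Euclidean distance from the center (0.5, 0.5). The function names are hypothetical.

```python
import math


def pr2_coordinates(cds):
    """Return (G3/(G3+C3), A3/(A3+T3)) from third codon positions.

    Assumption: all codons are used, not only fourfold degenerate ones.
    """
    third = cds.upper()[2::3]  # every third base, starting at codon position 3
    counts = {base: third.count(base) for base in "ACGT"}
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    if gc == 0 or at == 0:
        raise ValueError("Pr2 coordinates undefined: no G/C or no A/T at third positions")
    return counts["G"] / gc, counts["A"] / at


def pr2_violation(cds):
    """Euclidean distance from the Pr2 center (0.5, 0.5); an assumed metric."""
    x, y = pr2_coordinates(cds)
    return math.hypot(x - 0.5, y - 0.5)


# Third positions A, T, G, C: perfectly balanced, so the CDS sits at the center.
balanced = "GCAGCTGCGGCC"
# Third positions A, A, G, G: maximal skew toward A and G.
skewed = "GCAGCAGCGGCG"

print(pr2_coordinates(balanced), pr2_violation(balanced))
print(pr2_coordinates(skewed), pr2_violation(skewed))
```

Under Parity Rule 2, A ≈ T and G ≈ C within a strand, so both coordinates equal 0.5 when the rule holds.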
Fig. 11 Pairwise comparison between Pr2 Violation distributions of LCPs and nLCPs represented in red and green, respectively. Medians in each pair differ with 95% confidence (Mathworks).
Fig. 12 Correlation between Pr2 data and protein length. The Pr2 plot is represented in the (X,Y) plane, while protein length (in nucleotides) is plotted on the vertical axis.
Tab. 1 The first column reports, for each of the 22 species, the Pearson correlation coefficient between the length of the proteins and the amount of LCR present in them. The second column reports the average Guanine and Cytosine (GC) content of each parasite. The third column reports the number of low complexity residues present within each organism as identified by SEG.
Organism Pearson-correlation coefficient GC content LC content (amino-acids count)
P.falciparum r = 0.70 0.23 130020
P.praefalciparum r = 0.71 0.23 125491
P.reichnowi r = 0.71 0.23 133878
P.alderi r = 0.67 0.22 105572
P.gaboni r = 0.70 0.22 102893
P.vivax r = 0.35 0.48 28753
P.knowlesi r = 0.26 0.40 19693
P.coatney r = 0.32 0.44 15273
P.cynomolgi r = 0.42 0.42 26101
P.inui r = 0.26 0.44 16109
P.fragile r = 0.22 0.44 19378
P.gonderi r = 0.39 0.26 31857
P.yoelii r = 0.33 0.24 29765
P.berghei r = 0.38 0.24 28753
P.petteri r = 0.21 0.27 11737
P.vinckei r = 0.16 0.26 11324
P.chabaudi r = 0.26 0.28 11265
P.gallinaceum r = 0.46 0.21 49707
P.relictum r = 0.37 0.21 39707
P.malariae r = 0.45 0.30 31876
P.ovale r = 0.43 0.36 43101
P.o.curtisi r = 0.30 0.32 32005
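The per-organism coefficient in the first column of Tab. 1 can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' procedure: the per-protein lengths and LCR residue counts below are hypothetical placeholders, and SciPy's `pearsonr` stands in for whatever tool was actually used.

```python
import numpy as np
from scipy import stats

# Hypothetical per-protein values for one organism (NOT data from the paper).
protein_length = np.array([250, 400, 800, 1200, 2500, 4000])  # residues
lcr_residues = np.array([0, 12, 30, 25, 140, 310])            # residues flagged as low complexity

# Pearson correlation between protein length and amount of LCR.
r, p_value = stats.pearsonr(protein_length, lcr_residues)
print(f"r = {r:.2f} (p = {p_value:.3g})")
```

Repeating this over the proteome of each species would yield one coefficient per row of the table.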
Tab.2 Correlation coefficients between content of each protein and their ENC scores.
Organism nLCP LCP nLCPs-slope LCPs-slope
P.falciparum r = 0.56 r = 0.65 43.9 55.4
P.praefalciparum r = 0.57 r = 0.62 46.8 51.5
P.reichnowi r = 0.62 r = 0.62 49.2 52.9
P.alderi r = 0.61 r = 0.60 47.4 48.1
P.gaboni r = 0.59 r = 0.53 44.2 41.7
P.yoelii r = 0.54 r = 0.52 23.4 44.5
P.berghei r = 0.57 r = 0.62 43.3 51.5
P.petteri r = 0.54 r = 0.47 41.4 41.5
P.vinckei r = 0.57 r = 0.54 41.9 43.8
P.chabaudi r = 0.60 r = 0.51 43.9 39.3
P.gallinaceum r = 0.42 r = 0.43 30.8 37.8
P.relictum r = 0.41 r = 0.41 37.7 44.5
P.malariae r = 0.68 r = 0.72 50.4 63.5
Tab. 3 ENC vs values. Simian Plasmodia
Organism nLCP LCP
P.vivax = 0.67 = 0.50
P.knowlesi = 0.23 = 0.47
P.inui = 0.20 = 0.35
P.cynomolgi = 0.56 = 0.34
P.coatney = 0.32 = 0.42