
PROPEPTIDES OF

EVOLVED SENSORS TO

EXPLOIT ORGANELLAR PH

Johannes Elferich, Danielle M. Williamson, Bala Krishnamoorthy¶, and Ujwal Shinde*

Department of Biochemistry and ,

Oregon Health and Science University, 3181 SW Sam Jackson Park Road,

Portland, OR, 97239, USA

¶ Department of Mathematics, Washington State University

Pullman, WA, 99164, USA

*Corresponding Author:

Ujwal Shinde, [email protected] Phone (503)-494-8683 Facsimile: (503)-494-8393

Running title: evolution to exploit pH

SUMMARY:

Eukaryotic cells maintain strict control over protein secretion, in part by utilizing the pH-gradient maintained within their secretory pathway. How eukaryotic proteins evolved from prokaryotic orthologs to exploit the pH-gradient for biological function remains a fundamental question in cell biology. We have previously demonstrated that protein domains located within precursor proteins, propeptides, encode histidine-driven pH-sensors to regulate organelle-specific activation of the eukaryotic proteases, and -1/3. Using bioinformatics, we analyzed over 10,000 unique proteases within evolutionarily unrelated families, and established that eukaryotic propeptides are enriched in histidines when compared to prokaryotic orthologs. On this basis, we propose that eukaryotic proteins evolved to contain histidines within cognate propeptides to exploit the tightly controlled pH-gradient of the secretory pathway, thereby directing activation within specific . Enrichment of histidine in propeptides may therefore be used to predict the presence of pH sensors in other proteases or even protease substrates.

HIGHLIGHTS:

• Histidine residues in propeptides act as pH-sensors in furin, a eukaryotic protease

• Histidine is enriched in eukaryotic, but not prokaryotic, propeptides

• Histidine enrichment is found in protein families unrelated to

• We propose histidine enrichment as an evolutionary mechanism to sense organellar pH

INTRODUCTION:

Eukaryotes are descendants of distinct prokaryotic cells that united symbiotically and subsequently evolved complex cellular compartments called organelles (Embley and Martin, 2006). Although both prokaryotes and eukaryotes are able to secrete proteins, only eukaryotes employ multi-compartmental secretory and endocytotic pathways. These pathways maintain a precise pH-gradient that acidifies from the endoplasmic reticulum (pH~7.2) to secretory vesicles (pH~5.5). This gradient provides the unique environmental conditions essential for the optimal structure and function of proteins within distinct biochemical pathways (Casey et al., 2010).

Since many secreted eukaryotic proteins have prokaryotic orthologs, how and when eukaryotic proteins evolved the ability to regulate their activity within different organelles is a central question germane to our understanding of protein trafficking. Comparing secreted eukaryotic proteins with their bacterial orthologs may potentially provide information about mechanisms by which protein activity is regulated during trafficking through the secretory pathway.

Proteases hydrolyze bonds and likely arose early during evolution as simple catabolic catalysts that generated residues in primitive organisms (Lopez-Otin and Bond, 2008). Due to their ubiquitous distribution within prokaryotes, eukaryotes, and archaea, the three domains of life, proteases are well suited for analysis of selective pressures that drove adaptation of eukaryotic proteins to the complex organelle trafficking system. Since uncontrolled has catastrophic consequences, cells appear to have evolved two distinct mechanisms that maintain protease activities under exquisite spatiotemporal control (Lopez-Otin and Bond, 2008). The first mechanism involves co-evolution of specific endogenous inhibitors, typically within compartments distinct from those containing active . The second mechanism involves proteases being synthesized as inactive precursors called zymogens, which become active by limited intra- or intermolecular proteolysis. In some cases the two regulatory mechanisms are combined; N-terminal propeptides co-evolved to facilitate folding of cognate catalytic domains and act as potent inhibitors after cleavage from the catalytic domain (Shinde and Inouye, 1993; Shinde and Thomas, 2011).

Subtilases, a ubiquitous super-family of serine proteases, represent an ideal group of homologs to analyze protein adaptation to eukaryotic organelles, since they exist in all three domains of life. Bacterial and mammalian proprotein convertase (PC) sub-families constitute the most extensively studied enzymes (Shinde and Thomas, 2011). Despite evolutionary divergence, proteins in these subfamilies display common folds with conserved catalytic triads. Subtilases are almost always expressed as zymogens, with amino and occasionally carboxy propeptide extensions. They are classified into two sub-families: Extracellular Serine Proteases (ESP) and Intracellular Serine Proteases (ISP) (Subbian et al., 2004). ESPs have 80-100 residue long propeptides that catalyze folding and act as inhibitors after cleavage, while ISPs have shorter propeptides that only act as inhibitors in the zymogen.

Catalytic domains and propeptides of mammalian PCs are closely related to protease domains of ESPs and not ISPs (Shinde and Thomas, 2011). Similar to bacterial ESPs, propeptides of PCs assist folding and require two ordered steps of proteolytic cleavage for activation. The two proteolytic cleavages are precisely controlled within different organelles. The first cleavage occurs rapidly after protein folding in the endoplasmic reticulum and results in a non-covalent complex between the propeptide and the catalytic domain. Activation requires an additional cleavage within the propeptide; in the case of furin this cleavage occurs only after the protein is trafficked to a different organelle, the trans-Golgi network (TGN) (Anderson et al., 2002). Other PCs are activated in a similar manner, but within different compartments (Seidah and Prat, 2012).

Experiments in vitro show that the pH of the TGN is sufficient to trigger the second activating cleavage of furin (Anderson et al., 1997) and that a histidine residue in the propeptide acts as a pH-sensor (Feliciangeli et al., 2006). We recently showed that propeptides of PCs mediate the pH of activation, as swapping propeptides between PCs reassigned the pH of activation (Dillon et al., 2012). We therefore hypothesized that propeptides of eukaryotic subtilases evolved to sense organelle pH in order to direct activation. Such a broad hypothesis is difficult to test experimentally, as it would require biochemical studies on a large number of proteins. We overcame this problem by predicting properties of protein sequences based on our hypothesis, and testing these against sequence databases using statistical methods. Histidine is the only residue with an intrinsic pKa near the physiological range (~6.5) and is therefore likely to be involved in pH-sensing mechanisms. In this paper we show that enrichment of histidines in propeptides correlates with the requirement to sense pH for activation within the subtilase family. Furthermore, we demonstrate similar enrichment in other protease families, indicating that enrichment of histidines in propeptides is a common mechanism to regulate activity in the secretory pathway.
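The argument that a side chain with a pKa of about 6.5 can report on the secretory pH-gradient can be made concrete with the Henderson-Hasselbalch relation. The following is a minimal sketch, not part of the analysis reported here: the function name is hypothetical, and it assumes an isolated, ideally titrating histidine, whereas the effective pKa of a residue within a folded protein depends on its local environment.

```python
def fraction_protonated(ph: float, pka: float = 6.5) -> float:
    """Fraction of a titratable group that is protonated at a given pH.

    Henderson-Hasselbalch: [HA] / ([HA] + [A-]) = 1 / (1 + 10**(pH - pKa)).
    The default pKa of 6.5 is the intrinsic histidine value quoted in the
    text; it is an idealization, not a measured value for any propeptide.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# pH values quoted in the Introduction for the two ends of the gradient.
er = fraction_protonated(7.2)        # endoplasmic reticulum, about 0.17
vesicle = fraction_protonated(5.5)   # secretory vesicles, about 0.91
print(f"ER: {er:.2f}, secretory vesicles: {vesicle:.2f}")
```

Under these assumptions, a histidine shifts from mostly neutral in the endoplasmic reticulum to mostly protonated in secretory vesicles, which is the property that would allow it to act as a pH-sensor along the pathway.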

RESULTS:

Propeptide sequences of subtilases are more divergent than cognate catalytic domains:

To identify conserved sequence elements unique to either prokaryotic or eukaryotic subtilases, we performed an evolutionary conservation analysis using the ConSurf server (Ashkenazy et al., 2010). The analysis of prokaryotic subtilisin and eukaryotic proprotein convertase families was initiated using sequences of Subtilisin E and /3 (PC1/3), respectively. The resulting conservation scores were mapped on the crystal structure of the propeptide:Subtilisin E complex (PDB: 1SCJ) and on a homology model of the propeptide:PC1 complex (based on PDB: 1P8J and 1KN6), respectively (Figure 1). Catalytic domains of eukaryotic and prokaryotic subtilases display a highly conserved core. In contrast, propeptides show less sequence conservation, with the dibasic cleavage motif at the C-terminus of eukaryotic propeptides representing the only conserved region.

Since histidine 69 was demonstrated to function as a pH sensor in furin (Feliciangeli et al., 2006), and given that propeptides of furin and PC1/3 alone are sufficient to impart organelle-specific pH-dependent activation of cognate catalytic domains (Dillon et al., 2012), we analyzed whether histidine residues demonstrate any sequence conservation within propeptides.

Although we could not identify absolutely conserved histidine residues in propeptides of eukaryotic subtilases, several positions in our alignment contain a histidine residue in a substantial fraction of sequences, especially at the position corresponding to histidine 69 in furin (53.3% of sequences). In contrast, prokaryotic subtilases, which do not traverse the secretory pathway, appear to encode fewer histidines within their propeptides. However, when catalytic sequences are compared, we find strictly conserved histidine residues within prokaryotic and eukaryotic sequences, and studies indicate that they play essential roles in catalysis or protein stability (Carter and Wells, 1987). Hence, biased enrichment for histidine residues appears localized within propeptides of eukaryotic subtilases.

The ConSurf analysis can only accommodate 150 sequences within each group, and a search initiated using Subtilisin E and PC1/3 may introduce a selection bias based on input sequences. Since subtilases encompass over 10,107 unique sequences, as per the database family PF00082 (Punta et al., 2012), we developed a robust analysis of histidine distribution within the available sequence data. Since PFAM employs a hidden Markov model of only the catalytic domain in subtilases to search sequence databases, this method avoids any selection bias for propeptide sequences. As PFAM families include only approximate demarcations for start and stop positions for catalytic domains, we made the following three suppositions to define propeptide and catalytic domain in each sequence: (i) the first 20 residues correspond to signal (Hebert and Molinari, 2007), (ii) residues between position 21 and the start of the catalytic domain correspond to the propeptide, and (iii) subtilases with propeptides shorter than 50 residues represent ISP-like sequences that employ different mechanisms for folding and activation (Subbian et al., 2004). This stringency provides a total of 6533 unique sequences from the PF00082 family for further analyses of a histidine bias.
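The three suppositions above can be sketched as a small filtering step. This is an illustrative sketch only, not the code used in this study: the function and type names are hypothetical, and the catalytic domain boundaries are assumed to be supplied as 1-based, inclusive positions, such as those reported for a PFAM family match.

```python
from typing import NamedTuple, Optional

SIGNAL_LENGTH = 20    # supposition (i): the first 20 residues are set aside
MIN_PROPEPTIDE = 50   # supposition (iii): shorter propeptides are ISP-like


class Regions(NamedTuple):
    propeptide: str
    catalytic: str


def split_regions(sequence: str, cat_start: int, cat_end: int) -> Optional[Regions]:
    """Split a precursor sequence into propeptide and catalytic domain.

    cat_start and cat_end are 1-based, inclusive boundaries of the
    catalytic domain (hypothetical inputs). Returns None when the
    sequence is excluded as ISP-like.
    """
    # Supposition (ii): residues 21 .. cat_start - 1 form the propeptide.
    propeptide = sequence[SIGNAL_LENGTH:cat_start - 1]
    if len(propeptide) < MIN_PROPEPTIDE:
        return None
    catalytic = sequence[cat_start - 1:cat_end]
    return Regions(propeptide, catalytic)
```

For example, a precursor whose catalytic domain starts at residue 81 yields a 60-residue propeptide (residues 21 to 80) and is retained, whereas one whose catalytic domain starts at residue 41 yields a 20-residue propeptide and is excluded.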

Increased histidine content in propeptides of subtilases correlates with requirement of pH-mediated activation:

We computed the abundance of histidine residues in propeptides ([His]Pro) and catalytic domains ([His]Cat) for all sequences that met the above criteria. For comparison, we calculated the difference in abundance in propeptides and catalytic domains within each protein (Δ[His] = [His]Pro – [His]Cat). A positive value of Δ[His] indicates abundance of histidines in propeptides, a negative value signifies abundance in catalytic domains, while near zero values imply equal distribution. While Δ[His] values in individual proteins may be subject to random fluctuations, the absence of any functional requirements would result in a distribution centered around zero. If histidine residues in propeptides are required for the experimentally observed function of sensing organelle specific pH, they would be selected during evolution, and one would expect statistical bias for positive Δ[His] only within eukaryotic subtilases and near zero or negative Δ[His] for prokaryotic subtilases.
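The definition of Δ[His] can be written out as a short sketch. The function names are hypothetical, and abundance is assumed here to be the percentage of residues in a region that are histidine, which is consistent with the percentage values reported below.

```python
def percent_his(region: str) -> float:
    """Histidine abundance of a region, as a percentage of its residues."""
    if not region:
        raise ValueError("region must contain at least one residue")
    return 100.0 * region.upper().count("H") / len(region)


def delta_his(propeptide: str, catalytic: str) -> float:
    """Delta[His] = [His]Pro - [His]Cat, in percentage points."""
    return percent_his(propeptide) - percent_his(catalytic)


# Toy sequences: 2 of 10 residues (20%) versus 1 of 10 residues (10%).
print(delta_his("HHAAAAAAAA", "HAAAAAAAAA"))
```

A positive result, as in the toy example, corresponds to a propeptide that is richer in histidine than its cognate catalytic domain.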

For initial assessments, we plotted Δ[His] on a phylogenetic tree generated by the PFAM database (Figure 2A). The tree is consistent with the homology groups defined by Siezen and coworkers (Siezen and Leunissen, 1997), with the largest clades representing subtilisin, , , and pyrolisin, as well as the later characterized family (Wlodawer et al., 2003). Four of these five families contain eukaryotic and prokaryotic proteins, suggesting these families diverged before speciation. Only the subtilisin family is exclusively found within prokaryotes. Interestingly, we observed that three of these four families display a predominantly positive Δ[His] in eukaryotes, but not in prokaryotes. Only show positive Δ[His] values in both prokaryotes and eukaryotes.

To validate that positive Δ[His] values are unique to eukaryotic sequences, we constructed a tree based on NCBI taxonomic classification and plotted Δ[His] within all subtilases (Figure 2B). The slightly negative mean values of Δ[His] in prokaryotic and archaeal proteins imply that there is no functional requirement for histidines in prokaryotic propeptides.

In contrast, we observe predominantly positive Δ[His] values in eukaryotes, with a mean value of 1.72%. This difference signifies a strong increase in histidine content in the propeptide compared to the catalytic domain. We observed positive Δ[His] values in all 3 kingdoms of higher eukaryotes. In bacteria, the difference of about -0.3% was consistent in the 3 most represented phyla. Interestingly, the phylum of Acidobacteria had a mean difference of 2.13%, comparable to eukaryotes. Although the above analysis provides a visual description, we wanted to analyze the statistical significance of the observed bias towards positive Δ[His] values within eukaryotes.

First, we plotted distributions of [His]Pro and [His]Cat for subtilases in prokaryotes and eukaryotes (Figure 2C). The catalytic domains in both species display a distribution centered on 2%, with eukaryotes having slightly higher [His]Cat values than prokaryotes, as expected from the average histidine content in the UniProt database. While the distribution of [His]Pro in prokaryotes is shifted towards lower values with several propeptides completely lacking histidines, the [His]Pro in eukaryotes is shifted to higher values, much greater than the catalytic domains. It is important to note that the distribution of [His]Pro in eukaryotes displays a much higher deviation than that for [His]Cat within both prokaryotes and eukaryotes, which is likely due to the shorter length of propeptides. When we investigated distributions for every amino acid we found that this enrichment exists only for histidine residues (Figure S1). To further analyze this bias we also investigated the distribution of Δ[His] (Figure 2D), which clearly demonstrates the differences in histidine bias in prokaryotes and eukaryotes. The Δ[His] distributions in both species are positively skewed, with median values of -0.56% (mean = -0.34%) and 1.5% (mean = 1.7%) for prokaryotes and eukaryotes, respectively. When differences in distribution for each individual amino acid were plotted (Δ[AA]), only cysteine displayed a difference between prokaryotes and eukaryotes similar to histidine (Figure S2). The cysteine bias is likely due to higher prevalence of disulfide bonds in eukaryotes than prokaryotes. To quantify this distribution difference between species we employed a non-parametric Mann-Whitney test (Table 1). For several amino acids, the test resulted in small p-values (<0.05), indicating statistically significant differences in Δ[AA] distributions between eukaryotes and prokaryotes. However, due to the large sample size even the tiniest differences can result in statistically significant p-values. A more meaningful, sample-size-independent measure of the difference in distribution can be obtained using effect sizes (U/mn) (Newcombe, 2006). These values vary between 0.0 and 1.0, and estimate the probability that a random sample of Δ[AA] in eukaryotes is larger than a random sample of Δ[AA] in prokaryotes. Equal distribution of Δ[AA] in both species would result in an effect size of 0.5. As seen in Figure 2E, histidine shows the highest deviation from 0.5, suggesting this bias is not due to pure chance. Apart from histidine, only cysteine deviates substantially (more than 0.15 units from 0.5), which is likely due to higher frequency of disulfide bonds in eukaryotes than prokaryotes. The fact that deviation from 0.5 in the effect size for histidine is considerably greater than that observed for cysteine suggests a biological significance for a histidine bias.
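The effect size U/mn described above can be illustrated by counting directly over all pairs of values. This is a minimal sketch with a hypothetical function name, not the implementation used for Table 1 or Figure 2E; counting ties as one half is a common convention assumed here, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def effect_size(euk: Sequence[float], prok: Sequence[float]) -> float:
    """Mann-Whitney U divided by m*n.

    Estimates the probability that a randomly drawn eukaryotic value is
    larger than a randomly drawn prokaryotic value. Ties contribute one
    half (an assumed convention).
    """
    if not euk or not prok:
        raise ValueError("both samples must be non-empty")
    u = 0.0
    for x in euk:
        for y in prok:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u / (len(euk) * len(prok))


# Toy values: every "eukaryotic" value exceeds every "prokaryotic" value.
print(effect_size([1.5, 2.0, 3.1], [-0.6, -0.3, 0.1]))
```

Identical distributions give a value of 0.5, and complete separation gives 0.0 or 1.0, independently of how many sequences are in each group.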

Since possible errors in database annotation and differences in length between propeptides and catalytic domains may result in a false-positive bias, we developed a test that is independent of the start annotation in the PFAM database. We calculated histidine content in a 20-residue sliding window from the beginning of the sequence to the end of catalytic domain for all sequences. After normalization as described in Methods, we averaged the resulting histidine content profiles for eukaryotic and prokaryotic proteins (Figure 2F). Eukaryotic but not prokaryotic proteins show an increase in histidine content in the first 100 residues, corresponding to the propeptide. Proteins from both species have increased histidine content at positions 200-250 along with a small increase at the C-terminus of the catalytic domain, likely due to presence of the catalytic histidine, and a conserved histidine at the C-terminus of the catalytic domain. Changes in length of the sliding window do not change the overall profile (Figure S3).
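The sliding-window calculation for a single sequence can be sketched as follows. The function name is hypothetical, and the sketch covers only the per-sequence profile; the normalization and averaging across sequences described in Methods are not reproduced here.

```python
from typing import List


def his_profile(sequence: str, window: int = 20) -> List[float]:
    """Percent histidine in each window of `window` consecutive residues.

    Element i of the result covers residues i .. i + window - 1 (0-based).
    Returns an empty list if the sequence is shorter than the window.
    """
    seq = sequence.upper()
    if window <= 0 or len(seq) < window:
        return []
    count = seq[:window].count("H")
    profile = [100.0 * count / window]
    for i in range(window, len(seq)):
        # Slide by one residue: add the entering residue, drop the leaving one.
        count += (seq[i] == "H") - (seq[i - window] == "H")
        profile.append(100.0 * count / window)
    return profile


# Toy example with a 2-residue window.
print(his_profile("HHAAHH", window=2))
```

Because the profile is computed from the beginning of the sequence, it does not depend on where the database places the start of the catalytic domain.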

To decipher correlations that may exist between the histidine bias and experimental evidence of pH-dependent activation, we analyzed histidine contents in propeptides and catalytic domains of individual proteins. For prokaryotes and archaea, we selected proteins that displayed “reviewed status” in the UniProt database, and for eukaryotic sequences all homologs in Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, and the model proteases cucumisin and Proteinase K/R (Figure 2G). While most bacterial proteins show comparable histidine content in propeptides and catalytic domain (approximately 2%), Kumamolisin and Xanthomonalisin display histidine content >4% in their propeptides. Consistent with our hypothesis, both proteins undergo activation at acidic pH in vitro (Oda et al., 1987; Oyama et al., 2002), which is not surprising because their hosts display optimum growth under acidic conditions. Since the intracellular pH within these cells is maintained near neutral, pH sensing is an ideal mechanism for discerning intracellular and extracellular environments. Hence, it is not surprising that the sedolisin family shows a histidine bias within propeptides of prokaryotic sequences, given that this family is optimized to function at low pH. Interestingly, Acidobacteria almost exclusively express sedolisins, which explains their positive Δ[His] values. On the other hand, all eukaryotic propeptides display histidine content above 4%, with the exception of Proteinase K/R and SKI-1. Recombinant expression of proteinase K/R in E. coli produces active protease (Gunkel and Gassen, 1989) and SKI-1 loses its propeptide in the ER (Seidah et al., 1999), suggesting these processes occur at neutral pH, relaxing the necessity for histidines.

In conclusion, our results demonstrate that the histidine bias in subtilase propeptides generally correlates with host species as it is present in eukaryotes but not in prokaryotes (Figure 2). These results are consistent with our hypothesis that during evolution subtilases responded to the requirement of regulating their activation as per the pH of their environment by encoding histidine residues in propeptides to sense pH and direct activation.

Cathepsin propeptides are enriched in histidine:

To investigate whether our hypothesis applies to other pH-activated, propeptide-dependent proteases, we analyzed histidine content in , a large family of cysteine peptidases found mainly in (Turk et al., 2012). Similar to subtilases, acidic pH initiates activation of cathepsins by propeptide proteolysis. Due to these parallels, we hypothesized that eukaryotic cathepsins should show a similar bias for histidine in their propeptides.

We used the PFAM family PF00112 to obtain sequences and analyzed them in a manner identical to subtilases. Figure 3A shows the phylogenetic tree for various cathepsins along with their Δ[His] values. While only a few cathepsin homologs exist in prokaryotes, this paucity is not due to exclusion of sequences based on our criteria. Although experimental data on prokaryotic cathepsins are scarce, we included them in the analysis for comparison. The two major well-studied lysosomal cathepsin families are and , both of which activate at low pH (Nishimura et al., 1988; Turk et al., 1993). Since no experimental data regarding activation of , , and plant cathepsins were found, we excluded them from further analysis. Nonetheless, the latter two cathepsins also display the Δ[His] bias.

While the cathepsin L family shows bias towards positive Δ[His] values, the cathepsin B family does not. The distributions of [His]Pro and [His]Cat in the cathepsin L family are similar to those in eukaryotic subtilases (Figure 3B). However, the cathepsin B family displays increased [His]Pro and [His]Cat values, leading to near-zero Δ[His] values (Figure 3C). Prokaryotic cathepsins show distributions similar to those of prokaryotic subtilases. The small number of prokaryotic sequences precludes a statistical comparison between species with robustness similar to subtilases.

We next applied the sliding window analysis to validate the increased histidine content in the propeptides of cathepsin L, and to map the specific location of increased histidine content in the sequence of cathepsin B (Figure 3D). Prokaryotic cathepsins showed low histidine content throughout the protein, with one peak between residues 250 and 300, which is due to the catalytic histidine. Consistent with our hypothesis, an additional increase in histidine content exists within the first 100 residues of cathepsin L. Interestingly, cathepsin B shows a moderate increase in histidine content within the first 100 residues compared to prokaryotes, but a substantial second peak corresponding to the occluding loop within the catalytic domain (Figures 3D and 3E). A comparison of the crystal structures of cathepsin L and cathepsin B (Figure 3E) shows that the catalytic domains of the two families are similar. However, the cathepsin B propeptide is truncated compared to cathepsin L, while the occluding loop in the catalytic domain is longer in cathepsin B. Moreover, the cathepsin B occluding loop in the catalytic domain extends into the region occupied by the cathepsin L propeptide to form direct contacts with the cathepsin B propeptide. Notably, histidines within the occluding loop of cathepsin B occupy similar spatial locations as histidine residues within the cathepsin L propeptide. This suggests that the pH-sensing capability in cathepsin B is encoded not only within the propeptide, but also in the occluding loop within the catalytic domain. Consistent with this prediction, experimental data demonstrate that the occluding loop interacts with the propeptide in a pH-dependent manner and that mutation of histidines in the occluding loop to alanine blocks activation (Quraishi et al., 1999). Moving pH-sensitivity from the propeptide into the catalytic domain provides an evolutionary advantage to cathepsin B by enabling it to switch between an endo- and exopeptidase in a pH-dependent manner (Illy et al., 1997). In summary, these results are consistent with our hypothesis, although subtle variations can exist within individual propeptide-dependent protease families.

Propeptides in the cytosolic family do not display a histidine bias:

Our hypothesis assumes that eukaryotic proteases require histidines in their propeptides to sense the pH of the secretory pathway. Therefore, proteases that are expressed and function in the cytosol would be expected to show no histidine bias within their propeptides. constitute the most prominent, propeptide-dependent cytosolic protease family. They are responsible for initiating apoptosis within eukaryotic cells (Creagh et al., 2003). Similar to subtilases and cathepsins, caspases are expressed as inactive zymogens and activated by proteolytic processing. Although apoptosis is linked with mild acidification of the cytosol, pH has not been shown to be important for triggering caspase activation.

We used the PFAM family PF00656 to obtain caspase sequences and processed them as described in Methods. The phylogenetic tree demonstrates that caspase homologs are found in metazoans, fungi, and plants (Figure 4A). We excluded metacaspases (homologs in fungi and plants) from our analysis because their propeptides contain histidine residues that are involved in zinc binding (Tsiatsiani et al., 2011). Metazoan caspases demonstrate increased [His]Cat values, while the [His]Pro values are similar to those of prokaryotic caspases (Figure 4B). Consistently, Δ[His] values were slightly smaller for eukaryotic proteins (Figure 4C). The sliding window analysis of prokaryotic and eukaryotic caspases shows that there is no substantial histidine enrichment in the N-terminal residues (Figure 4D). Overall these results are consistent with the assumption that the functional requirement of histidines in the propeptide is unique to proteases that need to sense pH to direct their activation.

DISCUSSION:

We report a correlation of increased histidine content in propeptides with the requirement to sense pH. But does a correlation imply causality? Histidine residues play multiple unique roles in proteins because they can (i) function as proton exchangers in catalysis (Dodson and Wlodawer, 1998), (ii) form complexes with soft metals such as iron and zinc (Andreini et al., 2009), (iii) provide unique hydrogen bonding geometry, and (iv) alter protein structure and interactions in a pH-dependent manner. Since propeptides are not part of the that mediates proteolysis, and because the propeptides analyzed in this study do not bind metal ions (Coulombe et al., 1996; Jain et al., 1998; Tangrea et al., 2002), one can exclude the first two roles. It is also unlikely that propeptides in eukaryotes have a different requirement for hydrogen bonding than their prokaryotic orthologs, thus endorsing their roles as pH-sensors as the most likely explanation for the observed histidine bias.

Histidine residues have been demonstrated to function as pH sensors within various proteins in prokaryotes and eukaryotes (Casey et al., 2010; Srivastava et al., 2007). In some cases such as Na+/H+ antiporters, pH-regulation is found in prokaryotic and eukaryotic orthologs (Slepkov et al., 2007). This is due to the common functional requirement to regulate intracellular pH. Since in the case of the protease families investigated here, pH-sensing seems to be unique to eukaryotic proteins (with the exception of sedolisins), one can surmise that this is indeed due to a functional requirement that is unique to eukaryotes, such as regulation within the secretory pathway.

It is important to note that electrostatic interactions within a can migrate during the course of evolution, even though their physiological functions such as pH sensing can be conserved. This may occur through mutations that introduce a redundant charge pair, which displays a pKa similar to the previous one, allowing the original charge to disappear during subsequent steps of evolution while maintaining or subtly modifying its titration properties, without requiring exquisite stereochemistry (Harrison, 2008). In fact, we have observed that the histidine residues in the propeptide of eukaryotic subtilases are not strictly conserved (Figure 1) but are spatially “unconstrained” within their sequence. Based on this finding, we preferred to analyze overall histidine content within propeptides and catalytic domains. Since histidine is among the least abundant amino acids within proteins (~2.3%), small changes in the number of histidines can significantly influence the overall content, especially in the rather short propeptides.

We selected three specific families for our analysis because we wanted proteins that have (i) a well-distributed phylogeny in prokaryotes and eukaryotes, (ii) propeptide domains necessary for chaperoning folding, (iii) the structures of both propeptides and catalytic domains solved, and (iv) activation pathways which are experimentally well characterized. Subtilases, cathepsins and caspases conform to the above requirements, and represent more than 10,000 protease sequences.

It is striking that the histidine bias is found in families that have no evolutionary relationship. Also, it is consistently found in subfamilies of subtilases, even though these likely diverged before the speciation of eukaryotes and prokaryotes. This suggests that different protease families may have reacted independently to the same evolutionary pressures to converge to similar solutions, and therefore represent examples of convergent evolution. We argue that there are two reasons why pH-sensing has independently evolved within propeptide domains. First, propeptides appear to be under less evolutionary pressure than cognate protease domains, which must maintain their catalytic function throughout evolution, necessitating preservation of the active site geometry. As a consequence, propeptides are more likely to develop new features that allow the modulation of the protease in different environments. Second, since catalytic domains must often work in diverse pH-environments, the introduction of elements that change protein conformation in a pH-dependent manner is likely to be detrimental for protein stability and function, and such elements are better “outsourced” to domains of the protein that are no longer present in the mature protein. The notable exception is the cathepsin B family, where the presence of the pH-sensitive occluding loop in the catalytic domain allows a pH-dependent change in protease characteristics, which might be important for the biological function (Nagler et al., 1997).

Further research must focus on the mechanisms by which histidines in propeptides mediate activation. Since not every histidine in a protein acts as a pH-sensor, determinants of pH-sensing other than the mere presence of histidines must exist. For example, the propeptides of furin and PC1/3 are both enriched in histidine, but they show differences in pH-sensitivity. How are such differences encoded within their sequences? Understanding the physical principles will also make it easier to predict the functional significance of histidines within sequences, an important challenge given the abundance of sequence data compared to experimental evidence.

While details of the mechanism of pH-sensing in propeptide-dependent proteases are unknown, we speculate that two general mechanisms are possible. First, protonation of histidines due to lower pH could destabilize the interaction between propeptide and protease domain, either by directly affecting charge or hydrophobic interactions, or by destabilizing the propeptide structure, which is required for binding. Since subtilases and cathepsins can autoactivate, the subsequent increase in the fraction of free protease leads to digestion of the free propeptide and thereby prevents rebinding. A second potential mechanism is that protonation of histidine leads to structural changes in the propeptides that make cleavage motifs within the propeptide accessible for active proteases. While a mixture of both mechanisms is likely responsible, we speculate that the second mechanism plays a prominent role, since a histidine within the core and not at the interface to the catalytic domain was essential for pH-mediated activation of furin (Feliciangeli et al., 2006). Also, the second mechanism was shown to regulate maturation of the dengue virus within the secretory pathway, where lowering the pH leads to drastic conformational changes in the capsid proteins, which expose cleavage sites to control capsid processing (Yu et al., 2008). In this case the substrate, and not the protease, controls activation, tempting one to speculate that such a mechanism may be found in other proteins processed by proteases within the secretory pathway.

Since histidine enrichment correlates with pH-mediated activation in subtilases and cathepsins, we speculate that it can be used to predict proteins that use a similar mechanism for activation. A list of all human proteins with annotated propeptides in the UniProt database that have more histidines in their propeptides than expected, assuming a probability for histidine of 2.3% (Table S1), includes 52 proteins that are either secreted or targeted to the secretory or endocytic pathway. This bias could be random or caused by other factors, such as zinc-binding sites, which could explain why metalloproteases like "A Disintegrin and metalloproteinase domain-containing protein" (ADAM) or matrix metalloproteases are frequent in the list. Nevertheless, we propose that members of that list use the pH of the secretory pathway to regulate their proteolytic activation. While the lack of knowledge about the activation and the functions of their propeptides, especially in prokaryotic homologs, did not allow us to include a detailed analysis, we note that several of these proteins show consistent enrichment of histidines in eukaryotic, but not prokaryotic, homologs (Table S2).

In summary, this study suggests a prominent role of the pH gradient in the secretory pathway in orchestrating the proteolytic processing of secreted proteins. Any disturbance in this gradient could therefore lead to dysregulation of protease activity. Dysregulation of proprotein convertases and cathepsins can have adverse effects and is associated with diseases like , atherosclerosis and Dent's disease (Reiser et al., 2010; Seidah and Prat, 2012). Since all these diseases are also associated with changes in cytosolic pH (Naghavi et al., 2002; Webb et al., 2011), studies are needed that address whether the secretory pH-gradient is also affected, and thus whether pH-dysregulation plays a role in disturbing the regulation of the secretory pathway.

EXPERIMENTAL PROCEDURE:

Conservation Analysis: Analysis of conserved residues was performed using the ConSurf server with standard settings (Ashkenazy et al., 2010). The crystal structure of the propeptide:subtilisin E complex (PDB: 1SCJ) was used as input for analyzing bacterial subtilases. For eukaryotic subtilases, the input was a homology model for the catalytic domain of PC1, based on the crystal structure of furin (PDB: 1P8J), with an NMR solution structure of the PC1 propeptide (PDB: 1KN6) docked onto the catalytic domain using the subtilisin structure as a reference. Results were analyzed and plotted using the UCSF Chimera package (Pettersen et al., 2004).

Data acquisition: The BioMart interface of the InterPro database (Hunter et al., 2012) was used to download UniProt sequence identifiers, start and stop positions, and taxonomy identifiers of annotations from the entries PF00082, PF00112, and PF00656 of the PFAM database for subtilases, cathepsins, and caspases, respectively (Punta et al., 2012; The UniProt Consortium, 2012). Protein sequences were downloaded from the UniProt database. Phylogeny was downloaded from the PFAM database, and taxonomy was obtained from the NCBI Taxonomy homepage.

Amino acid content calculations: Sequences with two annotated catalytic domains or those marked as deprecated in the UniProt database were discarded. The catalytic domains were defined as the sequences between the start and stop annotations, while propeptides were defined as the sequences between position 20 and the start annotations for subtilases and cathepsins. The first 20 residues were not included since they represent the signal peptide. Since caspases lack signal peptides, residues from position 1 to the start annotations were denoted as propeptides. Sequences with propeptides shorter than 50 residues or longer than 300 residues were discarded. For the remaining sequences the amino acid contents of the propeptides and catalytic domains, [AA]Pro and [AA]Cat, were calculated by dividing the number of occurrences of the amino acid AA in a sequence by the sequence length. The difference between [AA]Pro and [AA]Cat was calculated as Δ[AA].
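The content calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the 1-based inclusive treatment of the start and stop annotations, and the exact boundary of the propeptide (residues 21 to start-1) are assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Fraction of each of the 20 amino acids in a sequence segment."""
    counts = Counter(segment)
    return {aa: counts[aa] / len(segment) for aa in AMINO_ACIDS}


def delta_content(sequence, cat_start, cat_stop, signal_length=20,
                  min_pro=50, max_pro=300):
    """Return Delta[AA] = [AA]Pro - [AA]Cat for one sequence.

    cat_start and cat_stop are assumed to be 1-based, inclusive positions
    of the catalytic domain annotation. Use signal_length=0 for proteins
    without a signal peptide (caspases in the text above).
    Returns None when the propeptide length is outside [min_pro, max_pro].
    """
    # Propeptide: residues after the signal peptide, up to the domain start.
    propeptide = sequence[signal_length:cat_start - 1]
    catalytic = sequence[cat_start - 1:cat_stop]
    if not (min_pro <= len(propeptide) <= max_pro):
        return None
    pro = composition(propeptide)
    cat = composition(catalytic)
    return {aa: pro[aa] - cat[aa] for aa in AMINO_ACIDS}


# Toy sequence (hypothetical): 20-residue signal peptide, 60-residue
# propeptide with 6 His, 100-residue catalytic domain with 5 His.
toy = "M" * 20 + "H" * 6 + "A" * 54 + "A" * 95 + "H" * 5
delta = delta_content(toy, cat_start=81, cat_stop=180)
```

For the toy sequence, Δ[His] is 6/60 minus 5/100, i.e. an enrichment of five percentage points in the propeptide.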

Tree construction: NCBI taxonomy-based trees were constructed using taxonomy identifiers as input for the iTOL tree generator (Letunic and Bork, 2011) and adding each protein as a node of its species. Trees were plotted using the 'ape' package for the R statistical computing language (Paradis et al., 2004; R Core Team, 2012).

Statistical testing: A non-parametric Mann-Whitney test was performed to assess differences in the distribution of Δ[AA] between prokaryotes and eukaryotes using the R statistical computing language. The effect size was calculated as U/mn, by dividing the test statistic U by the product of the two sample sizes (Newcombe, 2006).
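The effect size U/mn can be read as the fraction of all (eukaryote, prokaryote) pairs in which the first value exceeds the second, with ties counted as one half. The sketch below computes it by direct pair counting in plain Python. It is an illustration rather than the R code used in the study; the function name is hypothetical, and taking U for the first sample is an assumption of this example. A p-value would come from a statistics package.

```python
def mann_whitney_effect_size(x, y):
    """Return (U, U/mn) for samples x and y.

    U counts the pairs (a, b) with a from x and b from y where a > b;
    ties contribute 0.5. U/mn is 0.5 when neither sample tends to be
    larger, and approaches 1 when x is almost always larger than y.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u, u / (len(x) * len(y))


# Hypothetical Delta[His] values (in %) for two small groups.
eukaryotes = [3.0, 4.0, 5.0]
prokaryotes = [1.0, 2.0, 3.0]
u_stat, effect = mann_whitney_effect_size(eukaryotes, prokaryotes)
```

Pair counting is quadratic in the sample sizes, which is still manageable for a few thousand sequences per group; a rank-based computation would be the usual choice for larger data sets.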

Sliding window analysis: For each sequence, the number of histidines, #His(i,k), in a window of length k starting at position i was counted for i ranging from 1 to n-k+1, where n is the length of the sequence. To account for different sequence lengths, the starting positions were normalized as follows:

$$\#\mathrm{His}_{\mathrm{norm}}(i,k) = \#\mathrm{His}\left(\frac{n}{\tilde{n}} \cdot i,\; k\right)$$

where $\tilde{n}$ is the median sequence length and the term $\frac{n}{\tilde{n}} \cdot i$ was rounded to the nearest integer.

For each position i, the $\#\mathrm{His}_{\mathrm{norm}}(i,k)$ values were averaged and then divided by k to obtain the average histidine content, #His(i), at that position. This method assumes that differences in length due to insertions and deletions are evenly distributed within the protein. Using a multiple sequence alignment for normalization would potentially account better for the positions of insertions, but the number of sequences and the low quality of the alignment, especially in the propeptide region, made that impractical for this study.
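The normalization can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the original implementation: the function names are hypothetical, ties in rounding are resolved upward, and normalized positions that map outside the valid window range of a sequence are skipped for that sequence, which is an assumption the text does not specify.

```python
import statistics


def window_counts(sequence, k):
    """#His(i, k) for window starts i = 1..n-k+1 (stored 0-based)."""
    count = sequence[:k].count("H")
    counts = [count]
    for i in range(1, len(sequence) - k + 1):
        # Slide the window by one residue: add the new one, drop the old one.
        count += (sequence[i + k - 1] == "H") - (sequence[i - 1] == "H")
        counts.append(count)
    return counts


def histidine_profile(sequences, k=20):
    """Average histidine content per length-normalized position."""
    n_med = int(statistics.median([len(s) for s in sequences]))
    sums = [0.0] * (n_med + 1)
    hits = [0] * (n_med + 1)
    for seq in sequences:
        n = len(seq)
        counts = window_counts(seq, k)
        for i in range(1, n_med + 1):
            j = int(i * n / n_med + 0.5)  # original window start, rounded
            if 1 <= j <= n - k + 1:
                sums[i] += counts[j - 1]
                hits[i] += 1
    # Mean count divided by k gives the fractional histidine content.
    return [sums[i] / hits[i] / k if hits[i] else None
            for i in range(1, n_med + 1)]


profile = histidine_profile(["HHAAAAAAAA"], k=2)
```

With a single ten-residue toy sequence and k = 2, the first window contains two histidines (content 1.0), the second one histidine (0.5), and later windows none.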

ACKNOWLEDGEMENTS:

FIGURE LEGENDS:

Figure 1: Propeptides are more divergent than cognate catalytic domains. Conservation scores mapped onto a ribbon representation of (A) subtilisin E and (B) PC1/3. Thick tubes represent high divergence at a position, while thin tubes represent conservation. Color indicates the percentage of sequences that encode a histidine residue at that position, from 0% (grey) to 100% (blue).

Figure 2: Histidines are enriched in propeptides of eukaryotic, but not prokaryotic, subtilases. (A) Phylogenetic tree of subtilases from the PFAM database. Bars on the outside indicate the Δ[His] value of each sequence. A black circle represents 0%. Bars pointing outward and inward represent positive and negative Δ[His] values, respectively. Dashed circles outside and inside of the solid black circle represent Δ[His] values of ±1%. Eukaryotic, prokaryotic, and archaeal sequences are colored red, blue, and green, respectively. Black arcs on the outside mark the clades of major subtilase subfamilies. (B) Tree based on the NCBI taxonomy classification. Annotation of Δ[His] and color coding are as above. Thick black arcs mark the three super-kingdoms of life, while thin arcs denote kingdoms of eukaryotes and phyla of prokaryotes. (C) Kernel density estimation of the distribution of [His]Pro and [His]Cat in prokaryotes and eukaryotes. (D) Kernel density estimation of the distribution of Δ[His] for prokaryotes and eukaryotes. (E) Effect size U/mn of the Mann-Whitney test for difference between the distributions shown in panel D, performed for all 20 natural amino acids. Figures S1 and S2 show the kernel density estimations for all amino acids. (F) Sliding window analysis of average histidine content in eukaryotic and prokaryotic subtilases using a window of 20 residues. The black dashed line indicates the average histidine content in the UniProt database. See methods for a detailed explanation of the normalization of the sequence length. Arrows indicate the relative positions of annotations for the end of the propeptide domain and the catalytic histidine residue according to subtilisin E and PC1/3. (G) Bar graph showing [His]Pro and [His]Cat values for selected subtilases. Blue, red, and green shades represent prokaryotic, eukaryotic, and archaeal sequences, respectively. Light shades indicate [His]Cat and dark shades indicate [His]Pro.

Figure 3: Histidine enrichment exists only in propeptide domains of the Cathepsin L family, while it is also present in the occluding loop of the Cathepsin B family. (A) Phylogenetic tree of cathepsins from the PFAM database. Bars on the outside indicate the Δ[His] value of each sequence. A black circle represents 0%. Bars pointing outward and inward represent positive and negative Δ[His] values, respectively. Dashed circles outside and inside of the solid black circle represent Δ[His] values of ±1%. Eukaryotic, prokaryotic, archaeal, and viral sequences are colored red, blue, green, and cyan, respectively. Black arcs on the outside mark the clades of major cathepsin subfamilies, with the cathepsin L family shown in green and the cathepsin B family shown in purple. (B) Kernel density estimation of the distribution of [His]Pro and [His]Cat in cathepsin L and B families and in prokaryotes. (C) Kernel density estimation of the distribution of Δ[His] in cathepsin L and B families and in prokaryotes. (D) Sliding window analysis of average histidine content in cathepsin L and B families and in prokaryotes using a window of 20 residues. The black dashed line indicates the average histidine content in the UniProt database. See methods for a detailed explanation of the normalization of the sequence length. Arrows indicate the relative positions of annotations for the end of the propeptide domain and the catalytic histidine residue according to cathepsin L and B, as well as the occluding loop in cathepsin B. (E) Structure superimposition of procathepsin L (PDB: 1BY8) and procathepsin B (PDB: 1MIR). The catalytic domains are shown in grey ribbon, while propeptides are shown in green and purple for cathepsin L and B, respectively. The occluding loop of cathepsin B is colored in orange and the corresponding loop in cathepsin L is colored green. The sidechains of histidine residues are depicted as stick representations. (F) A close-up of interactions between the occluding loop and the propeptide. Colors are as above. Structural depictions were created using the UCSF Chimera suite (Pettersen et al., 2004).

Figure 4: The cytosolic caspase family shows no histidine bias in propeptides. (A) Phylogenetic tree of caspases from the PFAM database. Bars on the outside indicate the Δ[His] value of each sequence. A black circle represents 0%. Bars pointing outward and inward represent positive and negative Δ[His] values, respectively. Dashed circles outside and inside of the solid black circle represent Δ[His] values of ±1%. Prokaryotic, metazoan, plant, fungal and other eukaryotic sequences are colored blue, yellow, cyan, purple and red, respectively. Black arcs on the outside depict the metazoan caspase and metacaspase families. (B) Kernel density estimation of the distribution of [His]Pro and [His]Cat in prokaryotes and metazoans, shown in blue and yellow, respectively. (C) Kernel density estimation of the distribution of Δ[His] in metazoan and prokaryotic caspases, shown in yellow and blue, respectively. (D) Sliding window analysis of average histidine content in metazoan and prokaryotic caspases using a window of 20 residues. See methods for a detailed explanation of the normalization of the sequence length. Arrows indicate the relative positions of annotations for the end of the propeptide domain and the catalytic histidine residue according to .

TABLE LEGENDS:

Table 1: Results of Mann-Whitney tests to evaluate differences in the distribution of Δ[AA] between eukaryotes and prokaryotes. For each amino acid the following numbers are reported: median of Δ[AA] for eukaryotes and prokaryotes, the test statistic U of the Mann-Whitney test, the resulting p-value, and the effect size U/mn. Sample sizes were 2156 and 4256 for eukaryotes and prokaryotes, respectively.

TABLES:

Residue   Median [%]    Median [%]     U         p               U/mn
          Eukaryotes    Prokaryotes
A         -2.01         -0.29          3484667   6.3 × 10^-56    0.38
V         -0.03         -0.13          4692108   1.4 × 10^-1     0.51
L          1.27          1.53          4494450   1.8 × 10^-1     0.49
I         -0.29         -0.61          4845335   2.4 × 10^-4     0.53
M         -0.32         -0.44          4781449   5.7 × 10^-3     0.52
F          0.53          0.06          5019341   7.3 × 10^-10    0.55
Y         -0.27         -1.00          5564109   3.6 × 10^-44    0.61
W         -0.46         -0.92          5529692   3.2 × 10^-41    0.60
S         -0.47         -0.18          4329195   2.2 × 10^-4     0.47
T         -1.36         -0.07          3616488   9.2 × 10^-44    0.39
N         -1.63         -1.86          4339854   3.9 × 10^-4     0.47
Q          1.11          1.49          4198344   2.6 × 10^-8     0.46
C         -1.25         -0.28          2881834   7.5 × 10^-132   0.31
G         -5.2          -5.3           4576112   8.7 × 10^-1     0.50
P         -0.67          0.04          3852582   8.5 × 10^-26    0.42
D         -0.25         -1.52          5644738   1.9 × 10^-51    0.62
E          2.85          2.51          4864907   7.7 × 10^-5     0.53
H          1.53         -0.56          7048731   1.6 × 10^-270   0.77
K          1.45          1.58          4356812   9.6 × 10^-4     0.47
R          1.86          0.95          5376156   2.2 × 10^-29    0.59

REFERENCES:

Anderson, E.D., Molloy, S.S., Jean, F., Fei, H., Shimamura, S., and Thomas, G. (2002). The ordered and compartment-specific autoproteolytic removal of the furin intramolecular chaperone is required for enzyme activation. J Biol Chem 277, 12879-12890.
Anderson, E.D., VanSlyke, J.K., Thulin, C.D., Jean, F., and Thomas, G. (1997). Activation of the furin endoprotease is a multiple-step process: requirements for acidification and internal propeptide cleavage. Embo J 16, 1508-1518.
Andreini, C., Bertini, I., Cavallaro, G., Najmanovich, R.J., and Thornton, J.M. (2009). Structural analysis of metal sites in proteins: non-heme iron sites as a case study. J Mol Biol 388, 356-380.
Ashkenazy, H., Erez, E., Martz, E., Pupko, T., and Ben-Tal, N. (2010). ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic acids research 38, W529-533.
Carter, P., and Wells, J.A. (1987). Engineering enzyme specificity by "substrate-assisted catalysis". Science 237, 394-399.
Casey, J.R., Grinstein, S., and Orlowski, J. (2010). Sensors and regulators of intracellular pH. Nat Rev Mol Cell Biol 11, 50-61.
Coulombe, R., Grochulski, P., Sivaraman, J., Menard, R., Mort, J.S., and Cygler, M. (1996). Structure of human procathepsin L reveals the molecular basis of inhibition by the prosegment. Embo J 15, 5492-5503.
Creagh, E.M., Conroy, H., and Martin, S.J. (2003). Caspase-activation pathways in apoptosis and immunity. Immunological reviews 193, 10-21.
Dillon, S.L., Williamson, D.M., Elferich, J., Radler, D., Joshi, R., Thomas, G., and Shinde, U. (2012). Propeptides Are Sufficient to Regulate Organelle-Specific pH-Dependent Activation of Furin and Proprotein Convertase 1/3. J Mol Biol 423, 47-62.
Dodson, G., and Wlodawer, A. (1998). Catalytic triads and their relatives. Trends Biochem Sci 23, 347-352.
Embley, T.M., and Martin, W. (2006). Eukaryotic evolution, changes and challenges. Nature 440, 623-630.
Feliciangeli, S.F., Thomas, L., Scott, G.K., Subbian, E., Hung, C.H., Molloy, S.S., Jean, F., Shinde, U., and Thomas, G. (2006). Identification of a pH sensor in the furin propeptide that regulates enzyme activation. J Biol Chem 281, 16108-16116.
Gunkel, F.A., and Gassen, H.G. (1989). Proteinase K from Tritirachium album Limber. Characterization of the chromosomal and expression of the cDNA in Escherichia coli. European journal of biochemistry / FEBS 179, 185-194.
Harrison, S.C. (2008). The pH sensor for flavivirus membrane fusion. The Journal of cell biology 183, 177-179.
Hebert, D.N., and Molinari, M. (2007). In and out of the ER: protein folding, quality control, degradation, and related human diseases. Physiological reviews 87, 1377-1408.
Hunter, S., Jones, P., Mitchell, A., Apweiler, R., Attwood, T.K., Bateman, A., Bernard, T., Binns, D., Bork, P., Burge, S., et al. (2012). InterPro in 2011: new developments in the family and domain prediction database. Nucleic acids research 40, D306-312.
Illy, C., Quraishi, O., Wang, J., Purisima, E., Vernet, T., and Mort, J.S. (1997). Role of the occluding loop in cathepsin B activity. J Biol Chem 272, 1197-1202.
Jain, S.C., Shinde, U., Li, Y., Inouye, M., and Berman, H.M. (1998). The crystal structure of an autoprocessed Ser221Cys-subtilisin E-propeptide complex at 2.0 A resolution. J Mol Biol 284, 137-144.
Letunic, I., and Bork, P. (2011). Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy. Nucleic acids research 39, W475-478.
Lopez-Otin, C., and Bond, J.S. (2008). Proteases: multifunctional enzymes in life and disease. J Biol Chem 283, 30433-30437.
Naghavi, M., John, R., Naguib, S., Siadaty, M.S., Grasu, R., Kurian, K.C., van Winkle, W.B., Soller, B., Litovsky, S., Madjid, M., et al. (2002). pH Heterogeneity of human and rabbit atherosclerotic plaques; a new insight into detection of vulnerable plaque. Atherosclerosis 164, 27-35.
Nagler, D.K., Storer, A.C., Portaro, F.C., Carmona, E., Juliano, L., and Menard, R. (1997). Major increase in activity of human cathepsin B upon removal of occluding loop contacts. Biochemistry 36, 12608-12615.
Newcombe, R.G. (2006). Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 1: general issues and tail-area-based methods. Statistics in medicine 25, 543-557.
Nishimura, Y., Kawabata, T., and Kato, K. (1988). Identification of latent procathepsins B and L in microsomal lumen: characterization of enzymatic activation and proteolytic processing in vitro. Archives of biochemistry and biophysics 261, 64-71.
Oda, K., Sugitani, M., Fukuhara, K., and Murao, S. (1987). Purification and properties of a pepstatin-insensitive carboxyl proteinase from a gram-negative bacterium. Biochimica et biophysica acta 923, 463-469.
Oyama, H., Hamada, T., Ogasawara, S., Uchida, K., Murao, S., Beyer, B.B., Dunn, B.M., and Oda, K. (2002). A CLN2-related and thermostable serine-carboxyl proteinase, kumamolysin: cloning, expression, and identification of catalytic serine residue. Journal of biochemistry 131, 757-765.
Paradis, E., Claude, J., and Strimmer, K. (2004). APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics 20, 289-290.
Pettersen, E.F., Goddard, T.D., Huang, C.C., Couch, G.S., Greenblatt, D.M., Meng, E.C., and Ferrin, T.E. (2004). UCSF Chimera--a visualization system for exploratory research and analysis. Journal of computational chemistry 25, 1605-1612.
Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., et al. (2012). The Pfam protein families database. Nucleic acids research 40, D290-301.
Quraishi, O., Nagler, D.K., Fox, T., Sivaraman, J., Cygler, M., Mort, J.S., and Storer, A.C. (1999). The occluding loop in cathepsin B defines the pH dependence of inhibition by its propeptide. Biochemistry 38, 5017-5023.
R Core Team (2012). R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing).
Reiser, J., Adair, B., and Reinheckel, T. (2010). Specialized roles for cysteine cathepsins in health and disease. The Journal of clinical investigation 120, 3421-3431.
Seidah, N.G., Mowla, S.J., Hamelin, J., Mamarbachi, A.M., Benjannet, S., Toure, B.B., Basak, A., Munzer, J.S., Marcinkiewicz, J., Zhong, M., et al. (1999). Mammalian subtilisin/kexin isozyme SKI-1: A widely expressed proprotein convertase with a unique cleavage specificity and cellular localization. Proceedings of the National Academy of Sciences of the United States of America 96, 1321-1326.
Seidah, N.G., and Prat, A. (2012). The biology and therapeutic targeting of the proprotein convertases. Nature reviews Drug discovery 11, 367-383.
Shinde, U., and Inouye, M. (1993). Intramolecular chaperones and protein folding. Trends Biochem Sci 18, 442-446.
Shinde, U., and Thomas, G. (2011). Insights from bacterial subtilases into the mechanisms of intramolecular chaperone-mediated activation of furin. Methods Mol Biol 768, 59-106.
Siezen, R.J., and Leunissen, J.A. (1997). Subtilases: the superfamily of subtilisin-like serine proteases. Protein science : a publication of the Protein Society 6, 501-523.
Slepkov, E.R., Rainey, J.K., Sykes, B.D., and Fliegel, L. (2007). Structural and functional analysis of the Na+/H+ exchanger. The Biochemical journal 401, 623-633.
Srivastava, J., Barber, D.L., and Jacobson, M.P. (2007). Intracellular pH sensors: design principles and functional significance. Physiology (Bethesda) 22, 30-39.
Subbian, E., Yabuta, Y., and Shinde, U. (2004). Positive selection dictates the choice between kinetic and thermodynamic protein folding and stability in subtilases. Biochemistry 43, 14348-14360.
Tangrea, M.A., Bryan, P.N., Sari, N., and Orban, J. (2002). Solution structure of the pro- convertase 1 pro-domain from Mus musculus. J Mol Biol 320, 801-812.
The UniProt Consortium (2012). Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucleic acids research 40, D71-75.
Tsiatsiani, L., Van Breusegem, F., Gallois, P., Zavialov, A., Lam, E., and Bozhkov, P.V. (2011). Metacaspases. Cell death and differentiation 18, 1279-1288.
Turk, B., Dolenc, I., Turk, V., and Bieth, J.G. (1993). Kinetics of the pH-induced inactivation of human cathepsin L. Biochemistry 32, 375-380.
Turk, V., Stoka, V., Vasiljeva, O., Renko, M., Sun, T., Turk, B., and Turk, D. (2012). Cysteine cathepsins: from structure, function and regulation to new frontiers. Biochimica et biophysica acta 1824, 68-88.
Webb, B.A., Chimenti, M., Jacobson, M.P., and Barber, D.L. (2011). Dysregulated pH: a perfect storm for cancer progression. Nature reviews Cancer 11, 671-677.
Wlodawer, A., Li, M., Gustchina, A., Oyama, H., Dunn, B.M., and Oda, K. (2003). Structural and enzymatic properties of the sedolisin family of serine-carboxyl peptidases. Acta biochimica Polonica 50, 81-102.
Yu, I.M., Zhang, W., Holdaway, H.A., Li, L., Kostyuchenko, V.A., Chipman, P.R., Kuhn, R.J., Rossmann, M.G., and Chen, J. (2008). Structure of the immature dengue virus at low pH primes proteolytic maturation. Science 319, 1834-1837.

Figure 1
[Image not reproduced. Panels A and B; labels: Catalytic Domain, Propeptide, C-terminus propeptide.]

Figure 2
[Image not reproduced. Panels A-G. Axes: [His] [%] vs. Density (C); Δ[His] [%] vs. Density (D); U/mn per amino acid (E); His content [%] vs. Residue, with markers for the end of the propeptide and the catalytic histidine of subtilisin E and PC1 (F); Histidine content [%], [His]Pro and [His]Cat, for selected subtilases (G).]

Figure 3
[Image not reproduced. Panels A-F. Axes: [His] [%] vs. Density (B); Δ[His] [%] vs. Density (C); Histidine content [%] vs. Residue, with markers for the end of the propeptide and the catalytic histidine of cathepsin L and B and the occluding loop of cathepsin B (D).]

Figure 4
[Image not reproduced. Panels A-D. Axes: [His] [%] vs. Density (B); Δ[His] [%] vs. Density (C); Histidine content [%] vs. Residue, with markers "Caspase 2: End of propeptide" and "Caspase 2: Catalytic histidine" (D).]

Figure S1
[Image not reproduced. One panel per amino acid; axes: Content [%] vs. Density; series: Bacteria Prot, Bacteria IMC, Eukaryota Prot, Eukaryota IMC.]

Figure S2
[Image not reproduced. One panel per amino acid; axes: Content Difference [%] vs. Density; series: Bacteria Prot, Bacteria IMC, Eukaryota Prot, Eukaryota IMC.]

Figure S3
[Image not reproduced. Five panels for windows of 10, 20, 30, 40, and 50 residues; axes: Histidine content [%] vs. Residue, with markers for the end of the propeptide and the catalytic histidine of subtilisin E and PC1.]

SUPPLEMENTAL TABLES:

TABLE S2:

                                 Prokaryotes                         Eukaryotes
Protein Name                     Propeptide  Catalytic  Δ(His)       Propeptide  Catalytic   Δ(His)
Cathepsin D/E Family             1.34        2.02       -0.68        2.92        1.15        +1.75
Carboxypeptidase Y               1.48        1.95       -0.47        2.16        2.01        +0.15
alpha-lytic protease             0.92        0.99       -0.07        1.61%       0.82%       +0.79
Legumain                         2.19        1.5        +0.69        8.37        4.4         +3.97
Lysosomal Acid Lipase            4.80        3.14       +1.66        5.00        3.36        +1.64
Lysosomal α-glucosidase
 factor VII
Cadherin-I                       0.81        1.16       -0.35        8.78        1.9         +6.88
β-Hexosaminidase                 1.53        3.26       -1.73        3.33        1.98 (α)    +1.35
                                 2.27        2.62       -0.35        3.11        2.13 (β)    +0.98
BMP4                                                                 6.22        5.11        +1.11
Platelet derived growth factor

SUPPLEMENTAL FIGURE LEGENDS:

Figure S1: Distributions of [AA]Pro and [AA]Cat for all 20 amino acids in eukaryotic and prokaryotic subtilases.

Figure S2: Distribution of Δ[AA] for all 20 amino acids in eukaryotic and prokaryotic subtilases.

Figure S3: Sliding window analysis of histidine content in eukaryotic and prokaryotic subtilases using different window sizes.

SUPPLEMENTAL TABLE LEGENDS:

Table S1: List of human proteins with histidine enrichment in their propeptides. All human proteins with annotated propeptides in the UniProt database that have more histidines in their propeptides than expected, assuming a 2.3% probability of histidine at each sequence position, with a significance level of 5%.
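The P(X>=k) column of Table S1 is consistent with a one-sided binomial tail: the probability of observing at least k histidines in a propeptide of a given length if each position is histidine with probability 2.3%. The sketch below is an illustration under that reading, not the original analysis code; the function name is hypothetical, and the histidine count k is reconstructed here from the tabulated length and percentage.

```python
def binomial_upper_tail(k, n, p=0.023):
    """P(X >= k) for X ~ Binomial(n, p).

    The probability mass function is built up iteratively from P(X = 0),
    which avoids large binomial coefficients for long propeptides.
    """
    pmf = (1.0 - p) ** n  # P(X = 0)
    tail = 0.0
    for i in range(n + 1):
        if i >= k:
            tail += pmf
        # Recurrence: P(X = i + 1) = P(X = i) * (n - i) / (i + 1) * p / (1 - p)
        pmf *= (n - i) / (i + 1) * p / (1.0 - p)
    return tail


# Furin entry of Table S1: propeptide length 83, histidine content 6.02%,
# which corresponds to 5 histidines.
furin_k = round(0.0602 * 83)
furin_p = binomial_upper_tail(furin_k, 83)
```

For the furin entry this gives a tail probability of about 0.043, in line with the tabulated 4.28E-02, so the entry passes the 5% significance level.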

Table S2: Histidine content of propeptide and catalytic domain of several protein families. Homologs in eukaryotes and prokaryotes were identified using BLAST. Propeptide regions were assigned by homology using multiple sequence alignments.

TABLE S1:

UniProt identifier   Name                                  Length propeptide   Histidine content   P(X>=k)
P12821               Angiotensin-converting enzyme         74                  6.76%               2.80E-02
O14672               ADAM10                                194                 7.73%               5.08E-05
O75078               ADAM11                                202                 4.46%               4.54E-02
Q9Y3Q7               ADAM18                                168                 4.76%               4.15E-02
Q9P0K1               ADAM22                                197                 6.09%               2.22E-03
O75077               ADAM23                                227                 4.85%               1.69E-02
Q9UKF2               ADAM30                                171                 4.68%               4.52E-02
Q9BZ11               ADAM33                                174                 5.17%               2.01E-02
O15204               ADAM-like protein decysin-1           175                 6.29%               2.61E-03
Q9H324               ADAM-TS10                             208                 5.29%               9.33E-03
P58397               ADAM-TS12                             215                 6.05%               1.59E-03
Q8TE57               ADAM-TS16                             255                 5.88%               9.60E-04
Q8TE60               ADAM-TS18                             237                 5.91%               1.34E-03
P59510               ADAM-TS20                             232                 4.74%               1.96E-02
Q9UKP5               ADAM-TS6                              223                 8.52%               1.31E-06
Q9P2N4               ADAM-TS9                              269                 4.09%               4.88E-02
O95972               Bone morphogenetic protein 15         249                 5.22%               5.55E-03
P12643               Bone morphogenetic protein 2          259                 5.79%               1.12E-03
P12644               Bone morphogenetic protein 4          273                 6.23%               2.38E-04
P18075               Bone morphogenetic protein 7          263                 5.70%               1.30E-03
P55287               Cadherin-11                           31                  12.90%              5.36E-03
Q13634               Cadherin-18                           29                  17.24%              4.82E-04
P12830               Cadherin-1                            132                 6.06%               1.17E-02
P14091               Cathepsin E                           34                  8.82%               4.29E-02
P09668               Cathepsin H                           85                  8.24%               3.52E-03
P43235               Cathepsin K                           99                  6.06%               2.71E-02
P25774               Cathepsin S                           98                  8.16%               1.97E-03
Q6YHK3               CD109 antigen                         25                  12.00%              1.92E-02
P0CG37               Cryptic protein                       65                  7.69%               1.70E-02
Q14126               Desmoglein-2                          26                  11.54%              2.13E-02
P12259               Coagulation factor V                  836                 3.59%               1.27E-02
P02765               Alpha-2-HS-glycoprotein               40                  12.50%              2.17E-03
P09958               Furin                                 83                  6.02%               4.28E-02
O60383               Growth/differentiation factor 9       295                 4.41%               2.04E-02
P07686               Beta-hexosaminidase subunit beta      79                  6.33%               3.58E-02
P55103               Inhibin beta C chain                  218                 4.59%               3.07E-02
P58166               Inhibin beta E chain                  217                 5.07%               1.25E-02
P51460               -like 3                               47                  14.89%              9.55E-05
P19827               ITI heavy chain H1                    246                 4.47%               2.84E-02
Q99538               Legumain                              110                 9.09%               2.40E-04
P09848               Lactase-phlorizin hydrolase           847                 3.78%               5.15E-03
P10253               Lysosomal alpha-glucosidase           42                  9.52%               1.56E-02
P14151               L-selectin                            10                  20.00%              2.11E-02
Q9NRE1               Matrix metalloproteinase-26           72                  8.33%               6.34E-03
P16519               PCSK2                                 84                  8.33%               3.29E-03
P29122               PCSK6                                 86                  5.81%               4.86E-02
P01127               PDGF subunit B                        112                 5.36%               4.53E-02
Q96B86               Repulsive guidance molecule A         147                 5.44%               2.10E-02
P10600               Transforming growth factor beta-3     280                 5.00%               5.96E-03
Q9BZD6               Proline-rich Gla protein 4            32                  9.38%               3.68E-02
Q8N2E6               Prosalusin                            163                 5.52%               1.37E-02
O43915               Vascular endothelial growth factor D  216                 6.02%               1.66E-03