www.nature.com/scientificreports

Transmembrane Domain Lengths Serve as Signatures of Organismal Complexity and Viral Transport Mechanisms

Snigdha Singh & Aditya Mittal

Received: 08 October 2015. Accepted: 12 February 2016. Published: 01 March 2016.

Membrane proteins are known to be important in various secretory pathways, with their transmembrane domains (TMDs) possibly acting as sorting determinants. One key aspect of TMDs associated with various “checkposts” (i.e. organelles) of intracellular trafficking is their length. To explore possible links between organismal “complexity” and differences in TMD lengths of membrane proteins associated with different organelles (such as Endoplasmic Reticulum, Golgi, Endosomes, Nucleus, Plasma Membrane), we analyzed ~70000 membrane protein sequences in over 300 genomes of fungi, plants, non-mammalian vertebrates and mammals. We report that as we move from simpler to more complex organisms, variation in organellar TMD lengths decreases, especially compared to their respective plasma membranes. This suggests an evolutionary pressure in modulating length of TMDs of membrane proteins with increasing complexity of communication between sub-cellular compartments. We also report functional applications of our findings by discovering remarkable distinctions in TMD lengths of membrane proteins associated with different intracellular transport pathways. Finally, we show that TMD lengths extracted from viral proteins can serve as somewhat weak indicators of viral replication sites in plant cells but very strong indicators of different entry pathways employed by animal viruses.

Proper subcellular localization of membrane proteins into different organelles/membrane-environments is one of the keys to survival and propagation of living cells. Thus, identification of subcellular localization of novel membrane proteins in eukaryotic cells is an important step towards understanding their role in the life of a cell. Several years ago, transmembrane domains (TMDs) of integral membrane proteins were shown to contain factors that control their sorting in secretory pathways1–3. While there is general consensus that cytosolic sorting signals determine intracellular trafficking pathways and the subsequent subcellular location of membrane proteins, the role of their own TMDs in sorting of these membrane proteins has started emerging prominently4. In fact, TMD-dependent sorting has been proposed to be more significant than cytosolic sorting signals for membrane proteins5. In this regard, it is interesting to note that around the time the role of TMDs in the secretory pathway was demonstrated, it was reported that membrane proteins with TMDs shorter (on average) by five amino acids than TMDs of plasma membrane proteins were retained in the cisternae of the Golgi complex6. Additional studies in yeast showed that proteins with longer TMDs were targeted to the plasma membrane while those with shorter TMDs were targeted to the vacuole7–9. More recently, some studies reported that proteins with very short TMDs are specifically targeted for endocytosis by clathrin-coated vesicles10. Interestingly, an elegant proteomic analysis carried out by Munro and colleagues led to strong insights on the difference in TMD lengths of membrane proteins associated with different organelles along secretory pathways in fungal and vertebrate cells11. The conceptual simplicity of TMD lengths of intracellular membrane proteins serving as signatures for their respective intracellular locations/organelles in fungi and vertebrates is very appealing. If this finding can be generalized/extrapolated to all living systems, it would be very promising for obtaining mechanistic insights into intracellular trafficking using proteomics, especially because relatively straightforward algorithms exist for detection and prediction of TMDs in known protein sequences, a very large majority of which are yet to be structurally resolved. Thus, in this work, we carried out a comprehensive TMD length analysis on ~70000

Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. Correspondence and requests for materials should be addressed to A.M. (email: [email protected])

Scientific Reports | 6:22352 | DOI: 10.1038/srep22352 1 www.nature.com/scientificreports/

membrane protein sequences corresponding to 301 genomes of fungi, plants, non-mammalian vertebrates and mammals. The first remarkable result we report in this work, while confirming the original findings of Sharpe et al.11 for fungi and vertebrates, is a decrease in variation of TMD lengths of membrane proteins with increasing organismal complexity (i.e. the difference in TMD lengths of organellar membrane proteins and plasma membrane proteins decreases as we move from “simpler” to “complex” organisms). This result provides a profound insight into increased sharing/exchange of intracellular and plasma membrane components with increasing organismal complexity over time. Having confirmed and substantially generalized the finding that TMD lengths indeed serve as signatures for subcellular locations in different eukaryotic systems, we decided to further explore the scope of our findings from an extremely important applicational perspective. Viruses in both animals and plants, regardless of presence12–16 or absence of a membrane envelope, rely heavily on intracellular trafficking and sorting17 mechanisms of their host cells, including viral replication associated with host intracellular membranes18. Mechanistic insights into viral entry and exit mechanisms provide strong avenues for controlling their pathogenic activity. Therefore, we analyzed protein sequences from 34 different viruses (19 infecting animal cells and 15 infecting plant cells) and extracted their TMD lengths by using the methodology developed in this work. The key hypothesis to test was that TMD lengths of viral proteins may provide signatures of subcellular locations (i.e. membrane/organellar) of internalization, secretory or replication pathways associated with their life cycles in respective host cells.
We specifically analyzed experimentally determined sequences that are known to play a key role in viral entry (in the case of animal viruses) and replication (in the case of plant viruses) in their respective hosts, thus serving as strong experimentally determined controls for our analyses. With this approach, we report that TMD lengths of viral proteins do indeed serve as signatures for important host cell “checkposts” (i.e. membrane/organellar host cell locations) involved in life cycles of both animal and plant viruses. To our knowledge, this is a first-of-a-kind study, especially involving plant viruses along with several animal viruses, showing that TMD lengths alone (independent of actual primary sequences) of viral proteins serve as signatures of subcellular locations important in viral life cycles. This work opens up a very promising avenue for designing experiments aimed at interfering with viral transport mechanisms for both animal and plant viruses using a relatively straightforward, yet rigorous and somewhat computationally economical, analytical approach.

Results

Evaluation of TMDs of single span membrane proteins localized to their associated compartments i.e. Golgi, Endoplasmic reticulum, Nucleus, TGN/endo and Plasma Membrane from different organisms (Fungi, Plants, Non-Mammalian Vertebrates and Mammals). To compare the TMDs from different organelles we extracted bitopic proteins (i.e. with only one TM helix) from all membrane proteins of eukaryotic genomes. Bitopic proteins include many important families of receptors, many of which are fruitful targets for biopharmaceuticals; in addition, mutations in bitopic proteins are a frequent cause of various human diseases such as cancer or developmental disorders.
For our analysis we collected reference proteins from the best characterized eukaryotic genomes, Saccharomyces cerevisiae (Supplementary Table S1), Arabidopsis thaliana (Supplementary Table S2), Gallus gallus (Supplementary Table S3) and Homo sapiens (Supplementary Table S4). To expand our dataset we used BLAST to include orthologs. Here, it needs to be emphasized that if orthologous proteins were very similar to the reference proteins, and to each other, then our analyses would be biased and probably less meaningful. Therefore, to avoid such bias in analyzing membrane protein sequences, we used BLASTClust to cluster these proteins on the basis of their TMDs and flanking sequences, finally selecting from each organelle set only those proteins which were not very similar (less than 30% identity) to each other (Fig. 1). For the first screening to identify membrane proteins, hydrophobicity (also called hydropathy) profiles with an averaging window size of 18 were scanned; the actual TMD identification was done after a computationally intensive analysis that refined the hydropathy profile data in an unbiased manner using a scanning and alignment algorithm developed in this work (see Methodology for details). The accuracy of the developed computational method was rigorously tested by reproducing the results obtained by Sharpe et al.11.
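The first-pass hydropathy screen described above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the scanning and alignment algorithm developed in this work: it uses the Kyte-Doolittle scale rather than GES, the function name `hydropathy_profile` is hypothetical, and the example sequence is an artificial toy.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
# Assumption: this scale is used here only for illustration.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=18):
    """Mean hydropathy of every full window along the sequence.

    Returns an empty list when the sequence is shorter than the window.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Artificial toy sequence: polar flank, 20 leucines, polar flank.
toy_sequence = "KDESKDESKD" + "L" * 20 + "KDESKDESKD"
profile = hydropathy_profile(toy_sequence)

# Index of the first window with the highest mean hydropathy.
peak_start = max(range(len(profile)), key=profile.__getitem__)
print(peak_start, round(profile[peak_start], 2))
```

Windows that lie entirely inside the hydrophobic stretch give the highest mean, which is what a rough screen for candidate TMDs would look for before any refinement.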

Mean Lengths of TMDs. It is important to note that the hydrophobicity graph shown in Fig. 1 represents the averages of hydrophobicity profiles of all the helices; with this calculation the hydrophilic-hydrophobic crossing point (e.g. at 0 kcal/mol) reflects the dominant value rather than the outliers. Specifically, for example, the lengths of TMDs obtained by counting from 0 to the tip of the arrows shown in Fig. 1 represent the most common values of the graphs (the modes), rather than the means. However, for meaningful interpretation of comparative TMD lengths it is important to analyze and compare the mean, rather than only the mode, since it is imperative to analyse membrane proteins in each organelle class as a whole, rather than only the most dominant sub-population of membrane proteins. Thus, the mean hydrophobic lengths of TMDs, for each of the organelles in each of the organisms analyzed, are shown in Table 1. Recently Sharpe et al.11 showed that, on average, the hydrophobic length for plasma membrane TMDs is larger than those of ER and Golgi in both fungi and vertebrates. While the data of Sharpe et al.11 is a subset of the data analyzed by us, it is not only interesting that our larger data set on fungi confirms their findings, it is also remarkable that those findings still hold true for organellar TMDs vs. plasma membrane TMDs when data from plants is analyzed (Fig. 2A,B). Even more remarkable is the fact that when we analyze much larger data sets for vertebrates, and parse them into non-mammalian and mammalian, Golgi TMDs are found to be shorter than plasma membrane TMDs (Fig. 2C,D). Thus, while confirming earlier findings, our analysis substantially generalizes the conclusions reached by Sharpe et al.11 in terms of variety of organisms. An important aspect with respect to the possible common cellular origins of all organisms, regardless of the level of their complexity at the whole organism level, was also confirmed by us: Fig. 2E shows the distribution plots of lengths of plasma membrane TMDs of different organisms. Figure 2F expresses the same results (i.e. those of Fig. 2E) in terms of mean ± std as a bar


Figure 1. A Sequence-Spatio-Energy-based methodology for trans-membrane domain (TMD) analysis. Trans-membrane proteins of known topology and location were identified from literature and databases for different organisms (fungi, plants, non-mammalian vertebrates and mammals). The reference species for each organism is shown in parentheses below the organism name. Orthologous proteins were identified using BLAST searches. Subsequent to prediction of protein orientation, a “rough” screening for TMDs and assignment of up to eight flanking residues on both sides of TMDs was done in all sequences. Then, BLASTClust was used to remove sequence redundancy by collecting protein sequences which were not very similar (< 30% identity) to the respective reference protein, and to each other, in order to cover a wide range of protein sequences. Finally, refinement of TMDs was done in which all the protein sequences from a dataset (for example from an organelle of an organism) were aligned at the positions where a sharp change in hydrophobicity (hydropathy) occurred. The cytosolic end of the hydrophobic region was assumed as position one and the hydrophobic spans were aligned from the cytosolic side to the exoplasmic side. Arrows connecting each step are also marked with the number of sequences involved in that step, and a description of whether the step involved sequence-based, spatially-based or energy-based analyses, or a combination of any of these.

graph. It can be clearly seen from Fig. 2E,F that plasma membrane TMDs are similar across all the organisms analyzed by us. This is a clear indication of a common cellular evolution with respect to the cellular boundaries regardless of organismal complexity.


Organism | ER | Golgi | TGN/endo/Nucleus | PM
Fungi | 22.60 ± 6.4 (n = 2446) | 20.25 ± 5.8 (n = 4686) | 23.28 ± 5.7 (n = 865) | 24.80 ± 8.3 (n = 785)
Plants | 22.80 ± 6.7 (n = 1813) | 23.80 ± 7.7 (n = 3197) | 19.40 ± 7.8 (n = 801) | 24.47 ± 5.3 (n = 2519)
Non-Mammalian Vertebrates | 23.13 ± 7.1 (n = 327) | 19.86 ± 5.9 (n = 823) | 24.26 ± 6.4 (n = 429) | 24.00 ± 6.8 (n = 1072)
Mammals | 23.65 ± 6.1 (n = 1082) | 21.44 ± 6.9 (n = 1657) | 25.90 ± 7.4 (n = 644) | 24.95 ± 6.4 (n = 3113)

Table 1. Mean hydrophobic lengths of TMDs of different organelles.

Differences in TMD lengths among different organelles in fungi, plants, non-mammalian vertebrates and mammals. As mentioned earlier, our findings show that Golgi TMDs are shorter than plasma membrane TMDs (Fig. 2), confirming earlier conclusions11, but with a much larger dataset using the GES hydrophobicity scale. To further ensure analytical rigor of the results, and that these findings are not “hydrophobicity-scale-dependent”, we confirmed that the same conclusions are reached from hydrophobicity graphs plotted on the basis of the Kyte-Doolittle scale (Supplementary Fig. S1). Below we discuss the differences in TMD lengths among different organelles for different organisms, with distribution profiles shown in Fig. 3A–D.

Fungi. We observed that TMDs of the plasma membrane are longer than those of all the other organelles, not just the Golgi (Supplementary Fig. S2A). Analysis of amino acid composition of TMDs from the cytosolic to the exoplasmic side for different organelles in fungi indicates that the regions abundant in hydrophobic residues are larger in TGN/endo and plasma membrane proteins compared to ER and Golgi proteins, showing a difference in TMD length (Supplementary Figs S3A and S4). Additionally, we found that hydrophobic residues occupy smaller regions along the TMDs in Golgi than in plasma membrane (Supplementary Fig. S5). While the distribution plots of TMD lengths of all organelles in fungi, shown in Fig. 3A, clearly indicate distinct (and shorter) TMD lengths compared to plasma membranes (black distribution in Fig. 3A), rigorous statistical analyses shown in Table 2, i.e. small “p” values in t-tests and Kullback-Leibler Divergence Measure (KLDM) values, confirm that the differences in the TMD lengths of all the organelles are highly significant, especially w.r.t. plasma membrane, and from each other.

Plants. We observed from hydrophobicity graphs of plants that the lengths of TMDs of Golgi and ER were almost the same, but the lengths of TMDs of Mitochondria, Nucleus, Peroxisomes, Chloroplasts and Plasma membrane were different (Fig. 2B and Supplementary Fig. S6). Distributions of TMD lengths in plants clearly indicated longer TMD lengths for plasma membranes, as shown in Fig. 3B (corresponding mean ± std is shown as a bar graph in Supplementary Fig. S2A). Analysis of amino acid composition of TMDs from different organelles in plants indicates that the regions abundant in hydrophobic residues are almost similar in ER and Golgi membrane proteins, smallest in nuclear membrane proteins and largest in plasma membrane proteins (Supplementary Figs S3B and S4). Additionally, we observed in our abundance graphs that the area of nuclear membrane TMDs enriched in hydrophobic residues is smaller than that of plasma membrane TMDs (Supplementary Fig. S5). As done earlier, rigorous statistical analyses shown in Table 2, i.e. small “p” values in t-tests and Kullback-Leibler Divergence Measure (KLDM) values, confirm that the differences in the TMD lengths of all the organelles are highly significant, especially w.r.t. plasma membrane, and from each other (except ER and Golgi).

Non-mammalian vertebrates. In non-mammalian vertebrates, we found that there is a significant difference in TMD lengths of ER and Golgi, but the difference between TGN/endo and plasma membranes was not significant (Fig. 2C, Table 2). Distribution plots of TMD lengths of all the organelles show relatively less distinct profiles (Fig. 3C) and the differences in the TMD lengths of all the organelles are much lower as compared to those observed in fungi and plants (Table 2). Analysis of amino acid composition of TMDs from different organelles in non-mammalian vertebrates indicates that the regions abundant in hydrophobic residues are somewhat similar in TGN/endo and in plasma membrane proteins, and that in ER and Golgi proteins these regions are smaller than in TGN/endo and plasma membrane proteins (Supplementary Figs S3C and S4). In abundance graphs of non-mammalian vertebrates, we observed that hydrophobic residues occupy almost the same area along the TMDs in plasma membrane proteins as in TGN/endo membrane proteins (Supplementary Fig. S5). Thus, in non-mammalian vertebrates, TMD lengths are similar for TGN/endo and plasma membranes, and these are longer than the TMDs of other organelles. Interestingly, the overall differences between the plasma membrane TMD lengths and other organelles are much lower compared to fungi and plants.

Mammals. Mean hydrophobic lengths of TMDs of endoplasmic reticulum, endosomes and plasma membrane were found to be similar in mammals, indicating that the thickness of bilayers of these organelles is similar, in contrast to other organelles of fungi, plants and non-mammalian vertebrates. However, there is still a significant difference (even though less significant than in other organisms) among these organelles (Tables 1 and 2). This can also be seen from the distribution plots of TMD lengths of all the organelles (Fig. 3D). Analysis of amino acid compositions of TMDs from different organelles at their different positions, i.e. from cytosolic side to exoplasmic side, in mammals indicates that the regions abundant in hydrophobic residues are almost similar in ER,


Figure 2. Differences in hydrophobicity profiles of TMDs of all the organelles in fungi, plants, non-mammalian vertebrates and mammals (GES Scale). (A) Analysis of hydrophobicity profiles of TMDs of proteins from different organelles (Golgi, Endoplasmic Reticulum (ER), TGN/endo and Plasma Membrane (PM)) along the secretory pathway in fungi. This figure is directly inspired by the work of Sharpe et al.11; it was essential to reproduce their earlier results with our extended data set. Therefore, axes and color coding (for organelles) are retained, and the figure serves as a strong positive (computational) control for the methodology developed in this work. (B) Analysis of hydrophobicity profiles of TMDs of proteins from different organelles (Golgi, ER, Nucleus and PM) in plants along the secretory pathway. Clearly, plasma membrane proteins in plants also have TMDs longer than proteins of all other organelles. (C) Analysis of hydrophobicity profiles of TMDs from different organelles (Golgi, ER, TGN/endo and PM) along the secretory pathway in non-mammalian vertebrates. (D) Analysis of hydrophobicity profiles of TMDs of proteins from different organelles (Golgi, ER, TGN/endo and PM) in mammals along the secretory pathway. (E) Distribution plots of TMD lengths of plasma membrane proteins from different organisms (fungi, plants, non-mammalian vertebrates and mammals). (F) Bar chart showing mean hydrophobic lengths of TMDs of plasma membrane proteins in different organisms.


Figure 3. Distribution plots of TMD lengths of proteins in (A) fungi, (B) plants, (C) non-mammalian vertebrates, and (D) mammals. Organelles were the same as in Fig. 2. X-axes represent TMD lengths and Y-axes represent frequencies, with n = number of transmembrane protein sequences associated with each organelle in each of the organisms. (E) Bar chart showing variation in the differences of TMD lengths of proteins among different organelles with respect to PM in different organisms. The root mean square of difference for an organism was calculated by taking the square root of the sum of squares of the difference between the TMD length of each protein associated with each organelle and the corresponding PM of each species in that organism.
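One possible reading of the measure in panel (E) is sketched below. It is an assumption rather than the authors' code: the caption mentions both a "root mean square" and a "square root of the sum of squares", so the hypothetical helper `deviation_from_pm` returns both, and it assumes each organellar TMD length is compared with the mean plasma membrane TMD length. The input lengths are made-up toy values, not data from this study.

```python
import math
from statistics import mean


def deviation_from_pm(organelle_lengths, pm_lengths):
    """Return (root of sum of squares, root mean square) of the differences
    between every organellar TMD length and the mean PM TMD length.

    Assumption: the PM reference is the mean PM length.
    """
    pm_mean = mean(pm_lengths)
    squared = [
        (length - pm_mean) ** 2
        for lengths in organelle_lengths.values()
        for length in lengths
    ]
    root_sum_sq = math.sqrt(sum(squared))
    rms = math.sqrt(sum(squared) / len(squared))
    return root_sum_sq, rms


# Toy values for illustration only.
toy_organelles = {"ER": [22, 24], "Golgi": [20, 18]}
toy_pm = [24, 26, 25]

root_sum_sq, rms = deviation_from_pm(toy_organelles, toy_pm)
print(round(root_sum_sq, 3), round(rms, 3))
```

The root mean square does not grow with the number of proteins, so it is the more natural choice when organisms with very different dataset sizes are compared.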

endosomes and plasma membrane but smaller for Golgi proteins (Supplementary Figs S3D, S4 and S5). Thus, in mammals, as observed in non-mammalian vertebrates, while TMD lengths are similar for TGN/endo and plasma membranes, and these have higher TMD lengths compared to other organelles, the overall differences between


Fungi | ER | Golgi | TGN/endo | PM
ER | - | p = 6.3 × 10^−51; KLDM = 0.5155 | p = 0.004; KLDM = 0.0787 | p = 4 × 10^−12; KLDM = 0.157
Golgi | p = 6.3 × 10^−51; KLDM = 0.4208 | - | p = 1 × 10^−42; KLDM = 0.4599 | p = 2.9 × 10^−22; KLDM = 0.4953
TGN/endo | p = 0.004; KLDM = 0.0779 | p = 1 × 10^−42; KLDM = 0.4604 | - | p = 8.2 × 10^−6; KLDM = 0.166
PM | p = 4 × 10^−12; KLDM = 0.1636 | p = 2.9 × 10^−22; KLDM = 0.8733 | p = 8.2 × 10^−6; KLDM = 0.1927 | -

Plants | ER | Golgi | Nucleus | PM
ER | - | p = 8.1 × 10^−7; KLDM = 0.0684 | p = 1.6 × 10^−25; KLDM = 0.3945 | p = 6.9 × 10^−18; KLDM = 0.2841
Golgi | p = 8.1 × 10^−7; KLDM = 0.0818 | - | p = 1.9 × 10^−43; KLDM = 0.4499 | p = 4 × 10^−4; KLDM = 0.4539
Nucleus | p = 1.6 × 10^−25; KLDM = 0.4884 | p = 1.9 × 10^−43; KLDM = 0.502 | - | p = 8.5 × 10^−58; KLDM = 0.942
PM | p = 6.9 × 10^−18; KLDM = 0.2578 | p = 4 × 10^−4; KLDM = 0.3519 | p = 8.5 × 10^−58; KLDM = 0.7979 | -

Non-Mammalian Vertebrates | ER | Golgi | TGN/endo | PM
ER | - | p = 7.4 × 10^−13; KLDM = 0.4353 | p = 0.024; KLDM = 0.4005 | p = 0.051; KLDM = 0.2239
Golgi | p = 7.4 × 10^−13; KLDM = 0.4018 | - | p = 6.9 × 10^−30; KLDM = 0.9439 | p = 5.7 × 10^−43; KLDM = 0.6351
TGN/endo | p = 0.024; KLDM = 0.4154 | p = 6.9 × 10^−30; KLDM = 0.8613 | - | p = 0.485; KLDM = 0.1081
PM | p = 0.051; KLDM = 0.2528 | p = 5.7 × 10^−43; KLDM = 0.5843 | p = 0.485; KLDM = 0.1181 | -

Mammals | ER | Golgi | TGN/endo | PM
ER | - | p = 4.4 × 10^−18; KLDM = 0.4732 | p = 1.3 × 10^−10; KLDM = 0.4936 | p = 2.9 × 10^−9; KLDM = 0.205
Golgi | p = 4.4 × 10^−18; KLDM = 0.439 | - | p = 7.1 × 10^−37; KLDM = 0.434 | p = 8.8 × 10^−63; KLDM = 0.417
TGN/endo | p = 1.3 × 10^−10; KLDM = 0.5438 | p = 7.1 × 10^−37; KLDM = 0.4457 | - | p = 0.003; KLDM = 0.174
PM | p = 2.9 × 10^−9; KLDM = 0.2016 | p = 8.8 × 10^−63; KLDM = 0.3761 | p = 0.003; KLDM = 0.1678 | -

Table 2. Statistical analyses of differences in TMD lengths between different organelles using t-tests and Kullback-Leibler Divergence Measure (KLDM) – Fungi, Plants, Non-Mammalian Vertebrates and Mammals.
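The KLDM entries in Table 2 are not symmetric (for example, the ER vs. Golgi value differs from the Golgi vs. ER value), which is what one expects from a Kullback-Leibler divergence. A minimal sketch of such a calculation on two sets of TMD lengths follows. The binning by integer length, the pseudocount, the natural logarithm and the function name `kl_divergence` are all assumptions, and the two length lists are toy values rather than data from this study.

```python
import math
from collections import Counter


def kl_divergence(lengths_p, lengths_q, pseudocount=0.5):
    """D_KL(P || Q) between two TMD-length histograms, in nats.

    Assumption: one bin per integer length, with a pseudocount added to
    every bin so that empty bins do not produce log(0).
    """
    combined = list(lengths_p) + list(lengths_q)
    bins = range(min(combined), max(combined) + 1)
    count_p, count_q = Counter(lengths_p), Counter(lengths_q)
    total_p = len(lengths_p) + pseudocount * len(bins)
    total_q = len(lengths_q) + pseudocount * len(bins)
    divergence = 0.0
    for length in bins:
        p = (count_p[length] + pseudocount) / total_p
        q = (count_q[length] + pseudocount) / total_q
        divergence += p * math.log(p / q)
    return divergence


# Toy distributions: one narrow, one broad.
narrow = [20] * 8 + [21, 22]
broad = [18, 20, 22, 24, 26, 28]

d_narrow_broad = kl_divergence(narrow, broad)
d_broad_narrow = kl_divergence(broad, narrow)
print(round(d_narrow_broad, 3), round(d_broad_narrow, 3))
```

Because the divergence depends on which distribution is taken as the reference, a table of pairwise values has different entries above and below the diagonal, unlike the symmetric p-values.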

the plasma membrane TMD lengths and other organelles are interestingly much lower compared to fungi, plants and non-mammalian vertebrates.

Summarizing. The overall differences in TMD lengths of different organelles in fungi, plants, non-mammalian vertebrates and mammals with respect to their plasma membranes reveal a very remarkable result, based on the GES hydrophobicity scale, shown in Fig. 3E (here it is important to note that Supplementary Fig. S1F, based on the Kyte and Doolittle hydrophobicity scale, confirms that these findings are hydrophobicity scale independent). As we move towards organisms perceived to be more complex (i.e. in terms of organismal complexity fungi < plants < non-mammalian vertebrates < mammals), the overall statistical differences of organelles from respective plasma membranes decrease. Thus, our results suggest that there is a gradual decrease in the difference of TMD length, and hence bilayer thickness, among cellular organelles and plasma membrane when we move from “simpler” (or “lower”) organisms to more “complex” (or “higher”) organisms. The variation in hydrophobic length of TMDs of membrane proteins for different organelles, especially when compared with that of their respective plasma membranes, decreases with increase in cellular dynamics and increased organismal complexity. These results are strongly supported by earlier findings that increased intracellular dynamics are one of the key features of higher (more complex) organisms5. Higher cell dynamics imply more exchange and/or continuity of material exchange, thereby reducing differences in bilayer thickness of subcellular components, a finding remarkably well captured by the lengths of TMDs of membrane proteins analysed in this work. We address some very exciting aspects of these startling findings in the discussion section.

Functional application of TMD lengths serving as signatures of organismal complexity towards cellular transport mechanisms. Encouraged by our findings on TMD lengths showing organelle specific dependence at a cellular level in different organisms, we decided to explore applicational perspectives of our results. All viruses, regardless of whether they are enveloped (i.e. contain their own membrane bilayer) or non-enveloped, employ intracellular transport mechanisms of their host cells for initiating and propagating infections. These transport mechanisms involve intricate associations and interplay of members of viral proteomes with membranes of different organelles, thereby assisting virions in entry into, and exit from, host cells. For


Figure 4. Different entry pathways followed by animal viruses along with assembly and exit mechanism. A schematic showing different modes of entry of enveloped animal viruses into host cells. Some viruses enter directly through plasma membrane (Pathway III) but most of the viruses penetrate through endocytic machinery. Pathways I, II and IV represent viral entry into host cells through macropinocytosis, clathrin-mediated endocytosis and caveolin-mediated endocytosis respectively.

example, Fig. 4 shows how many animal viruses are known to employ several modes of entry into their host cells, mainly through clathrin-mediated endocytosis, but some also via macropinocytosis, caveolin-mediated endocytosis, plasma membranes and through some other pathways, along with their exit mechanisms. Therefore, we formulated a straightforward question: given a viral proteome, is it possible to scan for putative hydrophobic segments representing specific TMD lengths to extract information on specific organellar association of the virus during its journey into or out of a host cell? To answer the above question, first we had to confirm whether host cell proteins themselves, which are involved in intracellular sorting pathways (e.g. clathrin-mediated and caveolin-mediated endocytosis, macropinocytosis), have TMD lengths serving as signatures of intracellular sorting. While a regular feature in cell biology textbooks is the important role of cytosolic endocytic signals in endocytosis of membrane proteins, it is emerging that in the absence of cytosolic sorting signals TMDs may act as sorting determinants during endocytosis. In fact, a few earlier reports encouraged us to hypothesize that TMD lengths can serve as signatures of intracellular sorting19–22. To test this hypothesis, we plotted hydrophobicity graphs of clathrin coat assembly proteins10,23, along with proteins involved in macropinocytosis24–26


S. No. | Enveloped Virus (Protein) | Known* Entry Mode | Reference(s) for* | Predicted TMD Length(s) | Mean ± Std
1 | HIV () | PM | 33,45 | 30 | 28.25 ± 2.36
2 | Herpes Simplex (glycoprotein gH) | PM | 33,45 | 28 |
3 | Measles () | PM | 33,45 | 30 |
4 | Sendai (Fusion Protein) | PM | 33,45 | 25 |
5 | Hepatitis C (E2 glycoprotein) | CME | 29,30 | 17 | 10.80 ± 5.12
6 | Sindbis (E1 glycoprotein) | CME | 28 | 5 |
7 | Vesicular Stomatitis (Glycoprotein G) | CME | 31,32 | 7 |
8 | Semliki Forest (E1 glycoprotein) | CME | 33 | 15 |
9 | Equine Infectious Anemia (Outer membrane protein) | CME | 34 | 10 |
10 | (A21 protein) | MPE | 33,46 | 8 | 20.67 ± 7.06
11 | Filamentous (Hemagglutinin) | MPE | 38 | 27 |
12 | Cytomegalovirus (Env Glycoprotein H) | MPE | 37 | 20 |
13 | Respiratory Syncytial virus A (F1 protein) | MPE | 36 | 27 |
14 | African Swine Fever (177L) | MPE | 35 | 19 |
15 | Ebola (Envelope Glycoprotein) | MPE | 39–41 | 23 |
16 | SV 40 () | CAME | 43 | 10 | 13.75 ± 7.50
17 | Polyoma (Agnoprotein) | CAME | 43 | 10 |
18 | New Castle Disease (F1 protein) | CAME | 42 | 25 |
19 | Echovirus 1 ( protein) | CAME | 44 | 10 |

Table 3. Viral entry pathways of animal viruses and TMD lengths (predicted from hydrophobicity plots) of proteins known to play a key role in viral entry. PM – Plasma membrane; CME – Clathrin-mediated endocytosis; MPE – Macropinocytosis; CAME – Caveolin-mediated endocytosis.
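The Mean ± Std column of Table 3 can be recomputed from the nineteen predicted TMD lengths. The sketch below groups the lengths by entry mode (rows 1 to 4 as PM, 5 to 9 as CME, 10 to 15 as MPE and 16 to 19 as CAME) and uses the sample standard deviation; both choices are assumptions, made because they reproduce the tabulated values. Variable names are illustrative.

```python
from statistics import mean, stdev

# Predicted TMD lengths from Table 3, grouped by known entry mode.
tmd_lengths = {
    "PM": [30, 28, 30, 25],
    "CME": [17, 5, 7, 15, 10],
    "MPE": [8, 27, 20, 27, 19, 23],
    "CAME": [10, 10, 25, 10],
}

# Mean and sample standard deviation (n - 1 in the denominator) per group.
summary = {
    mode: (round(mean(lengths), 2), round(stdev(lengths), 2))
    for mode, lengths in tmd_lengths.items()
}

for mode, (group_mean, group_std) in summary.items():
    print(f"{mode}: {group_mean:.2f} ± {group_std:.2f}")
```

With only four to six proteins per pathway, the standard deviations are large relative to the differences between some groups, which is worth keeping in mind when reading the comparisons between pathways.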

and caveolin-mediated27 endocytosis (Supplementary Table S5 and Supplementary Fig. S7). We found that proteins associated with caveolin-mediated endocytosis (~10 ± 8, n = 9), clathrin coated vesicles (~17 ± 6, n = 176) and macropinocytosis (~22 ± 7, n = 19) have shorter TMDs than typical plasma membrane proteins (comparing the above results in parentheses with those in Fig. 2F). Further, two-tailed heteroscedastic t-tests for the above TMD-length distributions yielded p = 0.021 for clathrin coated vesicles vs. caveolin-mediated endocytosis, p = 0.001 for caveolin-mediated endocytosis vs. macropinocytosis, and p = 0.011 for clathrin coated vesicles vs. macropinocytosis. These statistically relevant differences (p ≤ 0.05) clearly show that TMD lengths serve as signatures for intracellular sorting mechanisms. We were therefore in a position to directly investigate whether viral proteomes contain TMD length signatures for specific associations with host cell organelles. To do so, first we collected a list of animal viruses for which the primary intracellular pathways (all of which are shown in Fig. 4) involved in their entry into their respective host cells are known28–46. Table 3 shows the collected list of animal viruses, along with the names of specific proteins in their proteomes that are experimentally established to play a key role in their entry.
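For reference, the statistic behind a two-tailed heteroscedastic (Welch) t-test can be computed directly from summary values. The sketch below uses only the rounded means, standard deviations and sample sizes quoted above for clathrin coated vesicles and caveolin-mediated endocytosis, so it is not expected to reproduce the reported p-values exactly; `welch_t` is a hypothetical helper name. Converting the statistic and degrees of freedom into a p-value requires the Student t distribution, for example via `scipy.stats.ttest_ind_from_stats` with `equal_var=False`.

```python
import math


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    computed from the summary statistics of two independent samples."""
    v1 = std1 ** 2 / n1
    v2 = std2 ** 2 / n2
    t_stat = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


# Rounded values from the text: clathrin coated vesicles (~17 ± 6, n = 176)
# vs. caveolin-mediated endocytosis (~10 ± 8, n = 9).
t_stat, dof = welch_t(17, 6, 176, 10, 8, 9)
print(round(t_stat, 2), round(dof, 1))
```

The small caveolin group (n = 9) dominates the degrees of freedom, so the test behaves much like one based on nine observations despite the large clathrin sample.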

Functional application of TMD lengths serving as signatures of organismal complexity towards animal viral transport mechanisms in host cells. The next step was to utilize the methodology developed in this work to calculate TMD lengths of each of the specific proteins for each of the animal viruses listed in Table 3. In order to do this, we first obtained hydrophobicity plots for TMD lengths for each of these specific proteins as shown in Fig. 5 (a couple of plots are also shown in Supplementary Figs S8A and S8B; they were not included in Fig. 5 to maintain visual symmetry in the main figure). For convenience of interpretation, Fig. 5 also shows the experimentally known entry pathway corresponding to each virus within each hydrophobicity plot. TMD lengths predicted from these hydrophobicity plots are listed in Table 3 in the fifth column, with the last (sixth) column showing mean ± standard deviation of viral proteins’ TMD lengths. Close observation of these results yields surprisingly solid support for the idea that TMD lengths of animal viral proteins do indeed serve as signatures of viral entry mechanisms into their respective host cells. With the sole exception of clathrin-mediated vs. caveolin-mediated endocytosis, there is a clear statistical difference in TMD lengths of viral proteins utilizing different cellular entry pathways (see also Supplementary Fig. S8C).

Functional application of TMD lengths serving as signatures of organismal complexity towards viral replication in plants. Encouraged by the results obtained above, we decided to explore whether TMD lengths can also serve as signatures of viral transport mechanisms for plant viruses. To our surprise, we found that the literature on plant viral transport mechanisms is quite sparse; however, we found several reports on viral replication sites inside plant cells47–55. Therefore, based on our literature survey, we compiled a broad classification of replication sites important for viral replication in plant cells, as shown in Fig. 6. Next, as done earlier, we collected a list of plant viruses (including Flock House virus, which is not a plant virus, but is known to be able to replicate in plant cells55) for which the primary replication sites (shown in Fig. 6) in


Figure 5. Hydrophobicity graphs of TMDs of animal viral proteins associated with viral entry through various pathways into host cells. The name of the virus, and the member of the viral proteome experimentally identified (from literature) as the key to entry into host cells, is given in each plot. Pathway I: Mode of entry is established to be through macropinocytosis for African Swine Fever Virus, Filamentous Influenza Virus, Respiratory Syncytial Virus and Cytomegalovirus. Pathway II: Mode of entry is established to be through clathrin-mediated endocytosis for Hepatitis C Virus, Sindbis Virus, Semliki Forest Virus and Vesicular Stomatitis Virus. Pathway III: Mode of entry is established to be through plasma membrane for Measles Virus, Herpes Simplex Virus, Human Immunodeficiency Virus and Sendai Virus. Pathway IV: Mode of entry is established to be through caveolin-mediated endocytosis for Newcastle Disease Virus, SV40, Polyoma virus and Echovirus.

their respective host cells are known47–55. Table 4 shows the collected list of plant viruses, along with the names of specific proteins in their proteomes that are experimentally established to play a key role in viral replication. The next step was to utilize the methodology developed in this work to calculate TMD lengths of each of the specific proteins for each of the plant viruses listed in Table 4. In order to do this, we first obtained hydrophobicity plots for predicting TMD lengths of each of these specific proteins as shown in Fig. 7 (a couple of plots are also shown in supplementary Figs S8D and S8E; they were not included in Fig. 7 to maintain visual symmetry in the main figure). For convenience of interpretation, Fig. 7 also shows the experimentally known replication site


Figure 6. Different virus replication sites for plant viruses. In plants, several organelles serve as replication sites of viruses. The schematic broadly categorizes these replication sites into ER, mitochondria, peroxisomes, chloroplast and nucleus.

corresponding to each virus within each hydrophobicity plot. TMD lengths predicted from these hydrophobicity plots are listed in Table 4 in the fifth column, with the last (sixth) column showing mean ± standard deviation of viral proteins’ TMD lengths. Clearly, TMD lengths of viral replication proteins are not able to distinguish between replication sites of plant viruses, with the exception of those involved in replication at the chloroplast (see also supplementary Fig. S8F). However, even in the case of chloroplasts, the data are too sparse (“n” is only 2) to support a strong conclusion. Nevertheless, this analysis of TMD lengths towards gaining insights into plant viral infection mechanisms is, to our knowledge, the first of its kind, and could be quite promising in future, especially when applied to larger data sets on viral replication and transport mechanisms in plants.

Discussion
The objective of this study was to determine the importance of the length of transmembrane domains (TMDs) of single span (bitopic) membrane proteins from an evolutionary perspective. Multi-span membrane proteins were not included, since it is not straightforward to identify specific (groups of) residues interacting with membrane lipids, and since polytopic membrane proteins usually have active sites mostly buried within the transmembrane helical bundles. Investigations in the last few decades indicate that most of the important protein-protein or protein-lipid interactions involve transmembrane helices of bitopic membrane proteins rather than those of polytopic membrane proteins12,13. The purpose and importance of our work was to proceed beyond pure sequence analysis of single span membrane proteins and to consider implications of TMD lengths for the organellar systems.
Here it is important to note that our work evolved serendipitously while trying to reproduce and extend the results of an elegant study published by Sharpe et al.11. In fact, it was essential to reproduce their earlier results with our extended data sets (for fungi and vertebrates) to serve as strong positive (computational) controls. Additionally, a close comparison of our results on TMD lengths in plants alone showed remarkable agreement with those reported in a recent study56. In spite of analyzing independent datasets, our results on plant TMDs (Figs 2B and 3B) confirm the findings of Nikolovski et al.56 on TMD lengths of proteins associated with ER, Golgi/TGN and plasma membranes in plant cells (their Fig. 4A,B).


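| S. No. | Plant Virus (Protein) | Known Replication Site | Reference(s) for Known Replication Site | Predicted TMD Length(s) | Mean ± Std |
|---|---|---|---|---|---|
| 1 | Potato mop-top (TGBp2) | Endoplasmic Reticulum | 51 | 20 | 19.83 ± 0.41 |
| 2 | Tomato-ringspot (X2) | Endoplasmic Reticulum | 51 | 20 | |
| 3 | Tobacco-etch (6K2 protein) | Endoplasmic Reticulum | 48 | 20 | |
| 4 | Potato Virus X (TGBp3) | Endoplasmic Reticulum | 49 | 20 | |
| 5 | Maize Dwarf Mosaic (6K2) | Endoplasmic Reticulum | 48 | 19 | |
| 6 | Turnip Mosaic (6K2) | Endoplasmic Reticulum | 51 | 20 | |
| 7 | Flock house* (Protein A) | Mitochondria | 55 | 20 | 20.67 ± 0.58 |
| 8 | Carnation Italian Ringspot (Replication-associated protein 1) | Mitochondria | 50 | 21 | |
| 9 | Pelargonium Flower Break (p27 protein) | Mitochondria | 54 | 21 | |
| 10 | Tomato Bushy Stunt (p33) | Peroxisome | 47 | 21 | 21.67 ± 1.15 |
| 11 | Cymbidium Ringspot (Movement protein) | Peroxisome | 51 | 23 | |
| 12 | Cucumber Necrosis (p33) | Peroxisome | 52 | 21 | |
| 13 | Turnip Yellow Mosaic (Methyl transferase) | Chloroplast | 51 | 11 | 13.00 ± 2.83 |
| 14 | Cowpea chlorotic mottle (coat protein) | Chloroplast | 53 | 15 | |
| 15 | Sonchus Yellow Net (M2 protein) | Nucleus | 51 | 20 | 20 |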

Table 4. Viral replication sites of plant viruses and TMD lengths (predicted from hydrophobicity plots) of proteins known to play a key role in replication. *Flock house virus is actually not a plant virus, but can replicate in plant cells.
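The group statistics in Table 4 can be rechecked directly from the predicted TMD lengths. The sketch below assumes that the "Mean ± Std" column uses the sample standard deviation (n − 1 denominator); the variable and function names are illustrative only. The nucleus group is omitted because a standard deviation is undefined for a single value.

```python
from statistics import mean, stdev

# Predicted TMD lengths from Table 4, grouped by known replication site.
tmd_lengths = {
    "Endoplasmic Reticulum": [20, 20, 20, 20, 19, 20],
    "Mitochondria": [20, 21, 21],
    "Peroxisome": [21, 23, 21],
    "Chloroplast": [11, 15],
}

def summarize(lengths):
    """Return (mean, sample standard deviation), both rounded to 2 decimals."""
    return round(mean(lengths), 2), round(stdev(lengths), 2)

summary = {site: summarize(values) for site, values in tmd_lengths.items()}
for site, (m, s) in summary.items():
    print(f"{site}: {m:.2f} ± {s:.2f} (n = {len(tmd_lengths[site])})")
```

With only two or three proteins per non-ER site, these standard deviations are rough estimates, which is consistent with the caution expressed in the text about the chloroplast group.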

Our study further extends observations on statistically significant differences in organellar TMD lengths with the inclusion of data on proteins associated with nuclear membranes in plant cells. While reproducibility of earlier findings with independent and expanded datasets is indeed promising by itself, our work is, to our knowledge, the first comprehensive and simultaneous analysis of the available eukaryotic proteomes across different eukaryotic organisms (fungi, plants, non-mammalian vertebrates and mammals; the ordering was chosen by us in view of increasing organismal complexity). We find that TMD lengths serve as signatures of the specific organelles in which the corresponding transmembrane proteins reside in different eukaryotic cells. Plasma membrane TMDs of all the organisms are longer than the TMDs of intracellular organelle membranes and, interestingly, we found that TMD lengths of plasma membrane proteins were similar for all organisms, indicating similar thickness of bilayers in plasma membranes and supporting a common origin of these eukaryotic cell boundaries57. At the same time, there is no relationship in sequence and function among the plasma membrane proteins of all these organisms. The significance of our results is well supported by results indicating that localization of membrane proteins depends on neither their sequence homology nor their structural features; the only feature reported to influence localization is the length of their TMDs5.
Interestingly, these results on membrane proteins strongly support recently emerging views on “secularity” of amino acid residues in soluble proteins: the composition of primary sequences, in terms of percentage occurrence of amino acids relative to each other (reflected here by TMD lengths identified by hydropathy plots based on relative occurrence of “stretches” of hydrophobic and hydrophilic residues), rather than the actual sequence in which the residues appear, plays a key role in obtaining functional folded proteins58–62. We have also explored the scope of our results from an applied perspective by using our methodology to specifically investigate intracellular and viral transport mechanisms. We find that TMD lengths serve as signatures not only for intracellular transport mechanisms native to eukaryotic cells, but also provide clear indications of possible viral transport mechanisms, especially for animal viruses. Future applications of the methodologies developed in this work may assist in designing well-directed experiments for investigating intracellular transport mechanisms utilized by viruses whose mechanisms are not yet known, but whose proteomes can be determined by using modern experimental proteomic tools. Further, our results on TMD lengths of bitopic proteins, approximated as helices and representing bilayer thickness, are also a strong addition to the growing literature on geometrical interpretations of molecular interactions, especially pertaining to membranes and proteins in biology63–65.

Conclusions: A new perspective on organismal complexity. Finally, we wish to emphasize that our results provide a unique and novel evolutionary perspective. Two contrasting, but highly appealing, evolutionary inferences can be made from our results. From Fig. 3E it can be inferred that the next major step in evolution of complexity in eukaryotic cells is a further decrease in differences between TMD lengths, and hence bilayer thickness, of plasma membranes and intracellular organellar systems. From a philosophical standpoint, it appears analogous to homogenization of differences between various compartments of the cell and the cell boundary, reflective of evolution of the society in general. Alternatively, Fig. 3E may indicate that an evolutionary saturation in differences between bilayer thickness of plasma membranes and intracellular organellar systems has been reached or will be reached soon. Beyond this, a new evolutionary cycle may begin with simpler eukaryotic cells originating again and leading to further complex systems. Regardless of which of the above two is correct, they are equally thought provoking and open up a fresh avenue towards views on cellular complexity and evolution.


Figure 7. Hydrophobicity graphs of TMDs of plant viral proteins associated with viral replication in plant cells. The name of the virus, and the member of the viral proteome experimentally identified as the key to replication in the host cells, is given in each plot. Viral Replication Site I: ER serves as the replication site for Potato mop-top Virus, Tomato Ringspot Virus and Tobacco Etch Virus in plant cells. Viral Replication Site II: Mitochondria serve as the replication sites for Flock House Virus, Carnation Italian Ringspot Virus and Pelargonium Flower Break Virus in plant cells. Note that while Flock House is not a plant virus, it has been shown to replicate at the mitochondria of plant cells. Viral Replication Site III: Peroxisomes serve as replication sites for Tomato Bushy Stunt Virus, Cucumber Necrosis Virus and Cymbidium Ringspot Virus in plant cells. Viral Replication Site IV: Chloroplasts serve as replication sites for Turnip Yellow Mosaic Virus and Cowpea Chlorotic Mottle Virus. It is important to note that the nucleus serves as the replication site (i.e. site V) for Sonchus Yellow Net Virus; this hydrophobicity plot is not shown to maintain visual symmetry in the figure.

Methodology
Data Collection. In our study, we performed a comparative analysis of transmembrane domains of membrane proteins (n = 68,281) of different organelles from discrete subcellular locations. We collected datasets of bitopic transmembrane proteins from the best studied eukaryotic genomes. Reference proteins for fungi were therefore collected from Saccharomyces cerevisiae for the computational analysis of TMDs of proteins from their different organelles (Endoplasmic Reticulum, Golgi, TGN/Endo and Plasma membrane). Reference proteins for plants were collected from Arabidopsis thaliana for the computational analysis of TMDs of proteins from ER, Golgi, Nucleus, Mitochondria, Peroxisomes, Chloroplast and Plasma Membrane. Reference proteins for non-mammalian vertebrates and mammals were collected from Gallus gallus and Homo sapiens respectively for TMD analysis of all their organelles. Reference proteins are those proteins whose organelle localisation and TMD span (start and end) definitions are well known; this original dataset of all reference proteins comprised 394 sequences. Accession numbers of reference proteins for fungi and non-mammalian vertebrates, whose organelle residences and topology were known and best studied, were collected from literature11. Accession numbers of reference proteins for plants were collected from Plant Proteome Database, ARAMEMNON Database and AT_CHLORO Database. For mammals, accession numbers were collected from literature searches and the LOCATE database.


Accession numbers for membrane proteins involved in clathrin-mediated endocytosis, macropinocytosis and caveolin-mediated endocytosis were collected from the National Center for Biotechnology Information (NCBI). Accession numbers of membrane proteins of animal and plant viruses (enveloped and non-enveloped) were collected from the Viral Zone Database and NCBI. All accession numbers and related information are provided in the supplementary information (with tables).

Collection of orthologous membrane proteins and transmembrane protein orientation. The size of our initial dataset was somewhat limited due to selection of only those membrane proteins whose organelle location and topology were known in literature (n = 394). Thus, in order to increase the dataset (i.e. number of sequences) for our TMD analysis, we used BLAST to collect orthologous proteins for each organism. We collected orthologous proteins from 162 fungal, 32 plant, 17 non-mammalian vertebrate and 90 mammalian genomes to augment the sequence information (n = 68,281). Only BLAST hits with an E-value smaller than 10^−10 were retained. After collecting all the sequences (n = 68,281), the TMHMM server was used to predict the transmembrane protein orientation in the reference proteins and the orthologous proteins. The hydrophobic spans were aligned from the cytosolic side to the exoplasmic side.

Screening of TMD span in membrane protein sequences. For preparing hydrophobicity graphs we used the Goldman-Engelman-Steitz (GES) hydrophobicity scale because it is based on thermodynamic measurements rather than statistical ones. To avoid bias and to ensure that our results are not hydrophobicity-scale-dependent, we also used the Kyte-Doolittle scale for our analyses. We then analysed the “rough” position of the TMDs in all the membrane protein sequences, initially using a window size of 18 residues for all the organelles (Golgi, Endoplasmic Reticulum, TGN/Endosomes, Nucleus, Chloroplast, Peroxisomes, Mitochondria and Plasma Membrane). We defined the initial 18 residue window to be the one that is the most hydrophobic in the transmembrane region, irrespective of how many hydrophilic residues were present within it (and their relative location). This step also provided an additional check of the presence of a TM span in the orthologous sequences. The TMD spans identified in this screening step were then catalogued along with their flanking residues (i.e. those residues that are next to both cytosolic and exoplasmic edges of the “rough” TMDs). A maximum of 8 residues on each side were considered as flanking residues. Thus, sequences emerging out of this step of analysis had a maximum possible length of 34 residues (8 + 18 + 8) and a minimum possible length of 26 residues (8 + 18 or 18 + 8).
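The "rough" screening step can be sketched as a sliding-window scan. This is a minimal illustration, not the authors' implementation: it uses the published Kyte-Doolittle hydropathy values (the second scale named above) rather than GES, the function names are hypothetical, and the example sequence is synthetic.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter codes).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def most_hydrophobic_window(seq, window=18):
    """Return (start, score) of the window with the highest summed hydropathy."""
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    best_start, best_score = 0, float("-inf")
    for start in range(len(seq) - window + 1):
        score = sum(KD[aa] for aa in seq[start:start + window])
        if score > best_score:  # the first maximum wins on ties
            best_start, best_score = start, score
    return best_start, best_score

def rough_tmd_with_flanks(seq, window=18, flank=8):
    """Return the rough TMD plus up to `flank` residues on each side."""
    start, _ = most_hydrophobic_window(seq, window)
    left = max(0, start - flank)
    right = min(len(seq), start + window + flank)
    return seq[left:right]

# Synthetic example: an 18-residue hydrophobic core between polar flanks.
example = "MKRSDEQNKT" + "LLIVALLAVLGIAFLLVL" + "KRRSDEQNKTSD"
start, score = most_hydrophobic_window(example)
segment = rough_tmd_with_flanks(example)
print(start, round(score, 1), len(segment))
```

When fewer than 8 residues are available on one side, the extracted segment is shorter than 34 residues, which corresponds to the range of lengths described above.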

Reduction in sequence redundancy based on sequence similarity. In order to ensure that our analyses were not biased by the presence of closely related sequences in our orthologous collection, we used BLASTClust to screen for sequence redundancy in our dataset (i.e. sequences obtained from the last step of the previous screening). Using BLASTClust we checked the similarity between reference proteins and their corresponding orthologous protein sequences, and also among different orthologous proteins of the same reference protein. The aim was to collect only those protein sequences from each group which were not more than 30% similar to each other. This clustering process was performed on the TMD region along with its flanking sequences. Final numbers of non-redundant sequences are shown in Table 1.

Refinement of TMD span – Defining “Start” and “End” edges of TMDs. The next, and most crucial, step was refinement of TMD span edges (start and end points). In the previous “rough” screening for TMDs, we had allowed hydrophilic residues if they were followed by sufficiently hydrophobic residues; if they were not followed by hydrophobic residues, they were chopped off the edge of the TMD span. It is challenging to deal with the edge cases of TMDs, especially in the case of individual hydrophilic residues appearing in the middle of the core region of TMDs (thereby having negligible effects on overall hydrophobicity scores). However, hydrophilic residues at the edge of TMD spans help to define the edge precisely and are not included in TMDs. Following the above rule, the end points of the TMD span, i.e. edges, would contract (i.e. the length of the TMD span would reduce) if the hydrophilic residues on the edges are not surrounded by sufficiently hydrophobic residues (e.g. three hydrophobic residues after and three hydrophobic residues before any hydrophilic residue). Further, if hydrophilic residues are surrounded by sufficiently hydrophobic residues, the edges of the 18 residue region would expand, resulting in differences in TMD lengths among different sequences. Essentially, the most important step is to find the hydrophilic to hydrophobic transition (TM/aqueous vs buried) in the sequences. On this basis, all the protein sequences from each set (e.g. for each organism and each organelle) were aligned at the position where a sharp change in hydropathy occurred; the cytosolic end of the hydrophobic region was taken as position one (01) while doing this. The next challenge was to consistently define the end of transmembrane spans in sequences. Here it is important to note that while TMHMM is excellent at identifying TM spans, the exact end points can vary; even if the variation is by only a couple of residues, it is not precise.
To overcome this limitation, we recognized that once all the ends are refined consistently (i.e. by applying exactly the same series of steps) based on hydrophobicity and the presence of charges, all of the features became much sharper (at both ends of the span). Therefore, instead of using the TMHMM server for defining TM spans, we wrote a refinement algorithm (executed in MATLAB, Mathworks Inc.). Figure S9 shows the algorithm developed and used by us as a flow chart (with description). Implementation of this algorithm enabled us to align protein sequences in a given dataset (e.g. an organelle set) at the positions where a sharp change in hydropathy occurred. Thus, after applying this algorithm to our sequences, we were able to plot hydrophobicity graphs of each of the TMD sequences from different datasets, belonging to fungi, plants, non-mammalian vertebrates and mammals and their different subcellular locations/organelles (Plasma Membranes, ER, Golgi, TGN/Endosomes, Mitochondria, Chloroplast, Nucleus, Peroxisomes). Additionally, the above approach was applied to


obtain hydrophobicity graphs of protein sequences (a) involved in different endocytic pathways, and, (b) from animal and plant viruses.
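The contraction and expansion of TMD edges described in this section can be illustrated with a short sketch. It is not the authors' algorithm (the flow chart in Figure S9 is not reproduced here) and makes several loud assumptions: a residue counts as hydrophobic if its Kyte-Doolittle hydropathy is positive, "sufficiently hydrophobic" means three hydrophobic residues on each side of a hydrophilic residue, charges are not treated separately, and all function names are hypothetical.

```python
# Assumption: residues with positive Kyte-Doolittle hydropathy are "hydrophobic".
HYDROPHOBIC = set("ACFILMV")

def is_hydrophobic_run(segment, run):
    """True if `segment` consists of exactly `run` hydrophobic residues."""
    return len(segment) == run and all(aa in HYDROPHOBIC for aa in segment)

def refine_end(seq, start, end, run=3):
    """Refine the exclusive end index of a span [start, end) in `seq`."""
    # Expand: take in hydrophobic residues, and a hydrophilic residue only
    # if it has `run` hydrophobic residues immediately before and after it.
    while end < len(seq):
        if seq[end] in HYDROPHOBIC:
            end += 1
        elif (end - start >= run
              and is_hydrophobic_run(seq[end - run:end], run)
              and is_hydrophobic_run(seq[end + 1:end + 1 + run], run)):
            end += 1
        else:
            break
    # Contract: hydrophilic residues left at the edge are not part of the TMD.
    while end > start and seq[end - 1] not in HYDROPHOBIC:
        end -= 1
    return end

def refine_span(seq, start, end, run=3):
    """Refine both edges; the start edge is handled on the reversed sequence."""
    new_end = refine_end(seq, start, end, run)
    n = len(seq)
    rev_end = refine_end(seq[::-1], n - new_end, n - start, run)
    return n - rev_end, new_end

# Synthetic examples: one edge that expands and one that contracts.
expanding = "DEKR" + "L" * 18 + "SLLL" + "KDE"
contracting = "DEKR" + "L" * 16 + "SK" + "DEDE"
print(refine_span(expanding, 4, 22), refine_span(contracting, 4, 22))
```

In the first example the span grows across a serine that is bracketed by hydrophobic residues; in the second, two hydrophilic residues at the edge of the 18-residue window are trimmed, so the refined lengths differ between sequences, as described above.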

Statistical analyses for comparing TMD lengths and distributions. We performed t-tests to test the significance of differences (or lack thereof) in TMD lengths among different organelles and report the p-values obtained. To confirm our findings rigorously, we also performed Kullback-Leibler (KL) divergence tests on the distributions of the TMD lengths. The KL divergence measure (KLDM) was calculated for all combinations of organelle sets in each organism. Since the measure is asymmetric, it gives different values when a distribution X is compared to Y with X as the base distribution vs with Y as the base distribution. The symmetric version of KLDM is simply the average of the KLDM for X vs Y and for Y vs X. Briefly, the KL divergence function takes three arguments:

1. X: the set of values.
2. P1: First probability distribution.
3. P2: Second probability distribution.
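A function with these three arguments can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: natural logarithms are used, values with zero probability in either distribution are skipped so that the logarithm is defined, no renormalization is applied after skipping, and the names `kl_divergence`, `symmetric_kl` and `relative_frequencies` are hypothetical.

```python
import math
from collections import Counter

def relative_frequencies(lengths, x):
    """Probability of each value in `x` among the observed `lengths`."""
    counts = Counter(lengths)
    total = len(lengths)
    return [counts[value] / total for value in x]

def kl_divergence(x, p1, p2):
    """KL divergence of P2 from P1 over the values in X (P1 is the base)."""
    divergence = 0.0
    for _, a, b in zip(x, p1, p2):
        if a > 0 and b > 0:  # skip entries with no occurrences
            divergence += a * math.log(a / b)
    return divergence

def symmetric_kl(x, p1, p2):
    """Average of the two asymmetric measures."""
    return 0.5 * (kl_divergence(x, p1, p2) + kl_divergence(x, p2, p1))

# Toy TMD-length samples (not data from this study).
x = [19, 20, 21]
p1 = relative_frequencies([19, 20, 20, 21], x)
p2 = relative_frequencies([20, 20, 21, 21], x)
print(kl_divergence(x, p1, p2), kl_divergence(x, p2, p1), symmetric_kl(x, p1, p2))
```

Because the measure depends on which distribution is taken as the base, the two asymmetric values generally differ, and the symmetric version gives a single number per pair of organelle sets.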

Since KLDM calculation involves logarithm of probabilities (relative frequencies), any entries that have no occurrences in the two distributions being compared cause the calculation to fail. Since our intent was pair-wise comparisons of all types within each organelle, we had to trim the data so that all probability entries considered for KLD measure were non zero.

References
1. Bonifacino, J. S., Cosson, P., Shah, N. & Klausner, R. D. Role of potentially charged transmembrane residues in targeting proteins for retention and degradation within the endoplasmic reticulum. EMBO J. 10, 2783–2793 (1991).
2. Munro, S. Sequences within and adjacent to the transmembrane segment of alpha-2,6-sialyltransferase specify Golgi retention. EMBO J. 10, 3577–3588 (1991).
3. Machamer, C. E. et al. Retention of a cis Golgi protein requires polar residues on one face of a predicted alpha-helix in the transmembrane domain. Mol. Biol. Cell 4, 695–704 (1993).
4. Cortesio, C. L., Lewellyn, E. B. & Drubin, D. G. Control of lipid organisation and actin assembly during clathrin mediated endocytosis by the cytoplasmic tail of the rhomboid protein Rbd2. Mol. Biol. Cell 26, 1509–1522 (2015).
5. Karsten, V., Hedge, R. S., Sinai, A. P., Yang, M. & Joiner, K. A. Transmembrane Domain Modulates Sorting of Membrane Proteins in Toxoplasma gondii. J. Biol. Chem. 279, 26052–26057 (2004).
6. Bretscher, M. S. & Munro, S. Cholesterol and the Golgi apparatus. Science 261, 1280–1281 (1993).
7. Rayner, J. C. & Pelham, H. R. Transmembrane domain-dependent sorting of proteins to the ER and plasma membrane in yeast. EMBO J. 16, 1832–1841 (1997).
8. Roberts, C. J., Nothwehr, S. F. & Stevens, T. H. Membrane protein sorting in the yeast secretory pathway: evidence that the vacuole may be the default compartment. J. Cell Biol. 119, 69–83 (1992).
9. Watson, R. T. & Pessin, J. E. Transmembrane domain length determines intracellular membrane compartment localization of syntaxin 3, 4 and 5. Am. J. Physiol. Cell Physiol. 281, C215–C223 (2001).
10. Mercanti, V. et al. Transmembrane domains control exclusion of membrane proteins from clathrin-coated pits. J. Cell Sci. 123, 3329–3335 (2010).
11. Sharpe, H. J., Stevens, T. J. & Munro, S. A Comprehensive Comparison of Transmembrane Domains Reveals Organelle-Specific Properties. Cell 142, 158–169 (2010).
12. Cosson, P., Perrin, J. & Bonifacino, J. S. Anchors aweigh: protein localization and transport mediated by transmembrane domains. Trends in Cell Biology 23, 511–517 (2013).
13. Kemble, G. W., Danieli, T. & White, J. M. Lipid-anchored influenza hemagglutinin promotes hemifusion to both red blood cell and planar bilayer membranes. Cell 76, 383–391 (1994).
14. Melikyan, G. B., White, J. M. & Cohen, F. S. GPI-anchored influenza hemagglutinin induces hemifusion to both red blood cell and planar bilayer membranes. J. Cell Biol. 131, 679–691 (1995).
15. Ragheb, J. A. & Anderson, F. Uncoupled expression of Moloney envelope polypeptides SU and TM: a functional analysis of the role of TM domains in viral entry. J. Virol. 68, 3207–3219 (1994).
16. Odell, D., Wanas, E., Yan, J. & Ghosh, H. P. Influence of membrane anchoring and cytoplasmic domains on the fusogenic activity of vesicular stomatitis virus glycoprotein G. J. Virol. 71, 7996–8000 (1997).
17. Cortesio, C. L., Lewellyn, E. B. & Drubin, D. G. Control of lipid organization and actin assembly during clathrin-mediated endocytosis by the cytoplasmic tail of the rhomboid protein Rbd2. Mol. Biol. Cell 26, 1509–1522 (2015).
18. Sanfacon, H. Investigating the role of viral integral membrane proteins in promoting the assembly of nepovirus and comovirus replication factories. Frontiers in Plant Science 3, 1–7 (2013).
19. Mouritsen, O. G. & Bloom, M. Models of lipid-protein interactions in membranes. Annu. Rev. Biophys. Biomol. Struct. 22, 145–171 (1993).
20. Andersen, O. S. & Koeppe, R. E. Bilayer thickness and membrane protein function: an energetic perspective. Annu. Rev. Biophys. Biomol. Struct. 36, 107–130 (2007).
21. Takamori, S. et al. Molecular anatomy of a trafficking organelle. Cell 127, 831–846 (2006).
22. Cleverly, D. Z. & Lenard, J. The transmembrane domain in viral fusion: Essential role for a conserved glycine residue in vesicular stomatitis virus G protein. Proc Natl Acad Sci USA 95, 3425–3430 (1998).
23. McMahon, H. T. & Boucrot, E. Molecular mechanism and physiological functions of clathrin-mediated endocytosis. Nature Rev. Mol. Cell Biol. 12, 517–533 (2011).
24. Kerr, M. C. & Teasdale, R. D. Defining macropinocytosis. Traffic 10, 364–371 (2009).
25. Saeed, M. F., Kolokoltsov, A. A., Albrecht, T. & Davey, R. A. Cellular entry of ebola virus involves uptake by a macropinocytosis-like mechanism and subsequent trafficking through early and late endosomes. PLoS Pathog. 6, e1001110 (2010).
26. Hunt, C. L., Kolokoltsov, A. A., Davey, R. A. & Maury, W. The Tyro3 receptor kinase Axl enhances macropinocytosis of . J. Virol. 85, 334–347 (2011).
27. Lisanti, M. P. et al. Characterization of caveolin-rich membrane domains isolated from an endothelial-rich source: implications for human disease. J. Cell Biol. 126, 111–126 (1994).
28. Tulleo, L. D. & Kirchhausen, T. The clathrin endocytic pathway in viral infection. EMBO J. 17(16), 4585–4593 (1998).
29. Benedicto, I. et al. Clathrin Mediates Infectious Hepatitis C Virus Particle Egress. J. Virol. 89, doi: 10.1128/JVI.03620-14 (2015).
30. Blanchard, E. et al. Hepatitis C Virus Entry Depends on Clathrin-Mediated Endocytosis. J. Virol. 80, 6964–6972 (2006).


31. Sun, X., Yau, V. K., Briggs, B. J. & Whittaker, G. R. Role of clathrin-mediated endocytosis during vesicular stomatitis virus entry into host cells. J. Virol. 338, 53–60 (2005).
32. Cureton, D. K., Massol, R. H., Saffarian, S. & Kirchhausen, T. L. Vesicular stomatitis virus enters cell through vesicles incompletely coated with clathrin that depend upon actin for internalization. PLoS Pathogens 5, e1000394 (2009).
33. Pelkmans, L. & Helenius, A. Insider information: what viruses tell us about endocytosis. Curr. Opin. Cell Biol. 15, 414–422 (2003).
34. Brindley, M. A. & Maury, W. Equine infectious anemia virus entry occurs through clathrin-mediated endocytosis. J. Virol. 82, 1628–1637 (2008).
35. Sanchez, E. G. et al. African Swine Fever virus uses macropinocytosis to enter host cells. PLoS Pathogens 8, e1002754 (2012).
36. Krzyzaniak, M. A., Zumstein, M. T., Gerez, J. A., Picotti, P. & Helenius, A. Host Cell Entry of Respiratory Syncytial Virus Involves Macropinocytosis Followed by Proteolytic Activation of the F Protein. PLoS Pathogens 9, e1003309 (2013).
37. Haspot, F. et al. Human Cytomegalovirus Entry into Dendritic Cells Occurs via a Macropinocytosis-Like Pathway in a pH-Independent and Cholesterol-Dependent Manner. PLoS One 7, e34795 (2012).
38. Rossman, J. S., Leser, G. P. & Lamb, R. A. Filamentous influenza virus enters cells via macropinocytosis. J. Virol. 86, 10950–10960 (2012).
39. Saeed, M. F., Kolokoltsov, A. A., Albrecht, T. & Davey, R. A. Cellular Entry of Ebola Virus Involves Uptake by a Macropinocytosis-Like Mechanism and Subsequent Trafficking through Early and Late Endosomes. PLoS Pathogens 6, e1001110 (2010).
40. Mulherkar, N., Raaben, M., de la Torre, J. C., Whelan, S. P. & Chandran, K. The Ebola virus glycoprotein mediates entry via a non-classical dynamin-dependent macropinocytic pathway. J. Virol. 419, 72–83 (2011).
41. Nanbo, A. et al. Ebolavirus Is Internalized into Host Cells via Macropinocytosis in a Viral Glycoprotein-Dependent Manner. PLoS Pathogens 6, e1001121 (2010).
42. Cantin, C., Holguera, J., Ferreira, L., Villar, E. & Munoz-Barroso, I. Newcastle disease virus may enter cells by caveolin-mediated endocytosis. J. Gen. Virol. 88, 559–569 (2006).
43. Pelkmans, L. Secrets of caveolin- and lipid raft-mediated endocytosis revealed by mammalian viruses. Biochimica et Biophysica Acta. 1746, 295–304 (2005).
44. Marjomaki et al. Internalization of Echovirus 1 in Caveolin. J. Virol. 76(4), 1856–1865 (2002).
45. Plemper, R. K., Brindley, M. A. & Iorio, R. M. Structural and Mechanistic Studies of Measles Virus Illuminate Paramyxovirus Entry. PLoS Pathogens 7, e1002058 (2011).
46. Laliberte, J. P., Weisberg, A. S. & Moss, B. The Membrane Fusion Step of Vaccinia Virus Entry Is Cooperatively Mediated by Multiple Viral Proteins and Host Cell Components. PLoS Pathogens 7, e1002446 (2011).
47. McCartney, A. W., Greenwood, J. S., Fabian, M. R., White, K. A. & Mullen, R. T. Localization of tomato bushy stunt virus replication protein p33 reveals a peroxisomes to endoplasmic reticulum sorting pathway. Plant Cell 17, 3513–3531 (2005).
48. Wei, T. et al. Sequential recruitment of the endoplasmic reticulum and chloroplast for plant Potyvirus replication. J. Virol. 84, 799–809 (2010).
49. Schepetilnikov, M. V., Manske, U., Solovyev, A. G., Zamyatnin, A. A. & Morozov, S. Y. The hydrophobic segment of potato virus X TGBp3 is a major determinant of the protein intracellular trafficking. J. Gen. Virol. 86, 2379–2391 (2005).
50. Hwang, Y. T., McCartney, A. W., Gidda, S. K. & Mullen, R. T. Localization of Carnation Italian ringspot virus replication protein p36 to the mitochondrial outer membrane is mediated by an internal targeting signal and the TOM complex. BMC Cell. Biol. 9, 1–26 (2008).
51. Laliberte, J. F. & Sanfacon, H. Cellular remodelling during plant virus infection. Ann. Rev. Phytopathology 48, 69–91 (2010).
52. Rochon, D. et al. The p33 auxiliary replicase protein of Cucumber necrosis virus targets peroxisomes and infection induces de novo peroxisome formation from the endoplasmic reticulum. J. Virol. 452–453, 133–42 (2014).
53. De Assis Filho, F. M., Paguio, O. R., Sherwood, J. L. & Deom, C. M. Symptom induction by Cowpea chlorotic mottle virus on Vigna unguiculata is determined by amino acid residue 151 in the coat protein. J. Gen. Virol. 83, 879–883 (2002).
54. Martinez-Turino, S. & Hernandez, C. Analysis of the subcellular targeting of the smaller replicase protein of Pelargonium flower break virus. Virus Res. 163, 580–591 (2011).
55. Miller, D. J., Schwartz, M. D. & Ahlquist, P. Flock House Virus RNA Replicates on Outer Mitochondrial Membranes in Drosophila Cells. J. Virol. 75, 11664–11676 (2001).
56. Nikolovski, N. et al. Putative glycosyltransferases and other plant Golgi apparatus proteins are revealed by LOPIT proteomics. Plant Physiology 160, 1037–1051 (2012).
57. Bansal, S. & Mittal, A. A statistical anomaly indicates symbiotic origins of eukaryotic membranes. Mol. Biol. Cell 26, 1238–1248 (2015).
58. Mittal, A., Jayaram, B., Shenoy, S. R. & Bawa, T. S. A stoichiometry driven universal spatial organization of backbones of folded proteins: Are there Chargaff’s rules for protein folding? J. Biomol. Struct. Dyn. 28, 133–142 (2010).
59. Mittal, A. & Jayaram, B. Backbones of Folded Proteins Reveal Novel Invariant Amino Acid Neighborhoods. J. Biomol. Struct. Dyn. 28, 443–454 (2011).
60. Mittal, A. & Jayaram, B. The Newest View on Protein Folding: Stoichiometric and Spatial Unity in Structural and Functional Diversity. J. Biomol. Struct. Dyn. 28, 669–674 (2011).
61. Mittal, A. & Jayaram, B. A possible molecular metric for biological evolvability. J. Biosci. 37, 573–577 (2012).
62. Mittal, A. & Acharya, C. Protein folding: is it simply surface to volume minimization? J. Biomol. Struct. Dyn. 31, 953–955 (2013).
63. Mittal, A. & Acharya, C. Extracting signatures of spatial organization for biomolecular nanostructures. J. Nanosci. Nanotechnol. 12, 8249–8257 (2012).
64. Bansal, S. & Mittal, A. Extracting curvature preferences of lipids assembled in flat bilayers shows possible kinetic windows for genesis of bilayer asymmetry and domain formation in biological membranes. J. Membrane Biol. 246, 557–570 (2013).
65. Schweitzer, Y., Shemesh, T. & Kozlov, M. M. A Model for Shaping Membrane Sheets by Protein Scaffolds. Biophys. J. 109, 564–73 (2015).

Acknowledgements
SS is grateful to Council of Scientific and Industrial Research (CSIR), Govt. of India for fellowship support. SS is also grateful to her husband Mr. Sudhanshu Shekhar Singh for assistance in writing executable codes for algorithms developed in this work, and to Suneyna Bansal for helpful discussions. The authors are very thankful to Dr. Tim Stevens (Department of Biochemistry, University of Cambridge) for online scientific discussions with SS. Finally, the authors are very grateful to the anonymous reviewers for their extremely valuable constructive inputs that have substantially raised the quality of this manuscript.

Author Contributions
A.M. designed the study and assisted in analyses of the results. S.S. collected the data and performed the analysis. A.M. and S.S. wrote the manuscript.


Additional Information
Supplementary information accompanies this paper at http://www.nature.com/srep

Competing financial interests: The authors declare no competing financial interests.

How to cite this article: Singh, S. and Mittal, A. Transmembrane Domain Lengths Serve as Signatures of Organismal Complexity and Viral Transport Mechanisms. Sci. Rep. 6, 22352; doi: 10.1038/srep22352 (2016).

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/


Supplementary Information

Transmembrane Domain Lengths Serve as Signatures of Organismal Complexity and Viral Transport Mechanisms

by

Snigdha Singh and Aditya Mittal

Index for Supplementary information

1. Figures S1 – S8 with legends
2. Species of fungi
3. Table S1
4. Species of plants
5. Table S2
6. Species of non-mammalian vertebrates
7. Table S3
8. Species of mammals
9. Table S4
10. Table S5

Supplementary Figure S1

[Figure: panels (A) fungi, (B) plants, (C) non-mammalian vertebrates and (D) mammals plot mean hydrophobicity (kcal/mol, KD scale) against position from the cytosolic side to the exoplasmic side of TMDs (-5 to 30) for ER, Golgi, TGN/endo (Nucleus in plants) and PM. Panel (E) shows mean hydrophobic lengths of plasma membrane and panel (F) shows the root mean square difference, each for fungi, plants, non-mammalian vertebrates and mammals.]

Figure S1: Hydrophobicity profiles of TMD lengths of organelles and plasma membranes in fungi, plants, non-mammalian vertebrates and mammals using the Kyte-Doolittle scale.

Note that (A) – (D) above correspond to main Figs. 2A – 2D respectively. Further, (E) above corresponds to main Fig. 2F and (F) above corresponds to main Fig. 3E. The analyses shown in the main figures were done using the GES scale; this figure confirms that our results are independent of the hydrophobicity scale used.
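The profiles in Figure S1 can be read as per-position averages over TMDs aligned at their cytosolic edge. The following is a minimal Python sketch of that idea and not the authors' code: the `(sequence, edge_index)` record layout and the function name `mean_hydrophobicity_profile` are assumptions made for illustration, while the `KD` table holds the published Kyte-Doolittle hydropathy indices. Passing a different `scale` dictionary is one way to repeat the same calculation with another hydrophobicity scale.

```python
# Kyte-Doolittle hydropathy index; larger values are more hydrophobic.
KD = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def mean_hydrophobicity_profile(records, start=-5, stop=30, scale=KD):
    """Average hydrophobicity at each position relative to the cytosolic edge.

    records: iterable of (sequence, edge_index) pairs (hypothetical layout).
      The sequence is written from the cytosolic side to the exoplasmic side,
      and edge_index is the index of the residue taken as position 0.
    Returns {position: mean value}; positions with no residues are omitted.
    """
    profile = {}
    for pos in range(start, stop + 1):
        values = []
        for seq, edge in records:
            i = edge + pos
            # Skip positions outside the sequence and residues not in the scale.
            if 0 <= i < len(seq) and seq[i] in scale:
                values.append(scale[seq[i]])
        if values:
            profile[pos] = sum(values) / len(values)
    return profile


if __name__ == "__main__":
    # Toy sequences, invented purely to exercise the function.
    toy = [("KKLLLL", 2), ("RRIIII", 2)]
    for position, value in mean_hydrophobicity_profile(toy, -2, 5).items():
        print(position, round(value, 2))
```

Residues that are missing from the chosen scale are skipped, so ambiguous codes in real sequence data would not abort the calculation.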

Supplementary Figure S2

[Figure: bar charts of mean hydrophobic lengths (y-axis, 0 to 35). Panel (A): ER, Golgi, TGN/endo (Nucleus for plants) and PM in fungi and plants. Panel (B): ER, Golgi, TGN/endo and PM in non-mammalian vertebrates and mammals.]

Figure S2: (A) Mean hydrophobic lengths of ER, Golgi, TGN, Plasma Membranes (PM) in fungi and ER, Golgi, Nucleus, PM in plants. (B) Mean hydrophobic lengths of ER, Golgi, TGN, PM in non-mammalian vertebrates and ER, Golgi, TGN, PM in mammals.

Supplementary Figure S3

[Figure: heat maps of amino acid abundance for (A) fungi and (B) plants, with one map each for ER, Golgi, Endosomes (Nucleus for plants) and PM. Y-axes list the 20 amino acids (I, L, V, F, W, Y, C, A, G, P, T, S, M, H, Q, N, E, D, K, R); X-axes give positions -4 to 30, spanning the cytosolic and transmembrane regions; grey-scale bars indicate abundance.]

Figure S3 (A) for fungi, (B) for plants: Amino acids (Y-axes) are listed in the order of decreasing hydrophobicity index values. X-axes represent position w.r.t. cytosol. White represents zero and dark grey color a maximum of one. Along the length of the TMDs there is a change in the abundance of hydrophobic residues.

[Figure: heat maps of amino acid abundance for (C) non-mammalian vertebrates and (D) mammals, with one map each for ER, Golgi, Endosomes and PM. Y-axes list the 20 amino acids (I, L, V, F, W, Y, C, A, G, P, T, S, M, H, Q, N, E, D, K, R); X-axes give positions -4 to 30, spanning the cytosolic and transmembrane regions; grey-scale bars indicate abundance.]

Figure S3 (C) for non-mammalian vertebrates, (D) for mammals: Amino acids (Y-axes) are listed in the order of decreasing hydrophobicity index values. X-axes represent position w.r.t. cytosol. White represents zero and dark grey color a maximum of one. Along the length of the TMDs there is a change in the abundance of hydrophobic residues.

Supplementary Figure S3 - Explanation

Positional analyses of amino acid compositions of TMDs from different organelles show interesting compositional differences between and along the TMDs. This analysis served as an extremely important computational positive control for our work: we were able to confirm the findings of Sharpe et al. (2010) with a larger dataset (on fungi and vertebrates) and to validate the methodology developed in this work. Leucine is more abundant in the TMD regions of ER, Golgi, TGN/endo and PM, but less abundant in the cytosolic regions. Valine is abundant in the TMD regions of plasma membrane and TGN/endo. The regions abundant in hydrophobic residues are larger for TGN/endo and plasma membrane proteins than for ER and Golgi proteins, indicating a difference in TMD length. Polar residues are abundant in cytosolic regions.

(A) In fungi, regions enriched in hydrophobic residues are shortest for Golgi proteins. Leucine is most abundant in TMD regions of ER and Golgi.
(B) In plants, regions enriched in hydrophobic residues are similar in ER and Golgi TMDs.
(C) In non-mammalian vertebrates, regions enriched in hydrophobic residues are shortest for Golgi proteins. Leucine is more abundant in ER than in all other organelles.
(D) In mammals also, regions enriched in hydrophobic residues are shortest for Golgi proteins. Regions occupied by hydrophobic residues in membrane proteins of ER, endosomes and plasma membrane are almost similar.

Summary of compositional analysis: In fungi, regions enriched in hydrophobic residues are shortest for Golgi proteins and longest for plasma membrane proteins, indicating a difference in TMD length. Further, different hydrophobic residues were not uniformly distributed through the hydrophobic TMD core. For example, leucine is most abundant in TMD regions of ER and Golgi in fungi. In plants, regions enriched in hydrophobic residues are the same for ER and Golgi proteins, indicating much less difference in TMD length between the two organelles, as also observed from the hydrophobicity graphs above. In plants, leucine is most abundant in TMD regions of ER, Golgi and plasma membrane. In non-mammalian vertebrates, the regions enriched in hydrophobic residues are longest for endosome proteins rather than plasma membrane proteins, indicating a difference in TMD length. Valine is most abundant in the TMD regions of ER and endosomes. Leucine is most abundant in the TMD region of ER in non-mammalian vertebrates as compared to other organelles. In mammals, the regions enriched in hydrophobic residues are shortest for Golgi proteins compared to all other organelles and plasma membranes. Statistical validity of the above results was confirmed by the results shown in Supplementary Fig. S4.
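The positional composition described above amounts to counting, at each position relative to the cytosolic edge, what fraction of the aligned TMDs carry each amino acid. A minimal Python sketch of such a tabulation follows; it is an illustration under stated assumptions and not the authors' implementation. The `(sequence, edge_index)` record layout and the name `positional_abundance` are hypothetical.

```python
from collections import Counter


def positional_abundance(records, start=-4, stop=30):
    """Fraction of each amino acid at each position relative to the cytosolic edge.

    records: iterable of (sequence, edge_index) pairs (hypothetical layout),
      with the sequence written from cytosol to exoplasm and edge_index
      marking the residue taken as position 0.
    Returns {position: {residue: fraction}}; each inner dict sums to 1.
    """
    table = {}
    for pos in range(start, stop + 1):
        counts = Counter()
        for seq, edge in records:
            i = edge + pos
            if 0 <= i < len(seq):
                counts[seq[i]] += 1
        total = sum(counts.values())
        if total:  # omit positions not covered by any sequence
            table[pos] = {aa: n / total for aa, n in counts.items()}
    return table


if __name__ == "__main__":
    # Toy sequences, invented purely to exercise the function.
    toy = [("KLLV", 1), ("RLIV", 1), ("KLLA", 1)]
    for position, fractions in positional_abundance(toy, -1, 2).items():
        print(position, fractions)
```

Each row of a heat map like those in Figure S3 would then correspond to one residue read across positions from this table.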

Supplementary Figure S4

[Figure: six panels plot -log(p-value) (y-axis, 0 to 80) against position from the cytosolic side to the exoplasmic side of TMDs (x-axis, -5 to 30). Legends list fungi, plants, non-mammalian vertebrates and mammals; not all groups appear in every panel. Panel titles: (A) T-test: Golgi vs PM Hydrophobicity; (B) T-test: ER vs PM Hydrophobicity; (C) T-test: TGN vs PM Hydrophobicity; (D) T-test: ER vs Golgi Hydrophobicity; (E) T-test: ER vs TGN Hydrophobicity; (F) T-test: Golgi vs TGN Hydrophobicity.]

Figure S4: Independent (two-sample) t-tests to compare the mean residue hydrophobicity of residues at positions relative to cytosolic edge of TMDs.

(A) Golgi vs. PM: For fungi, the hydrophobicity values of the Golgi and plasma membrane TMDs were highly significantly different between positions 15 and 25. For non-mammalian vertebrates, the difference between Golgi and plasma membrane TMDs was significant from positions 17 to 25, and this difference was more significant in the case of mammals. In mammals, and even in plants, there is a significant difference in the mean hydrophobicity profiles of amino acid residues relative to their positions, i.e. there is a difference in the composition of transmembrane domains between Golgi and plasma membrane.
(B) ER vs. PM: For fungi, the hydrophobicity values of ER and plasma membrane TMDs were highly significantly different between positions 18 and 25. For plants and mammals these differences are more significant than for fungi and non-mammalian vertebrates.
(C) TGN/endo vs. PM: In fungi, the difference between the mean hydrophobicity profiles of amino acid residues was significant to some extent, but in plants, non-mammalian vertebrates and mammals these differences were not significant.
(D) ER vs. Golgi: The difference in mean hydrophobicity profiles of the transmembrane domains of Golgi and endoplasmic reticulum is most significant in mammals, as compared to fungi and non-mammalian vertebrates.
(E) ER vs. TGN/endo: In fungi, the difference between the mean hydrophobicity profiles of amino acid residues was significant from positions 15 to 24.
(F) Golgi vs. TGN/endo: In fungi, the difference between the mean hydrophobicity profiles of amino acid residues was highly significant from positions 15 to 24. In non-mammalian vertebrates the difference was significant from positions 16 to 26, and in mammals the difference between the hydrophobicity profiles is significant from positions 16 to 29.

Results of comprehensive abundance analysis for each amino acid along the TMDs are shown in Fig. S5.
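The per-position comparisons in Figure S4 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' procedure: it uses Welch's unequal-variance form of the two-sample t statistic, approximates the two-sided p-value with the standard normal distribution (reasonable only for large samples), and assumes a base-10 logarithm for the plotted quantity. The function names and the `{position: [values]}` input layout are hypothetical. An exact t-distribution p-value would need a statistics library such as SciPy (`scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance


def neg_log10_p(a, b):
    """-log10 of an approximate two-sided p-value for a difference in means.

    Uses Welch's t statistic with a normal approximation to its distribution,
    which is an assumption made here to keep the sketch dependency-free.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # No spread in either sample: identical means give p = 1, otherwise p = 0.
        return 0.0 if mean(a) == mean(b) else math.inf
    t = (mean(a) - mean(b)) / se
    # Two-sided tail probability of a standard normal: P(|Z| > |t|).
    p = math.erfc(abs(t) / math.sqrt(2))
    return -math.log10(p) if p > 0 else math.inf


def per_position_significance(group_a, group_b):
    """Compare two groups position by position.

    group_a, group_b: {position: [hydrophobicity values]} (hypothetical layout).
    Only positions present in both groups with at least two values are tested.
    """
    shared = sorted(group_a.keys() & group_b.keys())
    return {
        pos: neg_log10_p(group_a[pos], group_b[pos])
        for pos in shared
        if len(group_a[pos]) >= 2 and len(group_b[pos]) >= 2
    }
```

Computing the tail probability with `math.erfc` avoids the loss of precision that occurs when a very small p-value is obtained as one minus a cumulative probability.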

Supplementary Figure S5

[Figure S5(A): line plots of abundance (y-axis, 0 to 0.4) against position from the cytosolic side to the exoplasmic side of TMDs (x-axis, -5 to 30) for isoleucine, leucine, valine and glycine. Rows: ER, Golgi, TGN/endo (Nucleus for plants) and PM; columns: fungi, plants, non-mammalian vertebrates and mammals.]

Figure S5: (A) Analysis of the abundance of isoleucine, leucine, valine and glycine along the TMDs from ER, TGN/endo, Golgi, Nucleus and Plasma membrane in fungi, plants, non-mammalian vertebrates and mammals.

[Figure S5(B): line plots of abundance (y-axis, 0 to 0.4) against position from the cytosolic side to the exoplasmic side of TMDs (x-axis, -5 to 30) for phenylalanine, tryptophan, tyrosine and cysteine. Rows: ER, Golgi, TGN/endo (Nucleus for plants) and PM; columns: fungi, plants, non-mammalian vertebrates and mammals.]

Figure S5: (B) Analysis of the abundance of phenylalanine, tryptophan, tyrosine and cysteine along the TMDs from ER, TGN/endo, Golgi, Nucleus and Plasma membrane in fungi, plants, non-mammalian vertebrates and mammals.

[Figure S5(C): line plots of abundance (y-axis, 0 to 0.4) against position from the cytosolic side to the exoplasmic side of TMDs (x-axis, -5 to 30) for alanine, proline, threonine and serine. Rows: ER, Golgi, TGN/endo (Nucleus for plants) and PM; columns: fungi, plants, non-mammalian vertebrates and mammals.]

Figure S5: (C) Analysis of the abundance of alanine, proline, threonine and serine along the TMDs from ER, TGN/endo, Golgi, Nucleus and Plasma membrane in fungi, plants, non-mammalian vertebrates and mammals.

[Figure S5(D): line plots of abundance (y-axis, 0 to 0.4) against position from the cytosolic side to the exoplasmic side of TMDs (x-axis, -5 to 30) for methionine, histidine, glutamine and asparagine. Rows: ER, Golgi, TGN/endo (Nucleus for plants) and PM; columns: fungi, plants, non-mammalian vertebrates and mammals.]

Figure S5: (D) Analysis of the abundance of methionine, histidine, glutamine and asparagine along the TMDs from ER, TGN/endo, Golgi, Nucleus and Plasma membrane in fungi, plants, non-mammalian vertebrates and mammals.

[Figure S5(E): line plots of abundance (y-axis, 0 to 0.4) against position from the cytosolic side to the exoplasmic side of TMDs (x-axis, -5 to 30) for glutamic acid, aspartic acid, lysine and arginine. Rows: ER, Golgi, TGN/endo (Nucleus for plants) and PM; columns: fungi, plants, non-mammalian vertebrates and mammals.]

Figure S5: (E) Analysis of the abundance of glutamic acid, aspartic acid, lysine and arginine along the TMDs from ER, TGN/endo, Golgi, Nucleus and Plasma membrane in fungi, plants, non-mammalian vertebrates and mammals.

Supplementary Figure S5 - Explanation

Abundance of amino acids: We have analysed the abundance of all 20 amino acids along the TMDs from all the organelles. Isoleucine and leucine are uniformly distributed through the Golgi proteins of fungi, plants, non-mammalian vertebrates and mammals, but they are not symmetrically distributed through the plasma membrane proteins in fungi, plants and non-mammalian vertebrates. Valine and glycine are both asymmetrically distributed through the Golgi proteins and plasma membrane proteins in all the organisms. Leucine is the most abundant residue in the TMD regions of all the organelles in all the organisms, which is clearly seen in the compositional analysis of TMDs (Supplementary Fig. S5A). Phenylalanine was also found to be most abundant in the core region of TMDs in ER and Golgi of fungi and plants (Supplementary Fig. S5B). Prolines were found to be abundant mostly at the edge of the TMDs rather than in the core (Supplementary Fig. S5C). As expected, polar but somewhat neutral residues, i.e. methionine, histidine, glutamine and asparagine, were found to be sparsely populated in the TMD region in all the organelles (Supplementary Fig. S5D). Further, as expected, charged residues, i.e. glutamic acid, aspartic acid, lysine and arginine, are negligible in the TMD regions but abundant in the non-TMD regions in all the organelles (Supplementary Fig. S5E).

Supplementary Figure S6

[Figure: three panels titled "Hydrophobicity-Plants" plot mean hydrophobicity (kcal/mol) against position from the cytosolic side to the exoplasmic side of TMDs (-5 to 30): one for peroxisomes, one for outer and inner chloroplast membranes, and one for outer and inner mitochondrial membranes.]

Figure S6: Positional analysis of TMDs of peroxisomes, chloroplast and mitochondria in plants.

Supplementary Figure S7

[Figure: three panels plot mean hydrophobicity (kcal/mol) against position from the cytosolic side to the exoplasmic side of TMDs (-5 to 30), titled "Hydrophobicity - Macropinocytosis", "Hydrophobicity - Clathrin-mediated endocytosis" and "Hydrophobicity - Caveolin-mediated endocytosis".]

Figure S7: Hydrophobicity graphs of proteins (listed in Table S5; calculations were done after removing redundant sequences as described in methodology) involved in different internalization pathways: macropinocytosis, clathrin-mediated endocytosis and caveolin-mediated endocytosis.

Supplementary Figure S8

[Figure S8 panels: (A) Hydrophobicity, animal virus: Vaccinia virus A21 protein (entry through macropinocytosis). (B) Hydrophobicity, animal virus: Equine infectious anemia virus outer membrane protein (entry through clathrin-mediated endocytosis). (C) Variation in mean TMD length in enveloped viruses, Pathways I to IV. (D) Hydrophobicity, plant virus: Potato virus X TGBp3 (replication site ER). (E) Hydrophobicity, plant virus: TuMV 6K2 protein (replication site ER). (F) Variation in mean TMD length of plant viruses, VRS I to V. In the hydropathy panels, x-axis: position relative to cytosolic edge of TMDs; y-axis: mean hydrophobicity (kcal/mol).]

Figure S8: (A) Hydropathy plot of Vaccinia virus (entry into host cells is through macropinocytosis). (B) Hydropathy plot of Equine infectious anemia virus (entry into host cells is through clathrin-mediated endocytosis). (C) Variation in mean TMD lengths of animal viruses in different entry pathways: bars show the mean hydrophobic lengths of TMDs (±std) of all the viruses analyzed in this study that follow each pathway. Viruses entering their host cells through the clathrin-mediated endocytic pathway have the shortest TMD lengths, whereas viruses penetrating through the plasma membrane have the longest TMD lengths. (D) Hydropathy plot of Potato virus X (replication site is the endoplasmic reticulum). (E) Hydropathy plot of Turnip mosaic virus (replication site is the endoplasmic reticulum). (F) Variation in mean TMD lengths of plant viruses in different replication sites: bars show the mean hydrophobic lengths (±std) of TMDs of different plant viruses at various replication sites. Mean TMD length is shortest for plant viruses whose replication site is the chloroplast and longest for those whose replication site is the peroxisome. However, the data are too sparse to support any strong conclusions.

Figure S8 indicates that, on the basis of TMD lengths and the pattern of hydrophobicity graphs (hydropathy plots) of viral proteomes, one can predict entry pathways (strongly) for animal viruses and replication sites (quite weakly) for plant viruses.

Supplementary Figure S9

[Flowchart]

1. Start from the filtered membrane protein sequences for analysis (containing TMD and flanking sequences).
2. Pick one protein sequence at a time.
3. Find all subsequences of hydrophobicity values <= max value (max value depends on the hydrophobicity scale). Note the starting point and lengths of all subsequences found.
4. Define variables current_best_seq_size = 0, current_best_seq_pos = 0, candidate_sequence = [ ].
5. Set candidate_sequence = [the first/next subsequence of length >= head].
6. Find the next subsequence of length >= body, i.e. body_sequence.
7. Are the candidate_sequence and body_sequence separated by at most one hydrophobicity value more than max value?
   Y: Append body_sequence to candidate_sequence, i.e. candidate_sequence = [candidate_sequence, hydrophobicity value > max value, body_sequence]. If there are any more subsequences, return to step 6; otherwise go to step 8.
   N: Go to step 8.
8. Is length(candidate_sequence) > current_best_seq_size?
   Y: Set current_best_seq_size = length(candidate_sequence), current_best_seq_pos = starting position of candidate_sequence.
   N: Keep the current best values.
9. Are there any more subsequences?
   Y: Return to step 5.
   N: Stop.

Figure S9: Algorithm for refining edges of TM spans. Implementation was done in MATLAB (MathWorks Inc.). For implementation of the above algorithm, we first define a “sequence” and a “subsequence” as follows.

Def 1. A sequence of k numbers is a list of numbers in the form [n1, n2, …, nk], e.g. [1, 2, 3, 1, 1, 0] is a sequence of 6 numbers.

Def 2. A subsequence (of a sequence) of size r starting from position p is the part of the sequence containing r consecutive values starting from the pth position. For example, the subsequence of size 3 starting from position 4 of the example in Def 1 is [1, 1, 0].
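Def 2 can be illustrated with a minimal Python sketch. The function name `subsequence` and its argument names are hypothetical and are not taken from the authors' MATLAB code; the only assumption carried over from the text is that positions are counted from 1.

```python
def subsequence(sequence, size, position):
    """Return `size` consecutive values of `sequence` starting at the
    1-based `position`, following Def 2 (hypothetical helper)."""
    start = position - 1  # convert the 1-based position to a 0-based index
    if size < 0 or start < 0 or start + size > len(sequence):
        raise ValueError("requested subsequence lies outside the sequence")
    return sequence[start:start + size]


# The example from Def 1 and Def 2: size 3, starting at position 4.
print(subsequence([1, 2, 3, 1, 1, 0], 3, 4))  # [1, 1, 0]
```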

When starting a new subsequence, i.e. a group of amino acids in a single sequence whose hydrophobicity values are <= max value (the max value depends on the hydrophobicity scale and decides whether an amino acid lies inside or outside a TMD), we first look for a subsequence of size head or more, i.e. consecutive amino acids whose hydrophobicity values are less than or equal to max value, at the beginning of the subsequence. Once we find it, we call it the candidate_sequence. Then we find the next body_sequence of size body or more, i.e. a run of consecutive hydrophobicity values that are less than or equal to max value and lie between two values that are more than max value. If the two are separated by at most one amino acid whose hydrophobicity value is > max value, we redefine candidate_sequence = [candidate_sequence, one amino acid, body_sequence]. We keep repeating this until the sequence is broken by at least two large entries, i.e. two consecutive amino acids whose hydrophobicity values are more than max value, or until we reach the end of the whole sequence.

Then we check whether our candidate_sequence is larger than any subsequence found earlier, i.e. whether the current group of consecutive amino acids whose hydrophobicity values are less than or equal to max value is longer than any group found earlier. If yes, we replace our “largest so far sequence” by candidate_sequence. Then we check whether there are any more subsequences. If yes, we find the next subsequence of size head or more, call it candidate_sequence and repeat.
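The procedure described above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' MATLAB implementation: the names `find_runs` and `refine_tm_span` are hypothetical, positions are 0-based, the example values are made up, and the sketch assumes that a candidate stops growing as soon as the next run is either shorter than `body` or separated by more than one value above `max_value`.

```python
def find_runs(values, max_value):
    """Return (start, length) for every maximal run of consecutive
    values <= max_value; start is a 0-based index."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if v <= max_value:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(values) - start))
    return runs


def refine_tm_span(values, max_value, head, body):
    """Return (position, size) of the longest candidate span.

    A candidate starts with a run of length >= head and is extended by
    each following run of length >= body that is separated from it by
    exactly one value > max_value. Returns (0, 0) if nothing qualifies.
    """
    runs = find_runs(values, max_value)
    best_pos, best_size = 0, 0
    i = 0
    while i < len(runs):
        start, length = runs[i]
        i += 1
        if length < head:
            continue  # too short to start a candidate
        end = start + length  # exclusive end of the candidate
        while i < len(runs):
            next_start, next_length = runs[i]
            if next_start - end == 1 and next_length >= body:
                end = next_start + next_length  # absorb the gap and the run
                i += 1
            else:
                break  # this run may still start a new candidate
        if end - start > best_size:
            best_pos, best_size = start, end - start
    return best_pos, best_size


# Made-up hydrophobicity values; lower means more hydrophobic here.
example = [5, -1, -2, -1, 4, -1, -1, 6, 7, -3, -3]
print(refine_tm_span(example, 0, 3, 2))  # (1, 6)
```

In the example, the run at indices 1 to 3 starts the candidate, the single value 4 is bridged, and the run at indices 5 and 6 is appended; the two consecutive large values 6 and 7 then end the candidate.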

Species of fungi from which proteome sequences were collected:

S. cerevisiae, T. phaffii, V. polyspora, T. delbrueckii, Z. rouxii, T. blattae, K. lactis, C. tenuis, D. hansenii, S. passalidarum, M. guilliermondii, S. pombe, C. dubliniensis, B. oryzae, S. stipitis, B. sorokiniana, B. zeicola, W. sebi, C. militaris, L. maculans, S. japonicus y, P. teres, K. pastoris, P. tritici-repentis Pt-1C-BFP, C. lusitaniae, K. pastoris, P. tritici-repentis Pt-1C-BFP, C. lusitaniae ATCC 42720, C. neoformans var. neoformans JEC21, A. capsulatus NAm1, A. niger CBS 513.88, A. dermatitidis SLH14081, P. strigosozonata HHB-11173 SS5, P. chrysogenum Wisconsin 54-1255, A. terreus NIH2624, A. otae CBS 113480, E. lata UCREL1, N. fischeri NRRL 181, C. puteana RWD-64-598 SS2, T. rubrum CBS 118892, Y. lipolytica, M. globosa CBS 7966, C. immitis RS, U. reesii 1704, A. benhamiae CBS 112371, M. brunnea f. sp. 'multigermtubi' MB_m1, C. posadasii C735 delta SOWgp, T. verrucosum HKI 0517, B. dendrobatidis JAM81, A. clavatus NRRL 1, P. flocculosa PF-1, F. graminearum PH-1, T. mesenterica DSM 1558, A. nidulans FGSC A4, L. elongisporus NRRL YB-4239, C. globosum CBS 148.51, T. stipitatus ATCC 10500, P. anserina S mat+, M. anisopliae ARSEF 23, B. dendrobatidis JAM81, C. thermophilum var. thermophilum DSM 1495, S. macrospora k-hell, E. pusillum Z07020, T. terrestris NRRL 8126, G. trabeum ATCC 11539, S. commune H4-8, M. thermophila ATCC 42464, N. parvum UCRNP2, C. albicans SC5314, S. sclerotiorum 1980, C. fioriniae PJ7, M. acridum CQMa 102, T. reesei QM6a, A. flavus NRRL3357, P. carnosa HHB-10118-sp, M. larici-populina 98AG31, C. yegresii CBS 114405, B. compniacensis UAMH 10762, T. versicolor FP-101664 SS1, F. mediterranea MF3/22, S. hirsutum FP-91666 SS1, N. haematococca mpVI 77-13-4, M. roreri MCA 2997, P. graminis f. sp. tritici CRL 75-36-700-3, P. fijiensis CIRAD86, D. squalens LYAD-421 SS1, C. psammophila CBS 110553, P. nodorum SN15, C. epimyces CBS 606.96, P. nodorum SN15, C. epimyces CBS 606.96, C. coronata CBS 617.96, A. delicata TFB-10046 SS5, Z. tritici IPO323, P. fici W106-1, T. minima UCRPA7, G. ATCC 20868, S. lacrymans var. lacrymans S7.9, C. cinerea okayama7#130, L. bicolor S238N-H82, A. bisporus var. burnettii JB137-S8, A. bisporus var. bisporus H97, M. farinosa CBS 7064, C. glabrata CBS 138, P. carnosa HHB-10118-sp, N. castellii CBS 4309, K. africana CBS 2517, V. polyspora DSM 70294, A. gossypii ATCC 10895, L. thermotolerans, E. cymbalariae DBVPG#7215, N. dairenensis CBS 421, M. farinosa CBS 7064, S. stipitis CBS 6054, M. guilliermondii ATCC 6260, D. hansenii CBS767, S. passalidarum NRRL Y-27907, C. lusitaniae ATCC 42720, Y. lipolytica, U. reesii 1704, T. rubrum CBS 118892, T. verrucosum HKI 0517, A. benhamiae CBS 112371, A. otae CBS 113480, A. gypseum CBS 118893, C. posadasii C735 delta SOWgp, N. parvum UCRNP2, T. stipitatus ATCC 10500, A. delicata TFB-10046 SS5, A. dermatitidis SLH14081, T. terrestris NRRL 8126, E. lata UCREL1, A. fumigatus Af293, A. nidulans FGSC A4, P. nodorum SN15, P. chrysogenum Wisconsin 54-1255, T. melanosporum Mel28, C. globosum CBS 148.51, P. tritici-repentis Pt-1C-BFP, B. sorokiniana ND90Pr, M. oryzae 70-15, S. sclerotiorum 1980, P. flocculosa PF-1, P. fijiensis CIRAD86, T. minima UCRPA7, V. alfalfae VaMs.102, U. maydis 521, S. turcica Et28A, C. thermophilum var. thermophilum DSM 1495, E. pusillum Z07020, G. lozoyensis ATCC 20868, A. capsulatus NAm1, M. acridum CQMa 102, M. brunnea f. sp. 'multigermtubi' MB_m1, Z. tritici IPO323, S. hirsutum FP-91666 SS1, B. compniacensis UAMH 10762, M. anisopliae ARSEF 23, C. militaris CM01, C. gloeosporioides Nara gc5, K. lactis NRRL Y-1140, U. reesii 1704, V. alfalfae VaMs.102, C. puteana RWD-64-598 SS2, D. squalens LYAD-421 SS1.

Supplementary Table S1: Reference proteins from S. cerevisiae

ER Golgi TGN/endo PM NCBI accession NCBI accession NCBI accession NCBI accession

NP_009343.1 NP_010916.1 NP_014161.1 NP_010898.1 NP_009730.3 NP_015485.1 NP_011312.1 NP_013774.1 NP_009878.1 NP_010531.1 NP_009536.1 NP_011528.3 NP_010618.1 NP_013592.1 NP_015404.1 NP_011537.1 NP_010914.3 NP_014965.3 NP_012095.1 NP_012126.1 NP_010936.1 NP_012352.2 NP_014862.3 NP_013026.2 NP_011011.3 NP_011257.1 NP_013894.1 NP_010708.1 NP_011026.3 NP_011659.3 NP_013185.1 NP_011046.3 NP_012609.3 NP_013436.1 NP_012058.1 NP_012349.2 NP_014116.1 NP_012060.1 NP_011477.1 NP_014536.1 NP_012261.1 NP_012721.1 NP_014650.1 NP_012299.1 NP_009571.1 NP_015400.1 NP_012532.3 NP_012251.3 NP_012905.1 NP_013465.1 NP_013308.1 NP_012396.1 NP_013884.1 NP_009758.3 NP_009626.1 NP_012987.3 NP_010371.1 NP_014369.3 NP_012665.3 NP_015272.1 NP_014423.1 NP_010771.1 NP_013167.1 NP_014742.1 NP_014008.1 NP_009764.3 NP_014288.3 NP_014405.1 NP_015150.2 NP_011946.1 NP_012768.1

Species of plants from which proteome sequences were collected:

A. thaliana, A. lyrata subsp. lyrata, E. salsugineum, V. vinifera, T. cacao, P. trichocarpa, G. max, R. communis, C. arietinum, P. persica, C. sativus, F. vesca subsp. vesca, P. vulgaris, C. clementina, C. sinensis, M. domestica, P. mume, S. tuberosum, S. lycopersicum, C. melo, A. trichopoda, O. sativa Japonica Group, S. italica, S. bicolor, B. distachyon, O. brachyantha, C. rubella, T. cacao, Z. mays, M. truncatula

Supplementary Table S2: Reference proteins from A. thaliana

ER Golgi Nucleus PM

NCBI accession NCBI accession NCBI accession NCBI accession

NP_565035.1 NP_179585.3 NP_564846.1 NP_188368.2 NP_171714.2 NP_173257.2 NP_566380.2 NP_197162.1 NP_564083.1 NP_179754.1 NP_196118.1 NP_196833.2 NP_173608.1 NP_566611.1 NP_563923.1 NP_177697.1 NP_180150.1 NP_563804.4 NP_564287.1 NP_180241.1 NP_180103.2 NP_683468.1 NP_176892.1 NP_175957.1 NP_178281.1 NP_192086.1 NP_001154799.1 NP_196345.1 NP_188924.3 NP_177220.2 NP_178405.2 NP_190214.1 NP_568917.1 NP_193730.1 NP_566705.1 NP_197789.1 NP_177650.1 NP_192084.1 NP_567929.1 NP_181242.3 NP_177736.1 NP_177220.2 NP_177046.1 NP_001031499.1 NP_177463.1 NP_171983.2 NP_187611.2 NP_179051.1 NP_172041.2 NP_177618.2 NP_175600.2 NP_174119.1 NP_173401.1 NP_172600.1 NP_191525.2 NP_175787.1 NP_175601.2 NP_565249.1 NP_176531.1 NP_172580.2 NP_172854.1 NP_564275.1 NP_175592.2 NP_564672.2 NP_197349.2 NP_190218.1 NP_176675.1 NP_565817.2 NP_181242.3 NP_178590.1 NP_564345.1 NP_179051.1 NP_564905.1 NP_564255.1 NP_175600.2 NP_175469.2 NP_564297.1 NP_175592.2 NP_172633.1 NP_175565.2 NP_190218.1 NP_173458.1 NP_563728.1 NP_190742.1 NP_027545.1 NP_177675.1 NP_569046.1 NP_179363.1 NP_177675.1 NP_172468.3 NP_187425.1 NP_564109.1 NP_198672.1 NP_187425.1 NP_565926.1 NP_190723.1 NP_188924.3 NP_565960.1 NP_180241.1 NP_568202.1 NP_566170.1 NP_180747.1 NP_198817.1 NP_566675.1 NP_568395.1 NP_564345.1 NP_197962.1 NP_179627.2 NP_196868.1 NP_850241.1 NP_565485.1 NP_187467.1 NP_191796.1 NP_189024.1


Species of non-mammalian vertebrates from which proteome sequences were collected:

G. gallus, D. rerio, P. nyererei, M. zebra, A. platyrhynchos, T. guttata, S. salar, F. albicollis, F. peregrinus, X. (Silurana) tropicalis, P. humilis, F. cherrug, G. fortis, T. rubripes, X. laevis, O. mykiss, O. latipes, Z. albicollis, C. livia, P. nyererei, C. picta bellii, M. gallopavo, O. niloticus, F. albicollis, M. undulatus

Supplementary Table S3: Reference proteins from G. gallus

ER Golgi TGN/endo PM

NCBI accession NCBI accession NCBI accession NCBI accession

NP_001025791.2 NP_001012812.1 XP_424186.2 NP_001019749.1 NP_001026449.1 NP_989759.1 XP_426353.4 NP_001025811.1 NP_001161480.1 NP_001026698.1 XP_003643604.2 XP_424067.4 NP_001006296.1 NP_001012812.1 XP_004938165.1 NP_990189.1 NP_001006561.1 NP_001026647.1 XP_414709.4 XP_004940194.1 NP_990628.1 NP_001072970.1 NP_989958.1 NP_990202.1 NP_001006143.1 NP_989842.1 NP_001004387.1 XP_001234609.3 XP_003643564.1 NP_001001603.1 NP_990500.1 NP_989980.1 XP_004937051.1 NP_001001603.1 NP_001001782.1 XP_001236152.2 NP_001264423.1 NP_989996.1 XP_001236152.2 NP_990548.1 XP_004938384.1 XP_414360.4 NP_990571.1 NP_001180224.1 XP_419082.3 NP_990564.1 XP_003643030.2 XP_416434.4 NP_990572.1 NP_001075887.1 NP_001026316.1 NP_001161219.1 NP_989592.2 NP_001005571.1 NP_001264360.1 NP_990543.1 NP_001026394.1 NP_001026658.1 NP_990709.1 NP_001007947.1 NP_990452.1 XP_417853.3 NP_001029988.2 NP_001025561.1 XP_430156.3 NP_001026749.1 XP_422670.2 NP_001012842.1 XP_415569.4 NP_001026187.1 NP_001006143.1 NP_001035000.1 XP_004939048.1 NP_990533.1 XP_422836.2 NP_990534.1 NP_001239028.1 NP_001107233.1 NP_001004383.1 NP_001264768.1


Species of mammals from which proteome sequences were collected:

H. sapiens, B. acutorostrata scammoni, P. catodon, C. lanigera, S. scrofa, S. harrisii, C. lupus familiaris, M. putorius furo, D. novemcinctus, M. auratus, T. chinensis, P. tigris altaica, P. anubis, T. syrichta, P. catodon, O. anatinus, R. norvegicus, T. truncatus, C. hircus, P. maniculatus bairdii, M. musculus, B. bubalis, O. cuniculus, E. telfairi, B. mutus, M. ochrogaster, E. edwardii, E. europaeus, H. glaber, S. araneus, C. porcellus, O. princeps, C. sabaeus, B. taurus, T. chinensis, A. melanoleuca, M. domestica, O. degus, S. harrisii, D. novemcinctus, O. garnettii, J. jaculus, G. gorilla gorilla, N. leucogenys, P. troglodytes, C. jacchus, C. sabaeus, I. tridecemlineatus, S. boliviensis boliviensis, P. tigris altaica, E. caballus, C. cristata, F. catus, D. novemcinctus, M. davidii, T. manatus latirostris, C. lupus familiaris, B. bubalis, P. maniculatus bairdii, B. mutus, O. rosmarus divergens, C. ferus, C. hircus, V. pacos, H. glaber, C. lanigera, P. alecto, P. hodgsonii, M. ochrogaster, P. catodon, C. asiatica, L. vexillifer, O. aries, C. simum simum, P. abelii, S. scrofa, B. acutorostrata scammoni, O. orca, O. princeps, S. araneus, E. fuscus, O. rosmarus divergens, E. europaeus, O. afer afer, M. mulatta, V. pacos, P. tigris altaica, M. putorius furo, C. cristata, L. weddellii.

Supplementary Table S4: Reference proteins from H. sapiens

ER Golgi TGN/endo PM

NCBI accession NCBI accession NCBI accession NCBI accession

NP_005207.2 NP_005104.3 NP_002950.3 NP_000623.2 NP_057021.2 NP_003587.1 NP_001013049.1 NP_000352.1 NP_005056.3 NP_003586.3 NP_055793.1 NP_001984.1 NP_847894.1 NP_542780.1 NP_000867.2 NP_001992.1 NP_001065.2 NP_036332.2 NP_057340.2 NP_000065.1 NP_061949.3 NP_775811.2 NP_038479.1 NP_000751.1 NP_001066.1 NP_000139.1 XP_006716052.1 NP_724781.1 NP_001064.1 NP_835368.1 NP_061725.1 NP_002987.1 NP_066962.2 NP_000141.1 NP_061725.1 NP_003174.3 NP_002941.1 NP_116053.3 NP_061732.1 NP_000034.1 NP_004353.1 NP_001008978.1 NP_114088.2 NP_000386.1 NP_003892.2 NP_002025.2 NP_958816.1 NP_001241.1 NP_003181.3 NP_003887.3 NP_061721.2 NP_000607.1 NP_659425.1 NP_110392.1 NP_002380.3 NP_003054.1 NP_009141.2 NP_000112.1 NP_060331.3 NP_001482.1 NP_002200.2 NP_004269.1 NP_057675.1 NP_003801.1 NP_067026.3 NP_005898.2 NP_075598.2 XP_004331703.1 NP_001534.1 NP_000132.3 NP_000771.2 NP_004742.1 NP_005182.1 NP_079374.2 NP_004767.1 NP_000867.2 NP_068747.1 NP_001488.2 NP_001772.1 NP_001077.2 NP_009186.1 NP_002429.1 NP_055489.1 NP_004473.2 NP_001450.2 NP_006450.2 NP_004258.2 NP_001056.1 NP_006143.2 NP_006867.1 NP_004098.1 NP_036346.1 NP_000890.1 NP_059132.1 NP_000407.1 NP_002399.1 NP_003255.2 NP_065112.1 NP_001018074.1 NP_078917.2 NP_000409.1 XP_005261665.1 NP_000192.2 NP_002401.1 NP_001069.1 NP_000866.1 NP_000197.1 NP_000408.1 NP_001720.1 NP_001549.2 NP_006395.2 NP_000879.2 NP_001101.1 NP_003807.1 NP_001973.2 NP_002118.1 NP_002829.3 NP_006130.1 NP_001093256.1 NP_061324.1 NP_065434.1 NP_000011.2 NP_000940.1 NP_000630.1 NP_068813.1 NP_001265515.1 NP_002990.2 NP_004320.2 NP_004451.2

Supplementary Table S5: List of proteins associated with macropinocytosis, clathrin-mediated endocytosis and caveolin-mediated endocytosis

Endocytic Pathway | Protein Name | Accession Number | Species
Macropinocytosis | platelet-derived growth factor receptor betaA | XP_014939043.1 | Acinonyx jubatus
Macropinocytosis | integrin beta-1 isoform 1A | NP_002202.2 | Homo sapiens
Macropinocytosis | integrin beta-1 | XP_008154521.1 | Eptesicus fuscus
Macropinocytosis | tyrosine-protein kinase receptor UFO isoform X2 | XP_003812512.1 | Pan paniscus
Macropinocytosis | tyrosine-protein kinase receptor UFO isoform X3 | XP_008137790.1 | Eptesicus fuscus
Macropinocytosis | tyrosine-protein kinase receptor UFO isoform X1 | XP_009192815.1 | Papio anubis
Macropinocytosis | tyrosine-protein kinase receptor UFO | XP_012316765.1 | Aotus nancymaae
Macropinocytosis | integrin beta-1 isoform X1 | XP_007189912.1 | Balaenoptera acutorostrata scammoni
Macropinocytosis | platelet-derived growth factor receptor beta isoform X2 | XP_013367670.1 | Chinchilla lanigera
Macropinocytosis | platelet-derived growth factor receptor beta isoform X2 | XP_012360448.1 | Nomascus leucogenys
Macropinocytosis | tyrosine-protein kinase receptor UFO | XP_004060847.1 | Gorilla gorilla gorilla
Macropinocytosis | tyrosine-protein kinase receptor UFO isoform X1 | XP_007650518.1 | Cricetulus griseus
Macropinocytosis | LOW QUALITY PROTEIN: integrin beta-1-like | XP_012786696.1 | Ochotona princeps
Macropinocytosis | LOW QUALITY PROTEIN: integrin beta-1-like | XP_009243525.1 | Pongo abelii
Macropinocytosis | platelet-derived growth factor receptor alpha isoform X2 | XP_006714102.1 | Homo sapiens
Macropinocytosis | platelet-derived growth factor receptor alpha | XP_004608091.1 | Sorex araneus
Macropinocytosis | platelet-derived growth factor receptor beta-like | XP_005661871.2 | Sus scrofa
Macropinocytosis | tyrosine-protein kinase receptor UFO | XP_012589186.1 | Condylura cristata
Macropinocytosis | platelet-derived growth factor receptor beta-like | XP_006751633.1 | Leptonychotes weddellii
Clathrin-mediated | low-density lipoprotein receptor-related protein 3 isoform X1 | XP_012881751.1 | Dipodomys ordii
Clathrin-mediated | low-density lipoprotein receptor-related protein 8 isoform X4 | XP_008762161.1 | Rattus norvegicus
Clathrin-mediated | low-density lipoprotein receptor-related protein 8 isoform X5 | XP_011993212.1 | Ovis aries musimon
Clathrin-mediated | low-density lipoprotein receptor-related protein 6 | XP_014307184.1 | Myotis lucifugus
Clathrin-mediated | low-density lipoprotein receptor isoform 1 precursor | NP_000518.1 | Homo sapiens
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B | XP_013211038.1 | Microtus ochrogaster
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 isoform X3 | XP_011518406.1 | Homo sapiens
Clathrin-mediated | transferrin receptor protein 2 isoform X4 | XP_004840245.1 | Heterocephalus glaber
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 isoform X1 | XP_013369679.1 | Chinchilla lanigera
Clathrin-mediated | transferrin receptor protein 1 | XP_008062434.1 | Tarsius syrichta
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_004674925.1 | Condylura cristata

Clathrin-mediated | low-density lipoprotein receptor class A domain-containing protein 3 | XP_012408029.1 | Sarcophilus harrisii
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_012860404.1 | Echinops telfairi
Clathrin-mediated | transferrin receptor protein 2 isoform X2 | XP_011912650.1 | Cercocebus atys
Clathrin-mediated | transferrin receptor protein 1 isoform 1 | NP_003225.2 | Homo sapiens
Clathrin-mediated | receptor protein 1 | NP_001009312.1 | Felis catus
Clathrin-mediated | receptor protein 1 isoform 2 | NP_001300894.1 | Homo sapiens
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_008587166.1 | Galeopterus variegatus
Clathrin-mediated | transferrin receptor protein 2 isoform X5 | XP_012666270.1 | Otolemur garnettii
Clathrin-mediated | MAM and LDL-receptor class A domain-containing protein 1 | XP_011225420.1 | Ailuropoda melanoleuca
Clathrin-mediated | transferrin receptor protein 2 | XP_005371498.1 | Microtus ochrogaster
Clathrin-mediated | low-density lipoprotein receptor-related protein 8 isoform X1 | XP_013203988.1 | Microtus ochrogaster
Clathrin-mediated | low-density lipoprotein receptor-related protein 6-like | XP_007114085.1 | Physeter catodon
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_006060785.1 | Bubalus bubalis
Clathrin-mediated | low-density lipoprotein receptor isoform X4 | XP_006875187.1 | Chrysochloris asiatica
Clathrin-mediated | transferrin receptor protein 2 | XP_003510681.2 | Cricetulus griseus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_012505828.1 | Propithecus coquereli
Clathrin-mediated | LOW QUALITY PROTEIN: transferrin receptor protein 1 | XP_008982367.1 | Callithrix jacchus
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 | XP_004627344.1 | Octodon degus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B isoform X4 | XP_011758124.1 | Macaca nemestrina
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_007174157.1 | Balaenoptera acutorostrata scammoni
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_006753857.1 | Myotis davidii
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 5 | XP_012612840.1 | Microcebus murinus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_005654214.1 | Sus scrofa
Clathrin-mediated | transferrin receptor protein 2 | XP_004399094.1 | Odobenus rosmarus divergens
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_010362122.1 | Rhinopithecus roxellana
Clathrin-mediated | low-density lipoprotein receptor-like isoform X1 | XP_011236420.1 | Mus musculus
Clathrin-mediated | low-density lipoprotein receptor isoform X2 | XP_012397333.1 | Sarcophilus harrisii
Clathrin-mediated | lipoprotein receptor isoform 2 precursor | NP_001239587.1 | Mus musculus
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_014641797.1 | Ceratotherium simum simum
Clathrin-mediated | low-density lipoprotein receptor-related protein 8-like | XP_013207424.1 | Microtus ochrogaster
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 5 | XP_013831482.1 | Capra hircus

Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_005200458.2 | Bos taurus
Clathrin-mediated | transferrin receptor protein 1 isoform X2 | XP_012496738.1 | Propithecus coquereli
Clathrin-mediated | transferrin receptor protein 1 | XP_004644608.1 | Octodon degus
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 isoform X2 | XP_007497461.1 | Monodelphis domestica
Clathrin-mediated | low-density lipoprotein receptor isoform X2 | XP_007488582.1 | Monodelphis domestica
Clathrin-mediated | low-density lipoprotein receptor isoform X2 | XP_012585162.1 | Condylura cristata
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_003763886.1 | Sarcophilus harrisii
Clathrin-mediated | low-density lipoprotein receptor-related protein 10 | XP_003474549.2 | Cavia porcellus
Clathrin-mediated | transferrin receptor protein 1 isoform X1 | XP_005639505.1 | Canis lupus familiaris
Clathrin-mediated | receptor protein 1 | NP_001254781.1 | Heterocephalus glaber
Clathrin-mediated | transferrin receptor protein 1 | XP_001364120.1 | Monodelphis domestica
Clathrin-mediated | transferrin receptor protein 2 | XP_001505111.1 | Equus caballus
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 isoform X2 | XP_013002923.1 | Cavia porcellus
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 | XP_001519450.1 | Ornithorhynchus anatinus
Clathrin-mediated | low-density lipoprotein receptor isoform X1 | XP_011526312.1 | Homo sapiens
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 isoform X4 | XP_008542526.1 | Equus przewalskii
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_011815992.1 | Colobus angolensis palliatus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B isoform X6 | XP_011237519.1 | Mus musculus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B isoform X1 | XP_003581884.2 | Bos taurus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_006727745.1 | Leptonychotes weddellii
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_004032799.1 | Gorilla gorilla gorilla
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 isoform X3 | XP_011902308.1 | Cercocebus atys
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 isoform X3 | XP_005578074.1 | Macaca fascicularis
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 | XP_005064978.2 | Mesocricetus auratus
Clathrin-mediated | low-density lipoprotein receptor class A domain-containing protein 3 isoform X1 | XP_005661084.2 | Sus scrofa
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 | XP_002763723.2 | Callithrix jacchus
Clathrin-mediated | low-density lipoprotein receptor-related protein 8-like | XP_007658590.1 | Ornithorhynchus anatinus
Clathrin-mediated | low-density lipoprotein receptor | XP_012789970.1 | Sorex araneus
Clathrin-mediated | low-density lipoprotein receptor | XP_011748364.1 | Macaca nemestrina
Clathrin-mediated | low-density lipoprotein receptor | XP_012381542.1 | Dasypus novemcinctus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 6 | XP_009245765.1 | Pongo abelii

Clathrin-mediated | low-density lipoprotein receptor-related protein 5-like | XP_008252213.1 | Oryctolagus cuniculus
Clathrin-mediated | low-density lipoprotein receptor-related protein 5-like | XP_014585043.1 | Equus caballus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 4-like | XP_008988734.1 | Callithrix jacchus
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_014923340.1 | Acinonyx jubatus
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_007664888.1 | Ornithorhynchus anatinus
Clathrin-mediated | low-density lipoprotein receptor class A domain-containing protein 3-like | XP_007113202.1 | Physeter catodon
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_006768719.1 | Myotis davidii
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_007489534.1 | Monodelphis domestica
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B isoform X3 | XP_007652067.1 | Cricetulus griseus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor class A domain-containing protein 3 | XP_014448558.1 | Tupaia chinensis
Clathrin-mediated | low-density lipoprotein receptor-related protein 3 isoform X1 | XP_005653323.1 | Sus scrofa
Clathrin-mediated | low-density lipoprotein receptor-related protein 8-like | XP_005001140.1 | Cavia porcellus
Clathrin-mediated | low-density lipoprotein receptor | XP_010327865.1 | Saimiri boliviensis boliviensis
Clathrin-mediated | low-density lipoprotein receptor | XP_012364768.1 | Nomascus leucogenys
Clathrin-mediated | low-density lipoprotein receptor | XP_012889401.1 | Dipodomys ordii
Clathrin-mediated | MAM and LDL-receptor class A domain-containing protein 1 | XP_014394873.1 | Myotis brandtii
Clathrin-mediated | low-density lipoprotein receptor class A domain-containing protein 3 | XP_014393681.1 | Myotis brandtii
Clathrin-mediated | LOW QUALITY PROTEIN: MAM and LDL-receptor class A domain-containing protein 1 | XP_013001437.1 | Cavia porcellus
Clathrin-mediated | low-density lipoprotein receptor | XP_010372129.1 | Rhinopithecus roxellana
Clathrin-mediated | transferrin receptor protein 1 | XP_006884261.1 | Elephantulus edwardii
Clathrin-mediated | transferrin receptor protein 1 | XP_012662345.1 | Otolemur garnettii
Clathrin-mediated | transferrin receptor protein 2 | XP_001371634.2 | Monodelphis domestica
Clathrin-mediated | LOW QUALITY PROTEIN: transferrin receptor protein 2 | XP_014941216.1 | Acinonyx jubatus
Clathrin-mediated | transferrin receptor protein 2 | XP_014709450.1 | Equus asinus
Clathrin-mediated | LOW QUALITY PROTEIN: transferrin receptor protein 2 | XP_008696730.1 | Ursus maritimus
Clathrin-mediated | transferrin receptor protein 1-like | XP_003585042.2 | Bos taurus
Clathrin-mediated | low-density lipoprotein receptor-related protein 8 | XP_014310147.1 | Myotis lucifugus
Clathrin-mediated | low-density lipoprotein receptor-related protein 8 isoform X8 | XP_011540398.1 | Homo sapiens
Clathrin-mediated | low-density lipoprotein receptor-related protein 8 | XP_006768452.1 | Myotis davidii
Clathrin-mediated | low-density lipoprotein receptor | XP_009191821.1 | Papio anubis

Clathrin-mediated | low-density lipoprotein receptor-related protein 4 | XP_006984456.1 | Peromyscus maniculatus bairdii
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 isoform X1 | XP_006499306.1 | Mus musculus
Clathrin-mediated | low-density lipoprotein receptor-related protein 4 | XP_007519216.1 | Erinaceus europaeus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 8 | XP_012807363.1 | Jaculus jaculus
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 precursor | NP_110454.1 | Rattus norvegicus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor | XP_006905916.1 | Pteropus alecto
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 isoform X2 | XP_007669389.1 | Ornithorhynchus anatinus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 2 | XP_007494986.1 | Monodelphis domestica
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_013837666.1 | Sus scrofa
Clathrin-mediated | LOW QUALITY PROTEIN: low density lipoprotein receptor-related protein 1 | XP_007508013.1 | Monodelphis domestica
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_006902182.1 | Elephantulus edwardii
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_005346576.1 | Microtus ochrogaster
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 | XP_008256929.1 | Oryctolagus cuniculus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B | XP_012968045.1 | Mesocricetus auratus
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_012381523.1 | Dasypus novemcinctus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B | XP_004473967.1 | Dasypus novemcinctus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B | XP_004375173.1 | Trichechus manatus latirostris
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 2 | XP_012381521.1 | Dasypus novemcinctus
Clathrin-mediated | low-density lipoprotein receptor-like | XP_008562823.1 | Galeopterus variegatus
Clathrin-mediated | low-density lipoprotein receptor-like | XP_008709151.1 | Ursus maritimus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1-like | XP_013376941.1 | Chinchilla lanigera
Clathrin-mediated | low-density lipoprotein receptor-related protein 1-like | XP_014309132.1 | Myotis lucifugus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B | NP_001101313.1 | Rattus norvegicus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_004867723.1 | Heterocephalus glaber
Clathrin-mediated | low-density lipoprotein receptor-related protein 1-like | XP_005000327.1 | Cavia porcellus
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_012317768.1 | Aotus nancymaae
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_004321464.1 | Tursiops truncatus

Clathrin-mediated | low-density lipoprotein receptor-like | XP_007539201.1 | Erinaceus europaeus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1 isoform X5 | XP_014393380.1 | Myotis brandtii
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor | XP_011378606.1 | Pteropus vampyrus
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_009180405.1 | Papio anubis
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_008589038.1 | Galeopterus variegatus
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_011225837.1 | Ailuropoda melanoleuca
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_002812501.1 | Pongo abelii
Clathrin-mediated | low-density lipoprotein receptor-related protein 1B-like | XP_012505827.1 | Propithecus coquereli
Clathrin-mediated | low-density lipoprotein receptor-related protein 4-like | XP_011371513.1 | Pteropus vampyrus
Clathrin-mediated | low-density lipoprotein receptor-related protein 5-like protein | XP_008567790.1 | Galeopterus variegatus
Clathrin-mediated | low-density lipoprotein receptor precursor | NP_786938.1 | Rattus norvegicus
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_001374330.1 | Monodelphis domestica
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_006742753.1 | Leptonychotes weddellii
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_012423423.1 | Odobenus rosmarus divergens
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_004618628.1 | Sorex araneus
Clathrin-mediated | low-density lipoprotein receptor | XP_007526109.1 | Erinaceus europaeus
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_013976412.1 | Canis lupus familiaris
Clathrin-mediated | low-density lipoprotein receptor-related protein 5 | XP_005578910.1 | Macaca fascicularis
Clathrin-mediated | LOW QUALITY PROTEIN: transferrin receptor protein 2 | XP_003422538.2 | Loxodonta africana
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 5-like protein | XP_014717111.1 | Equus asinus
Clathrin-mediated | low-density lipoprotein receptor | XP_007994273.1 | Chlorocebus sabaeus
Clathrin-mediated | low-density lipoprotein receptor-related protein 6-like | XP_011216123.1 | Ailuropoda melanoleuca
Clathrin-mediated | transferrin receptor protein 1 | XP_011816639.1 | Colobus angolensis palliatus
Clathrin-mediated | LOW QUALITY PROTEIN: low-density lipoprotein receptor | XP_007079407.1 | Panthera tigris altaica
Clathrin-mediated | low-density lipoprotein receptor-related protein 2 isoform X3 | XP_011509487.1 | Homo sapiens
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_010620608.1 | Fukomys damarensis
Clathrin-mediated | low-density lipoprotein receptor-related protein 4-like isoform X2 | XP_014962409.1 | Ovis aries musimon
Clathrin-mediated | low-density lipoprotein receptor-related protein 2-like | XP_008528431.1 | Equus przewalskii
Clathrin-mediated | low-density lipoprotein receptor-like isoform X3 | XP_011236422.1 | Mus musculus

low-density lipoprotein receptor-related protein 1B-like XP_003483638.1 Sus scrofa

low-density lipoprotein receptor-related protein 3 XP_005616955.1 Canis lupus familiaris

low-density lipoprotein receptor-related protein 2-like XP_008523411.1 Equus przewalskii

low-density lipoprotein receptor-related protein 2 isoform X2 XP_010824740.1 Bos taurus

low-density lipoprotein receptor-related protein 1B-like XP_003793073.1 Otolemur garnettii

low-density lipoprotein receptor-like XP_004321103.1 Tursiops truncatus

low-density lipoprotein receptor-related protein 1B-like XP_014587742.1 Equus caballus

LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 1B-like XP_009180408.1 Papio anubis

low-density lipoprotein receptor-related protein 4-like isoform X1 XP_012966161.1 Mesocricetus auratus

low-density lipoprotein receptor-related protein 6-like XP_007076389.1 Panthera tigris altaica

LOW QUALITY PROTEIN: low-density lipoprotein receptor-related protein 5-like protein XP_014440794.1 Tupaia chinensis

low-density lipoprotein receptor-related protein 1B-like XP_013837390.1 Sus scrofa

low-density lipoprotein receptor-related protein 2-like XP_006040727.1 Bubalus bubalis

low-density lipoprotein receptor-related protein 10 XP_537364.2 Canis lupus familiaris

low-density lipoprotein receptor-related protein 8 XP_006737735.1 Leptonychotes weddellii

low-density lipoprotein receptor-related protein 1B-like XP_007109711.1 Physeter catodon

low-density lipoprotein receptor-like XP_003945116.2 Saimiri boliviensis boliviensis

Caveolin-mediated

caveolin-2 isoform c NP_937855.1 Homo sapiens

PREDICTED: caveolin-2 isoform X2 XP_007980865.1 Chlorocebus sabaeus

PREDICTED: caveolin-2 isoform X1 XP_005205450.1 Bos taurus

PREDICTED: caveolin-2 isoform X2 XP_005959438.1 Pantholops hodgsonii

PREDICTED: hypothetical protein LOC100655716 XP_003407275.1 Loxodonta africana

PREDICTED: caveolin-2 isoform X2 XP_008534172.1 Equus przewalskii

PREDICTED: caveolin-2 isoform X1 XP_005628405.1 Canis lupus familiaris

caveolin-2 NP_001162006.1 Pongo abelii

PREDICTED: caveolin-2-like isoform X3 XP_005080993.1 Mesocricetus auratus