<<

BMC Biology BioMed Central

Research article Open Access Mapping the human membrane proteome: a majority of the human membrane can be classified according to function and evolutionary origin Markus Sällman Almén, Karl JV Nordström, Robert Fredriksson and Helgi B Schiöth*

Address: Department of Neuroscience, Functional Pharmacology, Uppsala University, Uppsala, Sweden Email: Markus Sällman Almén - [email protected]; Karl JV Nordström - [email protected]; Robert Fredriksson - [email protected]; Helgi B Schiöth* - [email protected] * Corresponding author

Published: 13 August 2009 Received: 15 April 2009 Accepted: 13 August 2009 BMC Biology 2009, 7:50 doi:10.1186/1741-7007-7-50 This article is available from: http://www.biomedcentral.com/1741-7007/7/50 © 2009 Almén et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: Membrane proteins form key nodes in mediating the cell's interaction with the surroundings, which is one of the main reasons why the majority of drug targets are membrane proteins. Results: Here we mined the human proteome and identified the membrane proteome subset using three prediction tools for alpha-helices: Phobius, TMHMM, and SOSUI. This dataset was reduced to a non-redundant set by aligning it to the human genome and then clustered with our own interactive implementation of the ISODATA algorithm. The genes were classified and each group was manually curated, virtually evaluating each sequence of the clusters, applying systematic comparisons with a range of databases and other resources. We identified 6,718 human membrane proteins and classified the majority of them into 234 families of which 151 belong to the three major functional groups: receptors (63 groups, 1,352 members), transporters (89 groups, 817 members) or enzymes (7 groups, 533 members). Also, 74 miscellaneous groups with 697 members were determined. Interestingly, we find that 41% of the membrane proteins are singlets with no apparent affiliation or identity to any human protein family. Our results identify major differences between the human membrane proteome and the ones in unicellular organisms and we also show a strong bias towards certain membrane topologies for different functional classes: 77% of all transporters have more than six helices while 60% of proteins with an enzymatic function and 88% receptors, that are not GPCRs, have only one single membrane spanning α-helix. Further, we have identified and characterized new gene families and novel members of existing families. Conclusion: Here we present the most detailed roadmap of gene numbers and families to our knowledge, which is an important step towards an overall classification of the entire human proteome. We estimate that 27% of the total human proteome are alpha-helical transmembrane proteins and provide an extended classification together with in-depth investigations of the membrane proteome's functional, structural, and evolutionary features.

Page 1 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

Background either as receptors, transporters, or enzymes and that a Integral membrane proteins play a key role in detecting majority of all membrane proteins can be assigned to a and conveying outside signals into cells, allowing them to family of evolutionary related proteins while 41% of the interact and respond to their environment in a specific membrane proteins are found as single genes without manner. They form principal nodes in hormonal and neu- close relatives. Furthermore, we describe and classify new ronal signaling and attract large interest in therapeutic protein groups and novel members of existing families, interventions as the majority of drug targets are associated such as the putative solute carrier family AMAC and a to the cell's membrane. Although the human genome has novel putative calcium-channel gamma subunit. been public for several years, the exact number and iden- tity of all protein coding genes have been hard to deter- Results mine [1]. One of the most referenced papers regarding the We created a dataset of 13,208 human membrane pro- percentage of membrane proteins in proteomes is from teins based on consensus predictions of α-helices with 2001 where the membrane topology prediction method three applications, Phobius [12], TMHMM [2], and TMHMM was applied on a number of proteomes from SOSUI [13], in all 69,731 sequences in the human pro- different species to estimate the membrane protein con- teome dataset provided by IPI (v3.39). The predictions of tent, for example, Caenorhabditis elegans (31%), Escherichia the individual applications and the consensus approach coli (21%) and Drosophila melanogaster (20%) [2]. How- are shown in Figure 1. The dataset was reduced to a non- ever, the human or any other vertebrate's proteome was redundant set of 6,684 protein sequences by aligning all not included in this study. The original human genome predicted membrane proteins to the human genome with sequence project estimated 20% of the total gene count of BLAT [14] and removing all but the longest representative 31,778 genes to code for membrane proteins [3]. More protein for each genomic location. The non-redundant recently, four commonly used membrane topology pre- dataset was categorized into groups using an automatic diction methods were applied to the human proteome approach where all protein sequences were compared [4]. Based on the range of predictions by the different with each other and clustered according to similarity with methods 15 to 39% of the human proteome was dedi- our own implementation of the ISODATA algorithm cated to be membrane proteins, clearly illustrating how combined with manual curation where data from litera- difficult it is to estimate the number with automatic ture and public databases were considered. This extended approaches. The membrane proteomes of E. coli and Sac- charomyces cerevisiae have previously been described in a fairly comprehensive manner [5,6]. Recent overviews of membrane bound proteins discuss important membrane protein groups such as the G-protein coupled receptors (GPCR), Aquaporins, Ion channels, ATPases, their struc- ture and topology [7,8]. While several individual protein and gene families have been relatively well described, for example, the GPCRs [9] and Voltage-gated ion channels [10], there is a considerable number of genes that have remained unexplored.

We report the first detailed roadmap of the gene repertoire of human membrane bound proteins. We used 69,731 protein sequences from the International Protein Index (IPI) dataset, representing the total human proteome, to create an informative classification for the majority of the non-redundant transmembrane (TM) proteins. IPI is a top level domain, aiming to provide a union of the primary resources for proteins, as such it can be considered to con- SchematicresultsFigure 1 diagram of the transmembrane protein prediction Schematic diagram of the transmembrane protein tain all known protein sequences to current knowledge prediction results. Three different transmembrane predic- [11]. The analysis was performed in a two-step classifica- tion applications were applied on the International Protein tion procedure, involving automatic prediction and classi- Index (IPI) Human v.3.39 dataset of 69,731 protein fication in silico, combined with manual curation for each sequences. A consensus approach, where only the proteins of the protein groups, virtually sequence for sequence predicted as transmembrane by two applications (red color applying systematic comparisons with a range of data- in diagram) were considered, resulted in 13,208 predicted bases and other resources. We find that a large proportion transmembrane proteins. of the membrane proteins can be assigned a function

Page 2 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

the final membrane protein dataset to cover 6,718 pro- Recently, Clamp and colleagues provided a new updated teins, with 3,399 proteins of the final membrane dataset set of the human gene catalog [1]. This dataset contains a being categorized into one of 234 protein families or total of 23,789 genes from the Ensembl catalog of which groups and assigned a functional class (see Figure 2 and 19,523 are classified as valid protein-coding. We used this Materials and Methods for more details). To determine the dataset to classify our membrane proteins as true protein quality of our classification we used the sequence compar- coding or non-coding. Our analysis shows that 956 pro- isons from the clustering to find the median identity for teins of our membrane protein dataset were not present in the best hit among the classified and the unclassified pro- Clamp's dataset. They are considered to be invalid protein teins, respectively. The classified proteins have a signifi- coding genes together with the 402 proteins which repre- cant median identity greater than 35% (P < 0.001) and the sent confirmed invalid genes. When the invalid proteins unclassified have a median identity less than 13% (P < are excluded, 5,359 proteins remain of our membrane 0.001) using a non-parametric Wilcoxon signed rank test. protein dataset. This exclusion confirms the strength and quality of our classification as 1,106, or 81%, of the invalid genes constitute 33% of the unclassified proteins.

SchematicFigure 2 overview of the classification process of all human membrane proteins Schematic overview of the classification process of all human membrane proteins. The classification process had two general steps: an automatic and a manual or semi-manual. The automatic step can be divided into four parts, represented by blue boxes. First a dataset representing the human proteome was downloaded from the International Protein Index Trans- membrane proteins were predicted from the proteome by using three different TM helix prediction softwares: Phobius, SOSUI, and TMHMM. Proteins predicted to contain at least one TM helix by two of the softwares were assigned for further analysis. Splice variants were removed using BLAT to align all protein sequences to the human genome. The longest protein sequence for each genomic location, defined as a gene, was selected and clustered using a local implementation of the ISO- DATA algorithm. Pfam and GO terms, describing molecular function, were downloaded from IPI and used to provide an initial view of the created clusters' function and family affiliation. This information was used to divide them into three functional classes (receptors, enzymes, and transporters) and one miscellaneous class. In the manual classification step the clusters were compared with group databases, specialized in the three functional groups, and to family databases that provide information about protein families and their members. These resources are shown by the green bars in the figure. By combining the results from the clustering with members found in databases a final result could be compiled for the different protein families and groups.

Page 3 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

Thus, when making this more stringent estimation of pro- receptors and divided them into 63 groups (Figure 3). tein numbers, we have classified 3,145 (59%) of the valid Most of these families can be placed in one of four super- membrane proteins. families; G protein-coupled receptors (901 proteins), Recep- tor type tyrosine kinases (72 proteins), Receptors of the Receptors immunoglobulin superfamily and related (149 proteins) and A receptor is a protein that mediates a cellular response Scavenger receptors and related (63 proteins). The remain- upon binding of a ligand. We identified 1,352 proteins as

TheFigure human 3 receptors The human receptors. The figure shows the major families of membrane proteins that are classified to primarily function as receptors, proteins that trigger a cellular response upon binding of specific ligands. The tree structures give a comprehensive view of subfamilies in the largest families. The number of genes and function for each family have been determined by combin- ing results from clustering, using the ISODATA algorithm, with data from the literature and public databases. Primarily the Human Plasma Membrane Receptome database and HGNC have been used. Consensus TM helix numbers have been set through evaluation of data from external resources among the literature and databases together with prediction results from Phobius, SOSUI, and TMHMM. The Other table contains receptor families that do not fit into any of the larger receptor fami- lies.

Page 4 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

ing 167 receptors, in 20 groups, are annotated as Other Availability and further analysis receptors. The classification together with predictions by Phobius, SOSUI, and TMHMM for each protein is available in Addi- Transporters tional file 1. Sequences in FASTA format for each class and Transporters perform the movement of a substrate across a BLAST database for the whole dataset can be found in membranes by utilizing electrochemical gradients or Additional file 2. To help readers in making their own energy from chemical reactions. We identified 817 trans- analysis of the dataset a short user guide is provided as porters and placed them in 89 groups (Figure 4). The Additional file 3. For additional data or support with groups have been primarily arranged into three major extended searches and analysis of the dataset and classifi- functional classes: Channels (247 proteins), Solute carriers cation, feel free to contact the corresponding author. (393 proteins), and Active transporters (81 proteins). The remaining eight groups are annotated as Other transporters Discussion (51 proteins). Here we also include 42 auxiliary transport We provide a non-redundant dataset for the human mem- proteins in 9 groups that modulate the activity of other brane proteome and a qualitative functional classification transporters rather than performing the transport them- for all major groups and families containing 6,718 pro- selves. teins. Comparison with the most recent and reliable set of the genes in the human genome [1] suggests that the Enzymes 5,359 validated protein coding α-helical transmembrane Enzymes are proteins with the ability to catalyze a chemi- proteins comprise 27% of the entire human proteome. cal reaction. We have identified 533 enzymes (Figure 5). The relative number of membrane proteins coded in the They have been classified based on the EC system, which human genome was estimated to be 20% after the finish- classifies enzymes performing a similar type of reaction ing of the human genome sequence [3]. The overall por- into six major classes: Oxidoreductases (123 proteins), tion of membrane bound proteins has thus increased with Transferases (194 proteins), Hydrolases (178 proteins), 7 percentage points, whereas the estimated number of Lyases (17 proteins), Isomerases (8 proteins) and Ligases (7 genes has decreased significantly from 31,778 [3] to proteins). An additional six enzymes belong to multiple 19,523 [1]. This may suggest that the identity of mem- classes. brane proteins in general has been more reliable than sol- uble proteins, which may reflect that they receive Miscellaneous relatively large attention in medical research because of The 74 protein families that did not fit into any of the 3 their role as drug targets. Our number of 27% is within the major functional classes were gathered in a class called two previously suggested spans of 15 to 39% [4] mem- Miscellaneous. This class contains 697 proteins and is fur- brane proteins in the human proteome and 20 to 30% [2] ther divided into four subclasses: Ligands (57 proteins), in any proteome, regardless of species. It is notable that Other (272 proteins), Structural/Adhesion proteins (187 our clustering and classification resulted in 3,145 (59%) proteins) and Proteins of unknown function (181 proteins). valid proteins of the total dataset being identified to belong to 234 families or groups with at least two mem- Membrane topology bers, while 41% of the data set are single genes with no We created an overview of the occurrence of structures clear identity (significantly lower than 13% (P < 0.001)) with a certain membrane topology, that is, the number of to any other human membrane protein coding gene. We TM α-helices and the localization of the N-terminal, in the are not aware of any exact estimate of the relative percent- different functional classes based on predictions with age of genes belonging to protein families within the Phobius (Figure 6). This clearly shows that 1 TM proteins human genome but one previous estimation has sug- are the most numerous structures and that the number gested that at least 40% of the human genes are members generally decreases with increasing numbers of TM heli- of gene families [15]. ces. However, the 7 TM structure is an outlier, having the second highest count. It is also possible to see a trend Comparison of membrane proteome composition where some topologies are more common for certain The largest functional group of membrane proteins is the classes. Receptors have in general a 1 TM or 7 TM topology Receptor class, constituting 23% of our membrane pro- representing 29% and 52% of all receptors respectively, teome dataset and 40% of the classified proteins. This is whereas 71% of the transporters have more than 6 TM. It in large contrast with the membrane proteome of E. coli is also evident that the fraction of unclassified proteins is where receptors count for only 5% of the membrane pro- greater for low numbers of TM helices. 1 TM and 2 TM teome, leaving transporters as the most prominent group topologies contribute to 77% of the unclassified proteins. with 40%, compared with 15% in humans and 32% in S. cerevisiae's membrane proteomes [5,6]. The estimated number of membrane receptor proteins has increased

Page 5 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

TheFigure human 4 transporters The human transporters. The figure shows the major families of membrane proteins that primarily function as transport- ers, proteins that facilitate the movement of substrates across a membrane. The tree structures represent three major classes of transporters that differ in energy dependence and utilization during transport. The number of genes and function for each family have been determined by combining results from clustering, using the ISODATA algorithm, with data from the literature and public databases. Primarily TCDB and HGNC were used. Consensus TM helix numbers have been set through evaluation of data from external resources among the literature and databases together with prediction results from Phobius, SOSUI, and TMHMM. The Other table contains transporter families that do not fit into any of the larger families. The table of Auxiliary transport proteins contains families that are not essential for transport, but modulate other transport proteins through differ- ent interactions.

Page 6 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

TheFigure human 5 enzymes The human enzymes. The figure shows the major families of membrane proteins that primarily function as enzymes, pro- teins that catalyze a chemical reaction. The enzymes were divided into six major classes, depending of the character of the chemical reaction they are involved in. This was performed according to the EC system. Each of the six classes is represented by a tree structure, showing some of the hierarchical order of subclasses within each class. The number of genes for each class and subclass, and function, have been determined by combining results from clustering, using the ISODATA algorithm, with data from the literature and public databases. Primarily the BRENDA database was used. Consensus TM helix numbers have been set through evaluation of data from external resources among the literature and databases together with prediction results from Phobius, SOSUI, and TMHMM. The box contains proteins that belong to more than one class.

from about 35 in E. coli to over 1,000 in humans. The tems for communication between cells in more complex, GPCRs (7 TM) count for the largest expansion, with 67% multicellular organisms [16-18]. of the human receptors compared with zero in bacteria and three proteins in S. cerevisiae [16]. The major expan- Membrane topology and protein function sion of the receptors within the metazoan lineage may Common for all membrane proteins in this study are that reflect the need for a diverse repertoire of signaling sys- their amino acids form membrane crossing hydrophobic

Page 7 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

Figure 6 (see legend on next page)

Page 8 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

MiscellaneousFigure 6 (see humanprevious protein page) groups Miscellaneous human protein groups. The five boxes represent functional classes of proteins that do not fit within the definition of receptors, enzymes, or transporters. Structure/Adhesion proteins are those that build up structure between or within cells or mediate adhesion between the cell and the surroundings. Ligand proteins are groups that mainly function as lig- ands, structures that bind to receptors. Vesicle membrane proteins are proteins found in the membrane of cellular vesicles. The large Other group are protein groups of various functions that do not fit together with the other groups. Proteins of 'unknown function' show groups of related proteins for which no known function has been found. The number of genes and their function for each group have been determined by combining results from clustering, using the ISODATA algorithm, with data from the literature and public databases. Primarily HGNC and UniProt were used. Consensus TM helix numbers have been set through evaluation of data from external resources among the literature and databases together with prediction results from Phobius, SOSUI, and TMHMM.

α-helices and their function depends on the number of of the human transporter are representative for transport- helices and the orientation of the N- and C-terminal. Our ers in general, considering a number of distant species. results show a clear bias in favor of certain topologies There are also receptors with a high number of TM helices. within the membrane proteome. We find that the major- The 7 TM structure of the large GPCR group, which con- ity of the proteins with odd numbers of TM helices have tains 67% of the human receptors, undergoes a conforma- their N-terminal positioned on the outside of the mem- tion change upon ligand binding, which propagates the brane, whereas the opposite is true for those with an even signal into the cell. Other topologies among the receptors number (Figure 7). Hence, it is more common for pro- other than 1 TM and 7 TM are only found in seven of the teins to have their C-terminal oriented to the inside of the 61 classified receptor groups, a total of 26 proteins. One membrane. We find that this is true for 75% of the TM of these are the two Hedgehog receptors, Patched, with a proteins predicted by Phobius. Previously it has been 12 TM topology which otherwise is almost exclusively reported that 82% of the TM proteomes of S. cerevisiae found among transporters. In the 74 groups of the Miscel- have the C-terminal located on the cells' inner side [6]. It laneous class, some interesting observations regarding is common that the C-terminal part of membrane pro- membrane topology are made (Figure 6). In the group of teins interacts with other intracellular proteins, for exam- Structure/Adhesion proteins, with 187 members, all 12 ple, G-proteins and many accessory proteins bind to the families show either a 1 TM or 4 TM topology. Most C-terminal of GPCRs and this is essential for signaling numerous are the nine families of 1 TM proteins, 73% of [19]. Many proteins have only one single TM helix (47%) the group, which among others contains the large cad- and while some of these TM regions seem to have a pri- herin and protocadherin families of 30 and 38 proteins, mary role to simply anchor the protein to the membrane, respectively. These proteins generally form homologous that is, no signal or substrate is relying on the TM helices connections between cells, that is, two identical proteins to cross the membrane, several form oligomers that can from the cells adhere to each other [20]. They are impor- participate in the signal process. Non-GPCR receptors are tant during development and for keeping tissue integrity common 1 TM proteins (Phobius predicts 397 receptors and morphology, since cells expressing the same proteins to have this topology) and 60% of the proteins in the stick together. In addition, there are also some minor 1 enzyme group are also found here, but only 57 transport- TM families, for example, Sarcoglycan, that are found in ers. On the other hand, multi-TM proteins are often structures that anchor the muscle fibers to the extracellular highly dependent of the arrangement of their TM helices matrix [21]. The remaining 27% of this group are 4 TM that form complex structures. Transporters are the most proteins of Structure/Adhesion character that can be obvious group, which in general has high TM numbers; divided into three families; gap junctions (connexins), 77% of the transporters have at least a 6 TM topology and claudins, and five proteins of the EMP/PMP22/LIM fam- 76% of the classified proteins with at least 8 TM helices ily. are transporters. This is also true for a majority of the fam- ilies classified as transporters (Figure 4). In the TCDB The members of the two latter families belong to the same database, which holds transporter families from all organ- Pfam domain (PF00822) and are likely to share common isms, 70% of 2,847 α-helical channels, secondary trans- descent. In the Pfam database we find that 58% of the porters and active transporters, have at least a 6 TM 1,113 non-human proteins containing any of the Pfam topology according to predictions by Phobius. Further, 12 families found in the Structure/Adhesion class, except TM was the most common topology in TCDB (17%), immunoglobulins, are predicted to possess only one TM which is in analogy with 16% in human (Figure 7 and while 20% are predicted to have a 4 TM topology. This Additional file 1). Thus, a high number of TM helices is a suggests that the relative frequency of 1 TM and 4 TM good predictor for transporter function and the topologies Structure/Adhesion proteins is similar in humans and

Page 9 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

other organisms. Proteins with a 4 TM structure are also it is evident that a larger portion of the short proteins with found in the 'Other' and 'Protein of unknown function' fewer than seven helices, and especially 1 TM proteins, are groups, constituting 25% and 16% of the respective uncharacterized. Such bias has been shown for other groups. Hence, the 4 TM proteins are over-represented in organisms in previous studies [5,6]. the Miscellaneous class, but they are also found to a rather high extent among transporters, especially in the families Identification of uncharacterized proteins and families of ion channels. It should be noted that the 4 TM proteins Our clustering resulted in the identification of new pro- of transporters and the Miscellaneous class generally are tein groups and novel members of existing families. Dis- predicted to have opposite topology: the transporters have cussions about individual members in existing families their N-termini located on the outside of the membrane are found in Additional file 4. Here we want to highlight whereas the Miscellaneous proteins have it on the inside. five clusters with a total of 41 sequences found in the Mis- The 4 TM topology proteins which are found in the Mis- cellaneous class where no previous relationship within cellaneous class are not classical receptors or enzymes, but the families have been reported. These families are simply rather involved in the formation of structures (for exam- termed New TM Group (NTMG): NTM1G1, NTM1G2, ple, claudin), vesicle trafficking (for example, synaptogy- NTMG1, NTMG2, and NTM5G1; they are more or less rin and synaptophysin) or organizing other proteins of uncharacterized with little annotation and no similarity to the membrane (for example, tetraspanins) [22-24]. There- any Pfam domain. The NTM5G1 family contains three fore they could have been ignored in pharmacological proteins; two are found in a cluster at chromosome 11 research efforts, which explain why many of the 4 TM pro- and one at chromosome 4. Two are predicted to be 5 TM teins are found in this group. In general, it seems that 4 proteins and one 3 TM. They show high identity to the C- TM topology proteins are relatively under-studied. Overall terminal end of the 11 TM protein Unc93B1. This protein

TransmembraneFigure 7 topology analysis Transmembrane topology analysis. The graphs show the distribution of proteins with different membrane topologies for the membrane proteome. Proteins with the N-terminal located on the outside of the membrane are plotted upwards and those with the N-terminal on the inside are plotted downwards. The colors of the bars represent the proportion of different functional groups and unclassified proteins for a topology. The topologies have been predicted for each protein with Phobius. A. The graph shows the distributions of proteins predicted to have one TM helix. B. The graph shows the distribution of pro- teins predicted to have multiple TM helices. Forty-three proteins with more than 14 TM helices have been excluded from graph B.

Page 10 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

is found in chromosome 11, implicating an expansion of with SOSUI and TMHMM to avoid false-positive predic- the family through a gene duplication event. Recently tion as suggested by Ahram and colleagues [4]. Candidate Unc93B1 was reported to be involved in trafficking of toll membrane proteins were initially selected as those pre- receptors to endolysosomes and is proposed to be dicted to have TM helices by at least two applications. The involved in immunodeficiency, but when its homolog TM predictions received from the three individual applica- was initially characterized in C. elegans it was found to be tions and after the consensus approach can be found in involved in muscle contraction, which suggests multiple Figure 1. The number of protein sequences was reduced by functions for the putative family [25,26]. The three novel aligning all protein sequences to the human genome with proteins have orthologs in several species, which supports BLAT and selecting the longest sequence from each group them as valid proteins and they might represent a novel of overlapping sequences [14]. Consequently, only one subfamily of truncated Unc93B1 homologues (data not representative protein sequence for each genomic locus shown). Such truncated genes were discussed by Kashuba was kept and alternative splice variants and so on were and colleagues as they found clones with high similarity discarded. This non-redundant membrane proteome to the 3' part of the Unc93B1 gene [27]. None of the other dataset was used in the clustering. NTMG families has any similarities with known proteins and can thus be considered as virtually uncharacterized. Clustering Considering the extent of our clustering methods we find The clustering was performed with a local implementa- it unlikely that larger groups of closely related proteins are tion of the ISODATA algorithm [28] with improvements left to be discovered within the human TM proteome, according to Philips [29]. The algorithm initially chooses although it cannot be entirely excluded. However, there a number of random data as cluster centroids and assigns are probably still several distant members of existing fam- all the remaining data to the closest cluster centroid. The ilies and diverse novel families that could be identified in clusters are evaluated; clusters with fewer than three mem- the future using more sensitive techniques than sequence bers are dropped and clusters where the standard devia- comparisons. tion of the internal distances is above an empirically determined threshold are split. Finally, new centroids are Conclusion calculated as the average of the data in each cluster, respec- We annotated the majority of the human membrane tively. This procedure is iterated for a selected number of bound proteome linked with functional properties and times or until the clusters do not change between two iter- family or group affiliation. This classification represents ations. In our implementation, each protein was repre- most major families and groups found in the membrane sented by a vector containing distances to every other proteome and provides the most detailed and updated protein in the dataset. Distances were calculated as the count of their members. Overall, our clustering and func- score from a Needlemann-Wunsch global alignment [30] tional classification approach is likely to be useful in order normalized by the alignment length as produced by the to create a detailed map of the entire human genome. implementation in the EMBOSS package run with stand- ard parameters [31]. In the clustering, Euclidean distance Methods was used between data. The method was stabilized by Retrieval of the initial membrane proteome dataset making several runs. Consensus clusters were created by IPI Human version 3.39 was downloaded from EBI con- letting the data that shared clusters in all runs define ker- taining 69,731 protein sequences [11]. Membrane topol- nels. Each datum was assigned to the kernel which it clus- ogy was predicted for the IPI Human dataset by the use of tered with the most times and clusters were merged if a three different applications to improve the accuracy [4]: datum was associated to several clusters in more than 50% Phobius, TMHMM, and SOSUI [2,12,13]. Phobius and of the runs. BLAST [32] and HMMER [33] was used to TMHMM both use hidden Markov models (HMM) to pre- confirm and mine each accepted cluster. HMMs for each dict membrane topology, but different training sets have cluster were constructed by iteratively constructing multi- been used and Phobius also uses a HMM to predict signal ple alignments with Kalign [34] and removing columns peptides. SOSUI evaluates hydrophobicity represented in less than 10% of the sequences, or in at and amphiphilicity for its predictions and complements least two sequences if the cluster contains less than 10 the HMM methods as it is not dependent on training sets. sequences. A tagged BLAST database was constructed from These three programs predict the topology of TM proteins the clusters and a background consisting of the proteins spanning the membrane with α-helices. Thus, we consider for which no TM helices were predicted. The background only such membrane proteins in this study. set was reduced during the procedure, as non-TM mem- bers of families were discovered, for example, kinases First Phobius was used to predict TM helices and signal [35]. Proteins which had at least four of the top five peptides for the IPI Human dataset. The predicted signal unique BLAST hits with E-values below 0.01 and match- peptides were cut out of the sequences before prediction ing the best hits cluster-HMM were automatically assigned

Page 11 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

to that cluster. Exceptions were made if the query cluster Multiple sequence alignments for each cluster were cre- contained less than five transcripts; then it was enough to ated following the same method as described for the clus- hit all query transcripts in order for a protein to be tering. The multiple sequence alignments were used to assigned to that cluster. The whole cluster procedure was build and calibrate HMMs with HMMER, using standard repeated seven times on the remaining unclustered pro- parameters. The TCDB dataset was searched against the teins after each iteration. The created clusters were used as cluster HMMs with HMMER. A cluster was assigned the a starting point for the classification. same transporter class as a sequence if the E-value was below 10-6 and the HMM was the best hit for the Classification sequence. In addition, TCDB were manually investigated The sequences of the clusters were assigned with GO terms through the website and families found to be missing in [36], describing molecular function, and Pfam [37] fami- the clustering were gathered from the IPI dataset. lies retrieved from the Gene Ontology website and the IPI human annotations. This information, together with the Enzymes: the BRENDA database IPI annotation, was used to receive an initial view of the The BRENDA human dataset was downloaded from the clusters' protein family affiliations and the general func- website and aligned against the human genome using tion of the families. The clusters were sorted into one of BLAT. Splice variants were removed and the longest repre- four classes depending on their type function: receptors, sentative for each genomic location was kept, using the transporters, enzymes, or miscellaneous. The clusters with same approach as described in the retrieval of the mem- proteins that fitted into more than one class, such as lig- brane proteome dataset. The BLAT results for the BRENDA and-gated ion channels (that is, transporter and receptor dataset were checked for overlap with the cluster dataset. function) and receptor-type kinases (that is, enzyme and If two sequences overlapped they were considered to be receptor function), were chosen and sorted according to representatives for the same gene and the cluster sequence one function. This was done by evaluating our data and was annotated as an enzyme of the same class as the literature to make the best and most objective choice pos- BRENDA sequence. Clusters were then annotated with the sible. During the manual classification the clusters were same enzyme class as their containing sequences. compared in terms of members and function to external references in two steps. Step 2: comparison with family specific resources In the second step, comparisons were made between the Step 1: comparison with group databases clusters and resources containing family oriented data During the first step we used general group databases spe- such as records of family members and information about cialized in receptors (HPMR) [18], transporters (TCDB) family function and structure. The resources were both [38] or enzymes (BRENDA) [39]. The databases were general (for example, HGNC [40] and UniProt [41]) and examined with slightly different approaches (see below), family specific (for example, web-resources such as Kin- due to differences in content and availability. Base [35] and SLC-tables [42] or the literature). This was performed by manual inspection and/or bioinformatic Receptors: the HPMR database approaches, similar to Step 1. The clusters of the miscella- The HPMR database is accessible through its web interface neous families were carefully examined and annotated and no sequence dataset could be downloaded. Thus, the after comparison with HGNC and UniProt. The group was comparisons between families assigned among the clus- divided into four groups: Structure/Adhesion, Ligand, ters and HPMR were performed by purely manual inspec- Other, and Proteins of unknown function. The two latter tion. The HPMR website was examined and the different groups contain protein families with functions that did families of receptors were identified and compared with not fit in the other groups and protein families of the cluster dataset, allowing clusters to be classified as the unknown function. The manual curation allowed for correct receptor family. Families found to be missing sequences and families that were missing in the clusters to among the clusters were manually added by gathering be added. This made it possible to construct as complete sequences from the IPI dataset. a record as possible for the major protein families and classify it according to functionality. If possible, the Transporters: the TCDB database number of consensus TM helices was set for each family. The TCDB database provides a hierarchical classification This was performed by manual evaluation of the predic- system, annotation, and information for each class of dif- tion results from TMHMM, SOSUI, and Phobius together ferent transporters. A dataset with representative protein with data from the literature. sequences for each transporter class is provided. However, the sequences are from various organisms and human Estimating the size of the membrane proteome sequences are not available for all transporters. The IPI dataset has the ambition to contain all known pro- tein sequences, not only the well established. Thus, the

Page 12 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

use of IPI in the analysis limits the possibility to miss less characterized proteins for the cost of more false positives, Additional file 4 such as protein sequences from gene prediction artifacts This text describes putative novel protein families and members of and non-coding genes that occurs in this dataset. This existing families. Click here for file complicates the estimation of the number of membrane [http://www.biomedcentral.com/content/supplementary/1741- proteins. To address this issue we have used a dataset cre- 7007-7-50-S4.doc] ated by Clamp and colleagues based on the union of the human gene catalog of Ensembl 35 and 48 where all genes have been classified as protein-coding or non-coding [1,43]. All transcript sequences representing the genes in Acknowledgements this dataset were downloaded from the Ensembl FTP-site We thank Professor Gunnar von Heijne at Stockholm University for valua- [44] and aligned against the human genome with BLAT. ble comments on the manuscript, and Michele Clamp in Erik Landers lab at The transcripts were checked for overlap with our Broad Institute of Massachusetts Institute of Technology, and Harvard, USA membrane dataset. If a membrane dataset sequence over- for providing classification tables for the Ensemble 48 genes. The studies lapped a Clamp-sequence it inherited its class. Conse- were supported by the Swedish Research Council, The Novo Nordisk Foundation, Swedish Royal Academy of Sciences, and Magnus Bergvall quently, we can make a more critical estimation of the Foundation. RF was supported by the Göran Gustafssons foundation. number of genes coding for membrane proteins. References Abbreviations 1. Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad- HMM: hidden Markov model; IPI: International Protein Toh K, Lander ES: Distinguishing protein-coding and noncod- ing genes in the human genome. Proc Natl Acad Sci USA 2007, Index; NTMG: New TM Group; TM: transmembrane. 104:19428-19433. 2. Krogh A, Larsson B, von Heijne G, Sonnhammer EL: Predicting transmembrane protein topology with a hidden Markov Authors' contributions model: application to complete genomes. J Mol Biol 2001, MSA carried out the study, analyzed the data, participated 305:567-580. in the design, and drafted the manuscript. KJVN imple- 3. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris mented the cluster algorithm and participated in the K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, design of the study and in the writing of the manuscript. McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, RF and HS conceived the study, and participated in its Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, et al.: Initial sequencing and analysis of the human genome. Nature 2001, design and the writing of the manuscript. All authors read 409:860-921. and approved the final manuscript. 4. Ahram M, Litou ZI, Fang R, Al-Tawallbeh G: Estimation of mem- brane proteins in the human proteome. In Silico Biol 2006, 6:379-386. Additional material 5. Daley DO, Rapp M, Granseth E, Melen K, Drew D, von Heijne G: Global topology analysis of the Escherichia coli inner mem- brane proteome. Science 2005, 308:1321-1323. Additional file 1 6. Kim H, Melen K, Osterberg M, von Heijne G: A global topology map of the Saccharomyces cerevisiae membrane proteome. The table contains the International Protein Index (IPI) accession Proc Natl Acad Sci USA 2006, 103:11142-11147. numbers and classification for the final membrane protein dataset 7. von Heijne G: Membrane-protein topology. Nature Rev 2006, together with predictions by Phobius, SOSUI, and TMHMM. 7:909-918. Click here for file 8. von Heijne G: The membrane protein universe: what's out [http://www.biomedcentral.com/content/supplementary/1741- there and why bother? J Intern Med 2007, 261:543-557. 9. Lagerstrom MC, Schioth HB: Structural diversity of G protein- 7007-7-50-S1.xls] coupled receptors and significance for drug discovery. Nat Rev Drug Discov 2008, 7:339-357. Additional file 2 10. William A, Catterall KG, Chandy DE, Clapham GA, Gutman F, Hof- ZIP-archive containing FASTA files for all classes and a BLAST data- mann AJ, Harmar DR, Abernethy MS: International Union of Pharmacology: approaches to the nomenclature of voltage- base for the entire dataset. gated ion channels. Pharmacol Rev 2003, 55:573-574. Click here for file 11. Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, [http://www.biomedcentral.com/content/supplementary/1741- Apweiler R: The International Protein Index: an integrated 7007-7-50-S2.zip] database for proteomics experiments. Proteomics 2004, 4:1985-1988. 12. Kall L, Krogh A, Sonnhammer EL: A combined transmembrane Additional file 3 topology and signal peptide prediction method. J Mol Biol A short user guide for searching and further analysis of the dataset 2004, 338:1027-1036. and classification. 13. Hirokawa T, Boon-Chieng S, Mitaku S: SOSUI: classification and Click here for file secondary structure prediction system for membrane pro- teins. Bioinformatics 1998, 14:378-379. [http://www.biomedcentral.com/content/supplementary/1741- 14. Kent WJ: BLAT – the BLAST-like alignment tool. Genome Res 7007-7-50-S3.doc] 2002, 12:656-664. 15. Futuyma D: The origin of genetic variation. In Evolution Sunder- land, MA: Sinauer Associates Inc; 2005:163-164.

Page 13 of 14 (page number not for citation purposes) BMC Biology 2009, 7:50 http://www.biomedcentral.com/1741-7007/7/50

16. Fredriksson R, Schioth HB: The repertoire of G-protein-coupled 41. Bairoch A, Bougueleret L, Altairac S, Amendolia V, Auchincloss A, receptors in fully sequenced genomes. Mol Pharmacol 2005, Argoud-Puy G, Axelsen K, Baratin D, Blatter MC, Boeckmann B, Bol- 67:1414-1425. leman J, Bollondi L, Boutet E, Quintaje SB, Breuza L, Bridge A, deCas- 17. Manning G, Plowman GD, Hunter T, Sudarsanam S: Evolution of tro E, Ciapina L, Coral D, Coudert E, Cusin I, Delbard G, Dornevil D, protein kinase signaling from yeast to man. Trends Biochem Sci Roggli PD, Duvaud S, Estreicher A, Famiglietti L, Feuermann M, 2002, 27:514-520. Gehant S, Farriol-Mathis N, et al.: The Universal Protein 18. Ben-Shlomo I, Yu Hsu S, Rauch R, Kowalski HW, Hsueh AJ: Signal- Resource (UniProt) 2009. Nucleic Acids Res 2009, 37:D169-174. ing receptome: a genomic and evolutionary perspective of 42. Hediger MA, Romero MF, Peng JB, Rolfs A, Takanaga H, Bruford EA: plasma membrane receptors involved in signal transduction. The ABCs of solute carriers: physiological, pathological and Sci STKE 2003, 2003:RE9. therapeutic implications of human membrane transport 19. Bockaert J, Marin P, Dumuis A, Fagni L: The 'magic tail' of G pro- proteins. Introduction. Pflugers Arch 2004, 447:465-468. tein-coupled receptors: an anchorage for functional protein 43. Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, networks. FEBS Lett 2003, 546:65-72. Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fern- 20. Ivanov DB, Philippova MP, Tkachuk VA: Structure and functions andez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, of classical cadherins. Biochemistry (Mosc) 2001, 66:1174-1186. Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kin- 21. Judge LM, Haraguchiln M, Chamberlain JS: Dissecting the signaling sella R, Kokocinski F, Kulesha E, Lawson D, Longden I, et al.: Ensembl and mechanical functions of the dystrophin-glycoprotein 2009. Nucleic Acids Res 2009, 37:D690-697. complex. J Cell Sci 2006, 119:1537-1546. 44. The Ensembl FTP-site [ftp://ftp.ensembl.org/pub/] 22. Krause G, Winkler L, Mueller SL, Haseloff RF, Piontek J, Blasig IE: Structure and function of claudins. Biochim Biophys Acta 2008, 1778:631-645. 23. Hubner K, Windoffer R, Hutter H, Leube RE: Tetraspan vesicle membrane proteins: synthesis, subcellular localization, and functional properties. Int Rev Cytol 2002, 214:103-159. 24. Berditchevski F, Odintsova E: Tetraspanins as regulators of pro- tein trafficking. Traffic 2007, 8:89-96. 25. Kim YM, Brinkmann MM, Paquet ME, Ploegh HL: UNC93B1 deliv- ers nucleotide-sensing toll-like receptors to endolysosomes. Nature 2008, 452:234-238. 26. Levin JZ, Horvitz HR: The Caenorhabditis elegans unc-93 gene encodes a putative transmembrane protein that regulates muscle contraction. J Cell Biol 1992, 117:143-155. 27. Kashuba VI, Protopopov AI, Kvasha SM, Gizatullin RZ, Wahlestedt C, Kisselev LL, Klein G, Zabarovsky ER: hUNC93B1: a novel human gene representing a new gene family and encoding an unc- 93-like protein. Gene 2002, 283:209-217. 28. Ball GH, Hall DJ: ISODATA, A Novel Method of Data Analysis and Pattern Classification Menlo Park: Stanford Research Institute; 1965. 29. Philips S: Reducing the compuTation time of the Isodata and K-means unsupervised classification algorithms. Geoscience and Remote Sensing Symposium: 24–28 June 2002: Toronto, Canada . 30. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-453. 31. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16:276-277. 32. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410. 33. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14:755-763. 34. Lassmann T, Sonnhammer EL: Kalign – an accurate and fast mul- tiple algorithm. BMC Bioinformatics 2005, 6:298. 35. Manning G, Whyte DB, Martinez R, Hunter T, Sudarsanam S: The protein kinase complement of the human genome. Science 2002, 298:1912-1934. 36. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel- Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet Publish with BioMed Central and every 2000, 25:25-29. 37. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, scientist can read your work free of charge Forslund K, Eddy SR, Sonnhammer EL, Bateman A: The Pfam pro- "BioMed Central will be the most significant development for tein families database. Nucleic Acids Res 2008, 36:D281-288. disseminating the results of biomedical research in our lifetime." 38. Saier MH Jr, Tran CV, Barabote RD: TCDB: the Transporter Classification Database for membrane transport protein Sir Paul Nurse, Cancer Research UK analyses and information. Nucleic Acids Res 2006, 34:D181-186. Your research papers will be: 39. Chang A, Scheer M, Grote A, Schomburg I, Schomburg D: BRENDA, AMENDA and FRENDA the enzyme information available free of charge to the entire biomedical community system: new content and tools in 2009. Nucleic Acids Res 2009, peer reviewed and published immediately upon acceptance 37:D588-592. 40. Eyre TA, Ducluzeau F, Sneddon TP, Povey S, Bruford EA, Lush MJ: cited in PubMed and archived on PubMed Central The HUGO Gene Nomenclature Database, 2006 updates. yours — you keep the copyright Nucleic Acids Res 2006, 34:D319-321. Submit your manuscript here: BioMedcentral http://www.biomedcentral.com/info/publishing_adv.asp

Page 14 of 14 (page number not for citation purposes)