Prediction of Membrane Proteins in Post-Genomic Era
Total Page:16
File Type:pdf, Size:1020Kb
Prediction of Membrane Proteins in Post-Genomic Era Daisuke Kihara1* & Minoru Kanehisa2 1) Donald Danforth Plant Science Center 893 North Warson Road St. Louis, Missouri 63141, USA [email protected] Fax: 314-812-8075 2) Institute for Chemical Research, Kyoto University Gokasho, Uji, Kyoto, 611-0011, Japan [email protected] Fax: 81-774-38-3269 *) To whom correspondence should be addressed Running title: Membrane Proteins in Genome Recent Res. Developments in Protein Engineering. 1: 179-196 (2001) 1 ABSTRACT protein. Membrane proteins have important roles in living cells, such as transport, energy The current status of the prediction methods of production, cell signaling and cell adhesion. In membrane proteins in the post-genomic era is the genomic context, membrane proteins have described. In the first section, prediction attracted more attention, because they discussed methods of transmembrane segments in that in microbial genomes, the distribution of proteins are considered, including the TSEG transporters, one of the major members of the program, which we have recently developed. membrane protein family, reflects the Several prediction methods for the tertiary environment that each organism inhabits structure of membrane proteins are also [10,11]. mentioned. Finally, we show the unique feature We begin this text by reviewing the of the distribution of membrane proteins in a prediction methods of TM segments in proteins, genome. Our prediction of the function of which are also used in detecting TM proteins in several membrane proteins is also shown. genome sequences. The prediction of TM segments is also a starting point of the tertiary INTRODUCTION structure prediction of a membrane protein. Here, we describe the TSEG program, which The beginning of the scientific research of this we have developed recently. The proper way to century is characterized by the report that the measure the performance of prediction methods human genome sequencing project has been is also mentioned. Next, methods for predicting completed [1,2]. Beyond the human genome, a the tertiary structure of TM proteins are surge of world-wide genome projects in the last reviewed. Currently, tertiary structures of only decade have been producing complete genome a few TM proteins are solved by experimental sequences of an increasing number of methods, and most of them have helical TM organisms. At this point (February 2001), segments. Limiting the objects of prediction to complete genome sequences of 48 organisms those proteins with helical TM segments are available in the KEGG database [3], which (exceptions at this point are porin, which has β- is a compilation of genome sequences and barrel structure [12] and bacterial toxin proteins pathways we have been maintaining on the web. which penetrate into membranes and form Complete genome sequences enable a membrane pores [13-16]), this prediction comprehensive study of proteins or organisms procedure can be simplified to the assembly of through a catalog of proteomes, e.g., dynamic semi-rigid helical rods. In a later section, we genome rearrangement of a pair of closely show a genome sequence analysis in terms of related organisms [4,5], the spread and membrane proteins. The membrane protein evolution of a particular protein family among content of a genome, and their physical organisms [6,7], and physical principles of how distribution in a genome is the main issue here. related genes are encoded in the genome [8,9]. We have also made predictions for the In this post-genomic era, every computational functions of membrane proteins based on the structure/function prediction method should observations. take into account its application to genome sequences. 1. Prediction Methods of Transmembrane Throughout this manuscript, trans- Segments membrane (TM) protein is meant by membrane 2 The TM segment prediction method originates amino acid propensity to construct prediction from the hydropathy plot by Kyte & Doolittle methods [30-32]. Using the Neural network is (1982) [17]. The idea is rather simple: another promising approach to these kinds of recognizing highly hydrophobic stretches in the two-dimensional structure predictions [33-35]. sequence as TM segments using a sliding Using multiple sequence alignments helps window of a certain length (7-21 amino acids). improving prediction accuracy [30,31,34,35]. Another important contribution of this paper is Kroth et al. applied a hidden Markov model the derivation of the hydrophobicity index of [36,37]. amino acids, which is still commonly used. It The topology of membrane proteins can was derived from water-vapor transfer free be derived from some biochemical energies [18-20] and interior-exterior experimental evidence, even if their tertiary distribution of amino acids [21]. Various structures are not solved [38-40]. But it is often improvements have been performed on the the case that contradictory results are suggested hydropathy analysis since then: taking by other experiments. If one also considers that amphiphilicity of TM helices into account [22, any prediction method has limited accuracy, it 23], using different or various hydrophobicity is optimal for a prediction method to output not indices [24,25], employing discriminant a single prediction but a list of possibilities with analysis [26]. Hydropathy-type methods can be certainty measures, so that further experiments generalized as follows: given a hydrophobicity can be designed to distinguish among several index ft for each amino acid type t, weights hn topology models. Jones et al. [32] elegantly for the position n in the sliding window and a applied a dynamic programming algorithm for threshold B which is constant: this purpose. TopPred II by von Heijne [41] ranks several topology models according to the ( j,m) ( j,m−n) D = ∑ M hn − B , 1 ≤ m ≤ L j (1) ‘positive-inside rule’ (see below, in Topology n Prediction section). We also mention here that (j,m-n) (j,m) neural network-based and hidden Markov where M = fp(j,m-n), and p denotes the model-based methods can assign reliability residue type at position m in chain j. Lj is the index to the predictions made for each residue. length of the chain j. The position m in chain j (j,m) is decided to be transmembrane if D > 0. TSEG program Treating ft and hn as variables, Edelman [27] derived their optimal values by minimizing the In this section, we describe TSEG, which we following quadratic equation S (which have developed recently. When annotating corresponds to minimizing prediction errors): biological functions of genes in a genome, the prediction of higher order structures can be 2 S = ∑()D ( j,m) − q ( j,m) (2) used in order to compensate for the limitation j,m of the conventional sequence similarity search [42-44]. This is especially true for membrane (j,m) Here, q = 1 when (j,m) in the training set is proteins, since the number of TM segments in a (j,m) transmembrane, otherwise q = 0. protein can be related to a functional subclass It has been observed that not only the in some cases, such as seven-TM receptors or inside of TM segments but also their flanking six-TM transporters. One of the objectives of regions have preferable amino acids, e.g. developing the Transmembrane SEGment aromatic residues [28,29]. One can utilize the prediction program [45] was to enhance the 3 functional identification of TM proteins. To segment, and ω is the angle in degrees. Figure 1 capture detailed properties of TM segments, is the model of the membrane protein used in our method is based on a classification of TM the prediction, according to the locations of the segments in a database. In fact, not all TM five groups of TM segments. Here, membrane segments are equally hydrophobic. For example, proteins with more than fourteen TM segments TM segments of single spanning TM proteins are excluded because there were not enough are known to be highly hydrophobic and have sequences in the database to execute statistical less amphiphilicity [22], whereas the last TM calculations. The most hydrophobic group, segments in seven-TM proteins are relatively group1, appears only in the single spanning less hydrophobic and often difficult to detect by membrane proteins. Group2, which has prediction methods [30,32]. Thus, we have relatively high hydrophobicity, appears at the classified TM segments first by the total N-terminal TM segments in most of the number of TM segments in a protein and the membrane protein classes (except for the 12TM order in which they appear in the protein proteins). It is possible that these TM segments sequence, and at last merged similar ones into are involved in the initiation of membrane the same group. The second feature is that the insertion, which may correspond to what TSEG enumerates possible models as ranked Eisenberg et al. [22] called ‘initiators’. The last by their scores, where a model is distinguished segment of the seven TM proteins belongs to by the number of TM segments in a protein and the least hydrophobic group 5, which also represented by the order of different groups appears in the eighth segment of the nine TM (types) of TM segments. A model of globular proteins. proteins is included as well. The prediction procedure is based on the We have classified TM segments into detection of different TM segment groups using five groups according to their average hydro- different discriminant functions, followed by phobicity and amphiphilicity (or AP value), matching with the 15 models shown in Figure 1. using the Mahalanobis distance of the linear In the first stage, a query sequence is applied to discriminant analysis [46]. The dataset of 2876 each model and the best candidates for TM non-redundant TM protein sequences used here segments in the model are selected. This was extracted from Swiss-Prot rel.