Prediction of Protein-Protein Interactions Using Protein Signature Profiling
Total Page:16
File Type:pdf, Size:1020Kb
Article Prediction of Protein-Protein Interactions Using Protein Signature Profiling Mahmood A. Mahdavi and Yen-Han Lin* Department of Chemical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada. Protein domains are conserved and functionally independent structures that play an important role in interactions among related proteins. Domain-domain inter- actions have been recently used to predict protein-protein interactions (PPI). In general, the interaction probability of a pair of domains is scored using a trained scoring function. Satisfying a threshold, the protein pairs carrying those domains are regarded as “interacting”. In this study, the signature contents of proteins were utilized to predict PPI pairs in Saccharomyces cerevisiae, Caenorhabditis ele- gans, and Homo sapiens. Similarity between protein signature patterns was scored and PPI predictions were drawn based on the binary similarity scoring function. Results show that the true positive rate of prediction by the proposed approach is approximately 32% higher than that using the maximum likelihood estimation method when compared with a test set, resulting in 22% increase in the area un- der the receiver operating characteristic (ROC) curve. When proteins containing one or two signatures were removed, the sensitivity of the predicted PPI pairs in- creased significantly. The predicted PPI pairs are on average 11 times more likely to interact than the random selection at a confidence level of 0.95, and on aver- age 4 times better than those predicted by either phylogenetic profiling or gene expression profiling. Key words: protein-protein interaction, protein signature, ROC curve Introduction Protein-protein interaction (PPI) is the key element utilized as input to predict more accurate interac- of any biological process in a living cell. Proteins in- tions in another organism (4 ). Intermolecular or in- teract through their functional subunits (1 ). Protein tramolecular interactions among protein families that domains, active sites, and motifs (collectively called share one or more domains are implemented to in- signatures) are sub-sequence functional and conserved fer interactions among proteins (5 ). Domain-domain patterns that are essential to the functioning of in- relationships are used to predict interactions at the dividual cells and are the interfaces used in interac- protein level. In the association method (6 ), interact- tions at the protein level (2 ). With the completion of ing domains are learned from a dataset of experimen- genome sequences of many organisms, genome-wide tally determined interacting proteins, where one pro- characterization of protein signatures is now practi- tein contains one domain and its interacting partner cal. Although proteins are specified by unique amino contains the other domain. The probabilistic model of acid sequences, the signature content of a protein se- maximum likelihood estimation (MLE) outperforms quence is crucial to determining interactions in which the association method through taking the experi- the particular protein is involved. mental errors into account. Following a recursive cal- Protein signature (domain) information has been culation procedure, in MLE method probabilities for used to predict PPI. Naively, when two proteins are domain-domain interactions are predicted based on known to interact, their homologs in other organisms the observation of interactions between their corre- are assumed to interact based on comparative analysis sponding proteins. Then the prediction is extended (3 ). Domain contents of the interacting partners are to the protein level, assuming that two proteins inter- act if and only if at least one pair of domains from *Corresponding author. the two proteins interact (7 ). Potentially interact- E-mail: [email protected] ing domain (PID) pairs are extracted from an ex- Geno. Prot. Bioinfo. Vol. 5 No. 3–4 2007 177 PPI Prediction by Protein Signature Profiling perimentally confirmed protein pair dataset using the tein interaction dataset of yeast. More than half of PID matrix score as a measure of domain interaction those signature pairs (22 out of 40 pairs) contained probability. The information for interacting proteins similar signatures and the rest of them were function- could be enriched about 30 folds with the PID matrix ally close signatures. Non-identical pairs could not (8 ). In another study, the strengths of protein pairs pass the threshold, even though the threshold was are incorporated into the association method to en- considered very loose. Okada et al (20 ) studied the rich the predicting probability (9 ). As many domain role of common domains in the extraction of accu- structures are shared by different organisms, the inte- rate functional associations in interacting partners. gration of data from multiple sources may strengthen It has been shown that, when two proteins share a the reliability of domain associations and protein in- similar domain structure, their interaction probabil- teractions (10 ). Moreover, the combination of protein ity score is higher than that of two proteins with non- interaction data from multiple species, molecular se- similar domains (21 ). In a study of interacting sig- quences, and gene ontology is used to construct a set natures in the SCOP database, interacting signature of high-confidence domain-domain interactions (11 ). pairs were predicted based on the finding that they In all above-mentioned methods, if a probability use significantly higher surfaces to form these inter- score meets a certain threshold, then the domains actions, and interestingly, like-like interacting signa- and subsequently related proteins are considered “in- tures were observed in high frequency (2 ). As re- teracting”. However, these methods do not distin- ported by Ramini and Marcotte (22 ), proteins shar- guish between single-unit proteins and multi-unit pro- ing common signatures possess a high possibility of teins. To overcome this limitation, a method based being co-evolved in a correlated manner. These com- on domain combination was proposed, which pre- mon signatures contribute to the similarity of protein dicts protein interactions according to the interactions families detected through optimal alignment between of multi-domain pairs or the interactions of domain protein family similarity matrices. Therefore, com- groups (12 ). As an alternative approach, machine mon signatures between interacting proteins enhance learning techniques have been used to train support the interaction probability of two proteins. vector machines, called descriptors (13 ). Signatures In this study, we propose a new genome-wide ap- have been used to train descriptors and each descrip- proach to predict PPI based on the observation that tor reflects the amino acid sequence of a protein that, proteins with common signatures are more likely to in turn, consists of several signatures. However, in interact. The signature content of a protein is repre- this machine learning approach, signature is defined sented by a binary profile, called the protein signature as one single amino acid and its two neighbors (three profile, and then the similarity between two profiles consecutive letters), which is totally different from is scored using a binary similarity function. Imposing the definition utilized in this article as functional a threshold based on a pre-determined significance conserved patterns. Recently, interactomes (14–17 ) level, two proteins are considered “interacting” if they and databases, such as the Database of Interacting satisfy the significant threshold value. Furthermore, Proteins (18 ), have been used as reliable sources for by removing proteins with one or two known signa- mining interacting domains, which may contribute tures from the dataset, the false positive rate of the to inferring uncharacterized interacting proteins (19 ). predicted PPI dataset reduces significantly. Different Therefore, signature contents of proteins play a cru- from the domain-based methods that score the rela- cial role in predicting protein interactions. Signature- tionship between two protein domains and extrapo- based PPI prediction techniques rely on statistically late such a relationship to infer PPI, our approach di- significant related signatures. When the interac- rectly scores protein relationships based on the signa- tion probability score between two signatures (in two ture content of each individual protein and the extent different proteins) is greater than a threshold value, of commonality in signature patterns. The more sig- such a relationship is extended to the corresponding natures in common, the higher the similarity score will proteins and the potential interaction is inferred. be between two different profiles. We applied this ap- Close assessment of the protein pairs whose signa- proach to three organisms: Saccharomyces cerevisiae tures possess high interaction probability scores shows (yeast), Caenorhabditis elegans (worm), and Homo that many of these protein pairs share at least one sapiens (human). Although at the time being, a rel- common signature. Sprinzak and Margalit (6 )re- atively small portion of genes in each genome have ported 40 overrepresented signature pairs in the pro- been identified with their signatures, the proposed 178 Geno. Prot. Bioinfo. Vol. 5 No. 3–4 2007 Mahdavi and Lin approach is capable of covering the entire genome as spectively. The predicted PPI pairs and their cor- more genes with known signature contents