Predicting Cellular Localization and Membrane Topology
Bioe 190: Intro to Data Science, Fall 2016

References for this lecture
• “State-of-the-art in membrane protein prediction” Chen & Rost – Applied Bioinformatics, 2002
• “Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes” Krogh et al, Jnl of Mol. Biol. 2001 (The algorithm behind the TMHMM server.)

• Anders Krogh: Postdoc with David Haussler, and 1st author on the original HMM papers from UCSC
• Burkhard Rost: “Theoretical physicist trapped in molecular biology by the bright colours of life”. Protein 2D and 3D structure prediction (among many other bioinformatics methods)
• Erik Sonnhammer: The senior author of InParanoid and a leader of the Quest for Orthologs consortium, Pfam
• Gunnar von Heijne: “Positive Inside Rule”. Leader in experimental investigation of membrane proteins

Many different types of membrane proteins
http://withfriendship.com/user/sathvi/membrane-protein.php
http://www.bch.msu.edu/faculty/garavito/omp_porins.html

Globular, fibrous and membrane structures

Challenge: bioinformatics methods to predict the membrane topology* are quite rough
Reported accuracy (by method developers) has been overestimated due to limited/skewed benchmark data
*labelling of protein: which parts are in the membrane, which parts are cytoplasmic, which parts are extracellular

Case study #1: IFITM3 (IFITM: Interferon-inducible transmembrane)
UniProt record
UniProt topology annotation for IFM3_HUMAN
TMHMM predicts a different topology
Experimental investigation of IFITM3 (IFM3_HUMAN)
http://www.nature.com/articles/srep24029
3 topology models evaluated

Test your comprehension:
Which topology does SwissProt predict?
Which topology does TMHMM predict?
Experimental data supports…

Prediction method 1: subcellular localization and topology by homology
• Subcellular localization can often be assigned by searching for homologous sequences whose subcellular location is known.
• The principle used is that evolution conserves function, and that membrane localization and orientation/topology are key components of protein function

Prediction method 2: analysis of sequence properties
• First attempts to classify proteins by cellular localization based on amino acid sequence properties: Nishikawa and Ooi (J.Biochem. 1982)
  – amino acid composition, disulphide bonds, the secondary structural class related to function and localization
  – Early results were promising, but based on a small sample.
http://mendel.imp.univie.ac.at/CELL_LOC/

Extracting information from sequence
• Signal peptides: short sequences in the protein used to target the protein to specific cellular compartments.
• Signal patches (clusters of amino acids in close proximity in 3D structure, but distant in primary sequence) are also found
• Examination of amino acids at the structure surface can be particularly helpful; different amino acids have subtle preferences for different environments

Prediction by signal peptide detection
• Some proteins have sequence signals that determine their translocation to organelles or outside the cell
  – Claros et al. Curr.Op.Struct.Biol. (1997).
• These patterns are not clear cut, especially for the intracellular organelle targeting peptides;
  – prediction accuracy is limited
  – Nielsen et al. Prot.Eng. (1997) v.10, 1
• Combinations of compositional and signal sequence analyses have been used in expert systems for the prediction of cellular localization
  – Nakai & Kanehisa Genomics (1992);
  – In general: not systematic and not rigorously tested
http://mendel.imp.univie.ac.at/CELL_LOC/

TM/Signal Peptide/Localization prediction servers
• Phobius: http://phobius.sbc.su.se
  – combined topology and signal peptide prediction
• TMHMM: http://www.cbs.dtu.dk/services/TMHMM/
  – TM helix prediction
• TargetP: http://www.cbs.dtu.dk/services/TargetP/
  – subcellular localization of eukaryotic proteins
• SignalP: http://www.cbs.dtu.dk/services/SignalP/
  – predicts the presence and location of signal peptide cleavage sites

SignalP and TargetP
http://www.cbs.dtu.dk/services/SignalP/

Transmembrane helix prediction

Helical membrane proteins
• Key components in cell-cell signalling
• Mediate transport of ions and solutes across membrane
• Crucial for recognition of self
• Major class of drug targets
  – More than 50% of prescription drugs act on GPCRs (G-protein coupled receptors)
  – Multi-billion dollar industry

Many predicted; few known
• Solved structures available for very few membrane proteins
• Predicted 7-10K helical membrane proteins in human genome (~25% of genome!)
Chen and Rost, 2002

Helical membrane proteins challenge bioinformatics
• Very little info about 3D structures
  – Very hard to crystallize
  – Hardly traceable by nuclear magnetic resonance (NMR) spectroscopy
• Relatively easy to identify (rough) location of helices through low-resolution experiments
  – C-terminal fusion with indicator proteins
  – Antibody binding
Chen and Rost, 2002

Concepts for predicting TM helix location and topology
• Hydrophobicity scales provide simple criteria for prediction
• TM helices are predominantly non-polar
• TM helix length between 12-35 aa
• Globular regions between membrane helices typically shorter than 60 aa
• “Positive inside rule” (von Heijne)
  – Connecting loop regions on inside have more positive charge than loop regions on outside
Chen and Rost, 2002

Hydrophobicity scales
• Kyte and Doolittle (1982)
  – Hydropathy scale, moving window approach
  – Window of 19 residues discriminated best between membrane and globular proteins
• Other work equally successful
• Drawback: methods fail to discriminate between membrane regions and highly hydrophobic globular segments
Chen and Rost, 2002

Other clues
• Amino acid preferences for membrane and non-membrane proteins
  – Training data for methods derived from proteins identified as containing TM helices, as well as other secondary structure types
  – Higher accuracy
Chen and Rost, 2002

Including topology helps
• TopPred (von Heijne, 1992)
  – Topology prediction, using hydrophobicity analysis, possible topologies ranked by positive-inside rule
• SOSUI (Hirokawa et al, 1998)
  – Combined KD hydropathy, amphiphilicity, relative and net charges, protein length
Chen and Rost, 2002

Including homology helps
• Alignment of homologs known to help secondary structure prediction (Rost and Sander, 1993)
• Note: for 20-30% of proteins in any genome, no identifiable homologs can be found!
• PHDhtm first method using homology info for membrane prediction
  – Uses neural networks, DP, multiple alignment
  – “one of the most accurate prediction methods”
Chen and Rost, 2002

Including homology helps
• TMAP (Persson and Argos, 1996)
  – Derived amino acid propensities from known TMs
    • 4-residue caps of membrane helices
    • 21 residue TM segments
    • Found at outside of membrane: N D G F P W Y V
    • Found mostly inside: A R C K
  – Used these propensities to improve prediction
Chen and Rost, 2002

Grammatical rules
• TMHMM pioneered building models of predicted membrane proteins in one consistent methodology
  – Sonnhammer et al 1998, Krogh et al 2001
• Similar concept implemented in HMMTOP
  – Tusnady and Simon, 1998
• MEMSAT similar to HMMTOP
  – Jones et al, 1994
Chen and Rost, 2002

Topology questions
• The topology of a TM protein indicates its orientation with respect to the membrane:
  – which regions are outside (extracellular) and which are cytoplasmic
• Predicted topologies turn out to be wrong roughly as often as they’re correct…
Chen and Rost, 2002

Sequence information aiding TM recognition
• Hydrophobic stretches (for lipid bilayer)
• “Positive inside rule”
  – Von Heijne 1986, 1994
  – Abundance of positively charged residues
• Improved predictions through use of:
  – Sliding windows
  – Multiple alignment
  – Neural networks
Chen and Rost, 2002

Errors in TM prediction
• Under-prediction (False negative)
• Over-prediction (False positive)
• False merge
  – Two adjacent helices predicted to be one helix
• False split
  – One long helix predicted to be two
• Inexact placement of helices
Chen and Rost, 2002

Prediction accuracy (1)
• Performance accuracy overestimated significantly!
  – “developers have overrated their methods by 15-50%” Chen et al, unpublished
• Why do developers overestimate their method accuracy?
  – Validation performed on proteins closely related to training sequences (and thus not indicative of performance on novel sequences)
Chen and Rost, 2002

Prediction accuracy (2)
• “Membrane helices are not entirely conserved across species”
  – Implies that even related proteins may have different topologies (# TM helices, orientation) and perform different cellular functions
• Measures of accuracy of prediction not comparable across methods, due to lack of standard benchmark
• Benchmark dataset now available at EBI
Chen and Rost, 2002

Chen et al findings
• Most TM methods get the number of helices right for most membrane proteins
• 86% of TMH residues predicted by best methods
• 70-75% of proteins get all TM helices predicted correctly by top methods
• Topology correct for only half of all proteins
Chen and Rost, 2002

Prediction accuracy (4)
• Some papers have claimed that simple hydrophobicity scales are as accurate as more sophisticated methods
  – Chen et al disagree
Chen and Rost, 2002

Prediction accuracy (5)
• All methods confuse membrane helices with signal peptides
  – Best separation provided by ALOM2 (Nakai and Kanehisa)
    • Optimized to sort proteins into classes of subcellular localization

Since Rost’s paper, the Phobius server was developed to integrate TM and signal peptide prediction
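
Worked example: hydropathy window plus positive-inside rule

The two simplest ideas from the slides above (a 19-residue moving window over the Kyte and Doolittle hydropathy scale, and von Heijne's positive-inside rule) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used by TMHMM, TopPred or any other server named in this lecture. The function names are hypothetical; the 1.6 cutoff, scoring each window at its central residue, counting only K and R as positive charges, and resolving ties as "in" are all choices made for the sketch.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_means(seq, window=19):
    """Mean hydropathy of every full window, indexed by window start."""
    # Assumption: residues missing from the scale (e.g. X) score 0.0.
    scores = [KD.get(aa, 0.0) for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def predict_tm_segments(seq, window=19, threshold=1.6):
    """Return (start, end) spans, 0-based and end-exclusive.

    A residue is called membrane-like when the window centred on it has
    a mean hydropathy of at least `threshold` (an assumed cutoff).
    Consecutive membrane-like residues are joined into one segment.
    """
    half = window // 2
    means = window_means(seq, window)
    segments = []
    run_start = None
    for i, mean in enumerate(means):
        center = i + half
        if mean >= threshold:
            if run_start is None:
                run_start = center
        elif run_start is not None:
            segments.append((run_start, center))
            run_start = None
    if run_start is not None:
        segments.append((run_start, len(means) + half))
    return segments


def n_terminus_location(seq, segments):
    """Apply the positive-inside rule to guess where the N-terminus lies.

    Loops alternate sides of the membrane, so loops 0, 2, 4, ... share
    the N-terminal side. The side with more K/R is taken as cytoplasmic.
    """
    bounds = [0]
    for start, end in segments:
        bounds += [start, end]
    bounds.append(len(seq))
    loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
    positives = [sum(loop.count(aa) for aa in "KR") for loop in loops]
    n_side = sum(positives[0::2])
    other_side = sum(positives[1::2])
    return "in" if n_side >= other_side else "out"


# Toy sequence (not a real protein): lysine-rich loop, 21 leucines, acidic loop.
toy = "K" * 10 + "L" * 21 + "D" * 10
toy_segments = predict_tm_segments(toy)
toy_orientation = n_terminus_location(toy, toy_segments)
```

On the toy sequence this finds a single segment inside the leucine stretch and places the lysine-rich N-terminus in the cytoplasm. Note that a sketch like this inherits the drawback listed on the hydrophobicity slide: it cannot tell a membrane helix from a highly hydrophobic globular segment or from a signal peptide.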