Biochemistry Biostatistics and Bioinformatics Protein Sequence Analysis

Description of Module
Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 05 Protein Sequence Analysis

1. Objectives
In this module, the students will:
1. Understand protein sequence analysis for Biochemistry and Molecular Biology experiments
2. Learn to download annotated protein sequences from UniProtKB
3. Compute various biochemical parameters for a given protein sequence
4. Learn prediction of post-translational modifications
5. Learn prediction of signal peptides and transmembrane helices in a given protein sequence
6. Learn to download raw protein sequences using a genome browser
7. Analyse a given protein sequence to find repeats using RADAR and visualize the presence of direct repeats using DotPlot
8. Use InterProScan to find families, domains, repeats and sites in a given protein sequence
9. Use PeptideCutter to search an input protein sequence for sites cleaved by enzymes and/or chemicals that cut peptide bonds

2. Concept Map
Protein Sequence Analysis:
Downloading UniProtKB Sequence
Protein Parameter Computation
PTM Prediction
Signal and TM Peptide Prediction
Repeat Analysis and Visualization
Using InterProScan
Using PeptideCutter

3. Protein Sequence Analysis
Protein sequence analysis for Biochemistry and Molecular Biology experiments begins with obtaining a sequence in the laboratory or from a sequence database. This is followed by computing various biochemical parameters, predicting the signal peptide and transmembrane helices, and predicting post-translational modifications. To visualize the presence of repeats, DotPlot analysis is conducted.
To gain additional information from known databases, the PredictProtein tool is used for detecting various features, and the InterPro tool is used for functional analysis of proteins by classifying them into families. Finally, for protein identification using mass spectrometry, the PeptideCutter tool is used to search the protein sequence to be identified for sites cleaved by enzymes and/or chemicals that cut peptide bonds.

3.1. Downloading an annotated protein sequence
Visit http://www.expasy.org/, search UniProtKB for “Glycophorin A Human” and click the search button. From the results page, proceed to the ensuing page, choose the GLPA_HUMAN entry at serial number 2 and click its hyperlink to reach the Glycophorin A page.

The most important element on this page is the Display side bar, from which one can jump to any of the listed features. The features include function, names & taxonomy, subcellular location, post-translational modifications & processing, interactions with other proteins, 3-D structures, conserved families and domains, sequence & external links to other sequence databases, and publications & literature information. The information for human Glycophorin A is presented for subcellular location and post-translational modifications/processing.

In the Format tab of the main page, select FASTA (canonical). Download the FASTA sequence and save it as the file GLPA.FA using Notepad.

The mature peptide spans amino acids 20 to 150 and has three domains. The N-terminal extracellular domain carries 16 attached oligosaccharide units with nearly 100 sugars, rich in sialic acid, which make the RBC anionic and thus hydrophilic. The middle region is a transmembrane helix, and the C-terminal region is the cytoplasmic domain. The complete protein sequence is shown next; the mature protein (residues 20 to 150) is the portion highlighted with a green background.
>sp|P02724|GLPA_HUMAN Glycophorin-A OS=Homo sapiens GN=GYPA PE=1 SV=2
MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH
EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK
SPSDVKPLPSPDTDVPLSSVEIENPETSDQ

3.2. Protein Parameter Computation
The ProtParam tool computes various physical and chemical parameters for a given protein. The computed parameters include molecular weight, theoretical pI, amino acid composition and atomic composition, which are self-explanatory. In addition, the extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY) are calculated.

Visit http://web.expasy.org/protparam/, paste the 131-amino-acid mature protein sequence (residues 20 to 150, highlighted with a green background) and click the button to compute the parameters. The results page is shown next.
[Figure: ProtParam results page]

The extinction coefficient indicates how much light a protein absorbs (represented by the absorbance, A) at a certain wavelength and is useful during protein purification. The Lambert-Beer law defines

$A = \log(I_0/I) = \varepsilon c l$

where $I_0$ is the intensity of the incident light, $I$ is the intensity of the transmitted light, $c$ is the concentration of the absorbing protein, $l$ is the path length through the solution (the thickness of the cuvette), and $\varepsilon$ is the molar extinction coefficient (molar absorbance coefficient) at a particular wavelength for a particular absorbing protein. Therefore, the molar extinction coefficient is defined as $\varepsilon = A/(cl)$. With the concentration in M and the path length in cm (a 1 cm cuvette is commonly used), the unit of the molar absorbance coefficient is $\mathrm{M^{-1}\,cm^{-1}}$ ($\mathrm{dm^{3}\,mol^{-1}\,cm^{-1}}$).

It has been shown that $\varepsilon_{280}$, the coefficient at 280 nm, can be determined from the amino acid sequence, with amino acid residues acting as the chromophores (Gill, S. C. and von Hippel, P. H., 1989, Calculation of protein extinction coefficients from amino acid sequence data. Analytical Biochemistry, 182, 319–326.
Erratum: Analytical Biochemistry, 1990, 189, 283). The contributions are $\varepsilon_{280} = 125$ for each disulphide bond, $\varepsilon_{280} = 5500$ for each Trp (W) and $\varepsilon_{280} = 1490$ for each Tyr (Y).

Consider the following protein sequence:
KYYGNGVTCGKHSCSVDWGKATTCIINNGAMAWATGGHQGNHKC
It contains four Cys residues (two possible disulphide bonds), two tryptophan residues and two tyrosine residues. Therefore, $\varepsilon_{280} = 2 \times 125 + 2 \times 5500 + 2 \times 1490 = 14230\ \mathrm{M^{-1}\,cm^{-1}}$ ($\mathrm{dm^{3}\,mol^{-1}\,cm^{-1}}$) can be calculated for this sequence. ProtParam reported two extinction coefficients for this sequence, one assuming that all pairs of Cys residues form cystines and one assuming that all Cys residues are reduced:
[Figure: ProtParam extinction coefficient output]
From experimental knowledge, it is established that all pairs of Cys residues in this protein form disulphide bonds (cystines); therefore, $\varepsilon_{280} = 14230\ \mathrm{M^{-1}\,cm^{-1}}$ can be used during purification of this protein.

For Glycophorin A, we find no Cys, no tryptophan and four tyrosine residues. Therefore, $\varepsilon_{280} = 4 \times 1490 = 5960\ \mathrm{M^{-1}\,cm^{-1}}$ ($\mathrm{dm^{3}\,mol^{-1}\,cm^{-1}}$) can be calculated for this sequence. The ProtParam tool reported the same value.
[Figure: ProtParam extinction coefficient output for Glycophorin A]

Molar absorbance coefficients are sometimes large; the absorbance coefficient is therefore also expressed for a 1% or 0.1% solution. For Glycophorin A, the ProtParam tool reported:
[Figure: ProtParam absorbance output for Glycophorin A]

Half-life is the estimated time taken for the amount of a protein in a given cell to fall to one half after its synthesis. ProtParam estimates it for three systems: mammalian reticulocytes, yeast and E. coli. For Glycophorin A, the ProtParam tool reported:
[Figure: ProtParam estimated half-life output for Glycophorin A]

The instability index provides an estimate of the stability of a protein in a test tube. For Glycophorin A, the ProtParam tool reported:
[Figure: ProtParam instability index output for Glycophorin A]
A protein with an instability index smaller than 40 is predicted to be stable, and a value above 40 predicts that the protein may be unstable.
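The extinction coefficient arithmetic described in this section can be reproduced with a few lines of Python. This is a minimal sketch and not ProtParam's own code: it only counts W, Y and Cys pairs and applies the three contributions quoted in the text. The function names `extinction_280` and `concentration_molar` are hypothetical, and the default of pairing every two Cys residues into one cystine is an assumption that must be confirmed experimentally.

```python
from typing import Optional

# Contributions at 280 nm in M^-1 cm^-1, as quoted in the text.
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125


def extinction_280(seq: str, cystines: Optional[int] = None) -> int:
    """Estimate the molar extinction coefficient at 280 nm from a sequence.

    If `cystines` is None, assume every pair of Cys residues forms a cystine;
    pass cystines=0 to model fully reduced Cys residues.
    """
    seq = seq.upper()
    if cystines is None:
        cystines = seq.count("C") // 2
    return (seq.count("W") * EXT_TRP
            + seq.count("Y") * EXT_TYR
            + cystines * EXT_CYSTINE)


def concentration_molar(absorbance: float, epsilon: float, path_cm: float = 1.0) -> float:
    """Rearranged Lambert-Beer law: c = A / (epsilon * l)."""
    return absorbance / (epsilon * path_cm)


EXAMPLE = "KYYGNGVTCGKHSCSVDWGKATTCIINNGAMAWATGGHQGNHKC"

# Complete Glycophorin A sequence (P02724) as given in section 3.1.
GLPA_FULL = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)
# Mature peptide: residues 20 to 150 (1-based, inclusive), 131 amino acids.
GLPA_MATURE = GLPA_FULL[19:150]

print(extinction_280(EXAMPLE))              # 14230, all Cys paired as cystines
print(extinction_280(EXAMPLE, cystines=0))  # 13980, all Cys reduced
print(extinction_280(GLPA_MATURE))          # 5960, four Tyr, no Trp, no Cys
```

Once the coefficient is known, `concentration_molar(A, epsilon)` converts a measured absorbance at 280 nm into a molar concentration for a given path length, which is how the value is typically used during purification.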
The aliphatic index of a protein is calculated from the relative volume occupied by the aliphatic side chains of alanine, valine, isoleucine and leucine; a higher value is regarded as a positive factor for the thermostability of globular proteins. For Glycophorin A, the ProtParam tool reported:
[Figure: ProtParam aliphatic index output for Glycophorin A]

GRAVY (Grand Average of Hydropathy) is the average hydropathicity of a protein sequence: the sum of the hydropathy values of all residues divided by the number of residues, using the scale defined by Kyte J. and Doolittle R.F. (1982) J. Mol. Biol. 157:105-132, shown next.
[Figure: Kyte and Doolittle hydropathy scale]
For Glycophorin A, the ProtParam tool reported:
[Figure: ProtParam GRAVY output for Glycophorin A]

3.3. Post-translational modification (PTM) analysis on proteins
Computational prediction of post-translational modifications, including phosphorylation, acetylation and methylation, is very useful for biochemical experimental design. Several online servers are available for prediction of post-translational modifications. A partial list can be reached on the ExPASy server at http://www.expasy.org/proteomics/post-translational_modification.

Some others are listed next.
1. www.phosphosite.org
2. http://www.phosida.com/
3. http://phospho.elm.eu.org/index.html
4. http://gps.biocuckoo.org/

3.4. Signal peptide and transmembrane helices prediction
Phobius is a combined signal peptide and transmembrane topology prediction tool and is available online at http://phobius.sbc.su.se/.

For the complete Glycophorin A sequence, Phobius predicted a signal peptide (amino acids 1-19), followed by an extracellular (non-cytoplasmic) domain (amino acids 20-91), a transmembrane domain (amino acids 92-114) and finally a cytoplasmic domain (amino acids 115-150) inside the red blood cell.
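The residue ranges reported by Phobius can be related to the hydropathy idea behind GRAVY by averaging Kyte and Doolittle values over each predicted region. The sketch below is only an illustration: it is not how Phobius makes its prediction, and the names `gravy`, `region` and `REGIONS` are hypothetical. The scale values are those of Kyte and Doolittle (1982), and the ranges are the 1-based, inclusive ranges quoted in the text.

```python
# Kyte and Doolittle (1982) hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Complete Glycophorin A sequence (P02724) as given in section 3.1.
GLPA = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)

# Regions predicted by Phobius, as quoted in the text (1-based, inclusive).
REGIONS = {
    "signal peptide": (1, 19),
    "non-cytoplasmic": (20, 91),
    "transmembrane": (92, 114),
    "cytoplasmic": (115, 150),
}


def gravy(seq: str) -> float:
    """Sum of hydropathy values divided by the number of residues."""
    return sum(KD[aa] for aa in seq) / len(seq)


def region(seq: str, start: int, end: int) -> str:
    """Return residues start..end using 1-based, inclusive numbering."""
    return seq[start - 1:end]


for name, (start, end) in REGIONS.items():
    segment = region(GLPA, start, end)
    print(f"{name:16s} {start:3d}-{end:3d}  length={len(segment):2d}  "
          f"mean hydropathy={gravy(segment):+.2f}")
```

Under this scale the transmembrane segment has a clearly positive mean hydropathy, while the extracellular and cytoplasmic domains are negative, which is consistent with the topology described above.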
The Phobius prediction is also presented graphically.
[Figure: Phobius prediction plot]
TopPred, available at http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::toppred, is also used for prediction of membrane proteins based on hydrophobicity values for a given size of amino
