
Biostatistics and Biochemistry Protein Sequence Analysis

Description of Module

Subject Name: Biochemistry

Paper Name: 13 Biostatistics and Bioinformatics

Module Name/Title: 05 Protein Sequence Analysis


1. Objectives: In this module, the students will:

1. Understand protein sequence analysis for Biochemistry and Molecular biology experiments
2. Learn to download annotated protein sequences from UniProtKB
3. Compute various Biochemical parameters for a given protein sequence
4. Learn prediction of post-translational modifications
5. Learn prediction of signal peptide and transmembrane helices in a given protein sequence
6. Learn to download raw protein sequences using a genome browser
7. Analyse a given protein sequence to find repeats using RADAR and visualize the presence of direct repeats using DotPlot
8. Use InterProScan to find the family, domains, repeats and sites in a given protein sequence
9. Use PeptideCutter to find the sites in an input protein sequence at which enzymes and/or chemicals cleave peptide bonds

2. Concept Map

Protein Sequence Analysis

Downloading UniProtKB Sequence

Protein Parameter Computation

PTM Prediction

Signal and TM peptide Prediction

Repeat Analysis and Visualization

Using InterProScan

Using PeptideCutter

3. Protein Sequence Analysis


Protein sequence analysis for Biochemistry and Molecular biology experiments begins with obtaining a sequence in the laboratory or from a sequence database. This is followed by computing various Biochemical parameters, predicting signal peptides and transmembrane helices, and predicting post-translational modifications. To visualize the presence of repeats, DotPlot analysis is conducted. To gain additional information from known databases, the PredictProtein tool is used for detecting various features and the InterPro tool for functional analysis of proteins classified into families. Finally, for protein identification using mass spectrometry, the PeptideCutter tool is used to find the sites in the protein sequence at which enzymes and/or chemicals cleave peptide bonds.


3.1. Downloading an annotated protein sequence

Visit http://www.expasy.org/, search UniProtKB for “Glycophorin A Human” and click the search button.

In the result page click on

In the ensuing page, choose the GLPA_HUMAN entry at serial number 2 and click its hyperlink to reach the Glycophorin A page.

The most important element on this page is the Display side bar, from which one can jump to any of the listed features. The features include function, names & taxonomy, subcellular location, post-translational modifications & processing, interactions with other proteins, 3-D structures, conserved families and domains, sequence & external links to other sequence databases, and publications & literature information. The information for human Glycophorin A is presented for subcellular location and post-translational modifications/processing.


In the Format tab of the main page, select FASTA Canonical.

Download the FASTA sequence and save it as the GLPA.FA file using NotePad. The mature peptide spans amino acids 20 to 150 and has three domains. The N-terminal extracellular domain carries 16 attached oligosaccharide units with nearly 100 sugars, rich in sialic acid, which make the RBC surface anionic and thus hydrophilic. The middle region is a transmembrane helix, and the C-terminal region is the cytoplasmic domain. The sequence of the complete protein is shown next, with the mature protein (residues 20 to 150) highlighted with a green background.

>sp|P02724|GLPA_HUMAN Glycophorin-A OS=Homo sapiens GN=GYPA PE=1 SV=2
MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH
EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK
SPSDVKPLPSPDTDVPLSSVEIENPETSDQ
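The saved record can also be handled programmatically. The following is a minimal Python sketch, assuming the FASTA record above is available as a string. The helper names parse_fasta and mature_peptide are illustrative only and are not part of any ExPASy or UniProtKB tool. It extracts the mature peptide (residues 20 to 150, counted from 1) and tallies its amino acid composition.

```python
from collections import Counter

# FASTA record for GLPA_HUMAN, copied from the entry shown above.
GLPA_FASTA = """>sp|P02724|GLPA_HUMAN Glycophorin-A OS=Homo sapiens GN=GYPA PE=1 SV=2
MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH
EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK
SPSDVKPLPSPDTDVPLSSVEIENPETSDQ
"""


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not a FASTA record")
    header = lines[0][1:]              # drop the leading '>'
    sequence = "".join(lines[1:])      # join the wrapped sequence lines
    return header, sequence


def mature_peptide(sequence, start=20, end=150):
    """Slice residues start..end (1-based, inclusive), as numbered in UniProtKB."""
    return sequence[start - 1:end]


header, full_sequence = parse_fasta(GLPA_FASTA)
mature = mature_peptide(full_sequence)
composition = Counter(mature)

print(header.split()[0])                 # database, accession and entry name
print(len(full_sequence), len(mature))   # precursor and mature lengths
print(composition.most_common(5))        # five most frequent residues
```

The slice converts the 1-based residue numbering used in the text to Python's 0-based indexing, which is a common source of off-by-one errors when copying ranges from a database entry.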



3.2. Protein Parameter Computation

The ProtParam tool computes various physical and chemical parameters for a given protein. The computed parameters include molecular weight, theoretical pI, amino acid composition and atomic composition, which are self-explanatory. In addition, the extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY) are calculated. Visit http://web.expasy.org/protparam/, paste the 131-amino-acid mature protein sequence (the green-highlighted residues 20 to 150) and click the button to compute the parameters.

The results page is shown next


The extinction coefficient indicates how much light a protein absorbs (expressed as absorbance, A) at a certain wavelength, and is useful during protein purification. The Beer-Lambert law defines A = log10(I0/I) = εcl, where I0 is the intensity of the incident light, I is the intensity of the transmitted light, c is the concentration of the absorbing protein, l is the path length through the solution (the thickness of the cuvette), and ε is the molar extinction coefficient, or molar absorbance coefficient, of that protein at a particular wavelength.

Therefore, the molar extinction coefficient is ε = A / (cl). With c in mol/l and the commonly used cuvette of 1 cm path length, the unit of the molar absorbance coefficient is M⁻¹ cm⁻¹ (dm³ mol⁻¹ cm⁻¹). The value at 280 nm, ε280, can be estimated from the amino acid sequence, because the chromophores at this wavelength are tryptophan, tyrosine and cystine (Gill, S. C. and von Hippel, P. H., 1989, Calculation of protein extinction coefficients from amino acid sequence data. Analytical Biochemistry, 182, 319–326. Erratum: Analytical Biochemistry, 1990, 189, 283). Each disulphide bond contributes ε280 = 125, each Trp (W) ε280 = 5500 and each Tyr (Y) ε280 = 1490. Consider the following protein sequence:

KYYGNGVTCGKHSCSVDWGKATTCIINNGAMAWATGGHQGNHKC

This sequence contains four Cys residues, which can form two disulphide bonds, two tryptophan residues and two tyrosine residues. Therefore, ε280 = 2 x 125 + 2 x 5500 + 2 x 1490 = 14230 M⁻¹ cm⁻¹ (dm³ mol⁻¹ cm⁻¹) can be calculated for this sequence.

ProtParam reported two extinction coefficients for this sequence: 14230 M⁻¹ cm⁻¹, assuming that all pairs of Cys residues form cystines, and 13980 M⁻¹ cm⁻¹ (2 x 5500 + 2 x 1490), assuming that all Cys residues are reduced.

It is known from experiment that all pairs of Cys residues in this protein form disulphide bonds (cystines); therefore, ε280 = 14230 M⁻¹ cm⁻¹ can be used during purification of this protein.

For mature Glycophorin A, we find no Cys, no tryptophan but four tyrosine residues. Therefore, ε280 = 4 x 1490 = 5960 M⁻¹ cm⁻¹ (dm³ mol⁻¹ cm⁻¹) can be calculated for this sequence. The ProtParam tool reported the same value.
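The arithmetic above can be reproduced with a short Python sketch. It uses only the three contributions quoted in the text (125, 5500 and 1490 M⁻¹ cm⁻¹); the function name extinction_coefficient_280 and its cystines_formed flag are illustrative choices and do not reflect how ProtParam itself is implemented.

```python
# Molar absorbance contributions at 280 nm as quoted in the text (M^-1 cm^-1).
EXT_TRP = 5500
EXT_TYR = 1490
EXT_CYSTINE = 125


def extinction_coefficient_280(sequence, cystines_formed=True):
    """Estimate the molar extinction coefficient at 280 nm from a sequence.

    If cystines_formed is True, every pair of Cys residues is assumed to
    form one disulphide bond; otherwise all Cys are treated as reduced.
    """
    seq = sequence.upper()
    n_trp = seq.count("W")
    n_tyr = seq.count("Y")
    n_cystine = seq.count("C") // 2 if cystines_formed else 0
    return n_trp * EXT_TRP + n_tyr * EXT_TYR + n_cystine * EXT_CYSTINE


EXAMPLE = "KYYGNGVTCGKHSCSVDWGKATTCIINNGAMAWATGGHQGNHKC"

GLPA_FULL = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)
GLPA_MATURE = GLPA_FULL[19:]   # residues 20 to 150 of the precursor

print(extinction_coefficient_280(EXAMPLE))                         # 14230
print(extinction_coefficient_280(EXAMPLE, cystines_formed=False))  # 13980
print(extinction_coefficient_280(GLPA_MATURE))                     # 5960
```

Whether the oxidised or the reduced value applies has to come from experimental knowledge of the protein, which is why the flag is left to the caller.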

Because molar absorbance coefficients can be large, the absorbance of a 1% or 0.1% solution is sometimes used instead to express the absorbance coefficient. For Glycophorin A, ProtParam tool reported:

Half-life is the estimated time taken for the amount of a protein to fall to one half after its synthesis within a given cell. ProtParam estimates it for three systems: mammalian reticulocytes, yeast and E. coli. For Glycophorin A, ProtParam tool reported the following estimated half-life:

The instability index provides an estimate of the stability of a protein in a test tube. For Glycophorin A, ProtParam tool reported the following instability index:


An instability index smaller than 40 predicts a stable protein, and a value above 40 predicts that the protein may be unstable.

The aliphatic index of a protein is regarded as an indicator of the thermostability of globular proteins and is calculated from the relative volume occupied by the aliphatic side chains of alanine, valine, isoleucine and leucine. For Glycophorin A, ProtParam tool reported

GRAVY (Grand Average of Hydropathy) is the average hydropathicity of a protein sequence: the sum of the hydropathy values of all its residues, as defined by Kyte J. and Doolittle R.F. (1982) J. Mol. Biol. 157:105-132, divided by the number of residues in the sequence.
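Both the aliphatic index and GRAVY depend only on amino acid composition, so they are easy to sketch in Python. The hydropathy table below holds the Kyte-Doolittle values. The aliphatic index weights a = 2.9 and b = 3.9 are the relative side-chain volumes usually attributed to Ikai (1980); the module text does not quote them, so they are an assumption here. The function names are illustrative.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(sequence, a=2.9, b=3.9):
    """Aliphatic index from the mole percentages of Ala, Val, Ile and Leu.

    a and b weight Val and Ile/Leu relative to Ala; the defaults are an
    assumption (values attributed to Ikai, 1980), not taken from the module.
    """
    seq = sequence.upper()
    n = len(seq)
    pct = {aa: 100.0 * seq.count(aa) / n for aa in "AVIL"}
    return pct["A"] + a * pct["V"] + b * (pct["I"] + pct["L"])


# Toy peptide made only of aliphatic residues, to make the arithmetic visible.
print(round(gravy("AVIL"), 3))            # 3.575
print(round(aliphatic_index("AVIL"), 1))  # 292.5
```

A positive GRAVY value indicates an overall hydrophobic sequence and a negative value an overall hydrophilic one.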

For Glycophorin A, ProtParam tool reported


3.3. Post-translational modification (PTMs) analysis on proteins

Computational prediction of post-translational modifications, including phosphorylation, acetylation and methylation, is very useful for Biochemical experimental design. Several online servers are available for predicting post-translational modifications. A partial list can be reached on the ExPASy server at http://www.expasy.org/proteomics/post-translational_modification.


Some other resources are listed next.

1. PhosphoSitePlus at www.phosphosite.org
2. PHOSIDA at http://www.phosida.com/
3. Phospho.ELM at http://phospho.elm.eu.org/index.html
4. GPS at http://gps.biocuckoo.org/


3.4. Signal peptide and transmembrane helices prediction

Phobius is a combined signal peptide and transmembrane topology prediction tool and is available online at http://phobius.sbc.su.se/.


Phobius predicted a signal peptide from amino acids 1 to 19, followed by an extracellular (non-cytoplasmic) domain (amino acids 20-91), a transmembrane domain (amino acids 92-114) and finally a cytoplasmic domain (amino acids 115-150) inside the red blood cell. The same prediction is presented graphically, next

TopPred, available at http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::toppred, is also used for predicting membrane proteins, based on hydrophobicity values averaged over a window of a given number of amino acids. This prediction shows that Glycophorin A is an integral membrane protein.
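The idea of averaging hydrophobicity over a window can be illustrated with a short Python sketch. This is not TopPred's own algorithm or scale: it substitutes a plain rectangular window with Kyte-Doolittle hydropathy values, and the window of 19 residues and cutoff of 1.6 are assumed defaults commonly used with that scale. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    seq = sequence.upper()
    if window > len(seq):
        return []
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]   # slide the window by one
        profile.append(total / window)
    return profile


def hydrophobic_windows(sequence, window=19, cutoff=1.6):
    """Return 1-based (start, end, mean) for windows whose mean exceeds cutoff."""
    hits = []
    for i, mean in enumerate(hydropathy_profile(sequence, window)):
        if mean > cutoff:
            hits.append((i + 1, i + window, round(mean, 2)))
    return hits


GLPA = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)

for start, end, mean in hydrophobic_windows(GLPA):
    print(start, end, mean)
```

Windows that pass the cutoff are only candidates for membrane-spanning segments; a hydrophobic signal peptide can pass the same test, which is why dedicated predictors model signal peptides and transmembrane helices separately.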


TMHMM, available at http://www.cbs.dtu.dk/services/TMHMM-2.0/, is used to predict transmembrane helices in proteins. For the sequence of Glycophorin A, it predicted two transmembrane helices: the first corresponds to the signal peptide and the second is the membrane anchor helix.



3.5. Repeat analysis and visualization in protein sequences

To understand repeat analysis, the user needs a protein sequence that contains repeats. The EF3314 protein of Enterococcus faecalis is known to contain repeats. To download the protein sequence, visit the genome browser at http://microbes.ucsc.edu/ and enter Enterococcus faecalis to select the genome. In the Enterococcus faecalis genome browser, enter EF3314 in the text box. EF3314 will appear, and the display shows that the gene is encoded on the ‘-’ strand, i.e. the complementary strand. Now click on the EF3314 gene. At the bottom of the ensuing page, click on the hyperlink to the protein sequence. You will reach the page displaying the EF3314 protein in FASTA format. Copy the FASTA format sequence, paste it in NotePad and save it as ‘3314Protein.FA’, selecting all files as the type of file.

RADAR available at http://www.ebi.ac.uk/Tools/pfa/radar/ finds and aligns repeats in protein sequences.


DotPlot is very useful for appreciating sequence features visually. The most common such feature is the presence of direct repeats in a sequence. Repeated sequences (or repetitive elements, or repeats) are patterns in nucleic acid (DNA or RNA) or protein sequences that occur in multiple copies throughout the sequence.

The presence of repetitive sequences within a single sequence is difficult to appreciate by reading the sequence, but a DotPlot presents repeats visually. For example, the protein sequence ‘THISREPEATISREPEATED’ can be analysed using a DotPlot. To construct a DotPlot, one draws a matrix whose numbers of rows and columns depend on the number of residues in the sequence. In the present case there are 20 residues; therefore, a table with 21 rows and 21 columns is drawn. The residue symbols are written along the first row and the first column. The residues are then compared for each box, and in every box where the row residue and the column residue are the same, the symbol is written. Any other mark, such as a star ‘*’ or a dot ‘.’, may be written instead. The resulting picture shows one main diagonal representing identity: because the same sequence is used as both the horizontal and the vertical sequence, the identity diagonal is complete. In such an intrasequence comparison, lines parallel to the main diagonal reveal the presence of direct repeats. In the present case, the residues ‘ISREPEAT’ are directly repeated once; therefore, there is one parallel diagonal on each side of the main identity diagonal. Sometimes palindromic sequences such as ‘RADAR’ may be present. These are visible as perpendiculars cutting the main diagonal or a parallel diagonal. The palindrome ‘EPE’ is present in this case. In DNA, such sequences often represent restriction endonuclease sites.

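  T H I S R E P E A T I S R E P E A T E D
T T - - - - - - - - T - - - - - - - T - -
H - H - - - - - - - - - - - - - - - - - -
I - - I - - - - - - - I - - - - - - - - -
S - - - S - - - - - - - S - - - - - - - -
R - - - - R - - - - - - - R - - - - - - -
E - - - - - E - E - - - - - E - E - - E -
P - - - - - - P - - - - - - - P - - - - -
E - - - - - E - E - - - - - E - E - - E -
A - - - - - - - - A - - - - - - - A - - -
T T - - - - - - - - T - - - - - - - T - -
I - - I - - - - - - - I - - - - - - - - -
S - - - S - - - - - - - S - - - - - - - -
R - - - - R - - - - - - - R - - - - - - -
E - - - - - E - E - - - - - E - E - - E -
P - - - - - - P - - - - - - - P - - - - -
E - - - - - E - E - - - - - E - E - - E -
A - - - - - - - - A - - - - - - - A - - -
T T - - - - - - - - T - - - - - - - T - -
E - - - - - E - E - - - - - E - E - - E -
D - - - - - - - - - - - - - - - - - - - D

(A hyphen marks a box in which the two residues differ.)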


If the perpendicular diagonals do not cut the main or parallel diagonal then it represents an inverted repeat. One can also use two different sequences, one as vertical and other as horizontal sequence to visualise the identity between two sequences.
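The manual construction described above can be sketched in Python. The function names dot_matrix, render and direct_repeats are illustrative and unrelated to Dotter or any other DotPlot program; the sketch compares single residues exactly, without the scoring matrices or window smoothing that real DotPlot software may apply.

```python
def dot_matrix(seq_a, seq_b):
    """Boolean matrix: cell [i][j] is True where seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b, mark="*"):
    """Draw the dot plot as text, seq_b across the top and seq_a down the side."""
    rows = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        rows.append(residue + " " + " ".join(mark if hit else " " for hit in row))
    return "\n".join(rows)


def direct_repeats(seq, min_length=3):
    """Find runs on diagonals parallel to the main diagonal of a self-comparison.

    Returns (offset, start, length) tuples with a 0-based start, such that
    seq[start:start + length] == seq[start + offset:start + offset + length].
    """
    repeats = []
    n = len(seq)
    for offset in range(1, n):
        run_start, run_len = None, 0
        for i in range(n - offset + 1):   # one extra step flushes the last run
            match = i < n - offset and seq[i] == seq[i + offset]
            if match:
                if run_len == 0:
                    run_start = i
                run_len += 1
            else:
                if run_len >= min_length:
                    repeats.append((offset, run_start, run_len))
                run_len = 0
    return repeats


sequence = "THISREPEATISREPEATED"
print(render(sequence, sequence))
for offset, start, length in direct_repeats(sequence):
    print(offset, start, sequence[start:start + length])
```

Each reported offset is the distance between the main diagonal and a parallel diagonal, so a single direct repeat appears as one run here and as one line on each side of the main diagonal in the plot.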

To use a DotPlot, the user needs a protein sequence and DotPlot software. Use the protein sequence of the cell wall surface anchor protein saved in the 3314Protein.FA file. Download ‘Dotter’, a DotPlot program. Dotter produces graphical output in which the repeats in the sequence are easy to visualise. To use Dotter, run it at the DOS prompt by typing “dotter 3314Protein.FA 3314Protein.FA” and pressing Enter. The direct repeats are visible as parallel lines, as shown next



3.6. Searching families, domains and sites in protein sequence

InterProScan Sequence Search, available at http://www.ebi.ac.uk/interpro/search/sequence-search, compares the input sequence with protein sequence signature databases to find domains and sites in the input protein sequence. In addition, it finds the protein family to which the sequence belongs.

The EF0710 protein of Enterococcus faecalis is known to contain domains and sites, and it also belongs to a protein family. Download the EF0710 protein sequence from the Enterococcus faecalis genome browser at http://microbes.ucsc.edu/, paste the sequence in the input text box of InterProScan, as shown next, and click the button to submit the search.

The ensuing results window shows that this protein belongs to a protein family and has domains as well as sites.



3.7. PeptideCutter

PeptideCutter, available at http://web.expasy.org/peptide_cutter/, searches an input protein sequence for the sites at which enzymes and/or chemicals cleave peptide bonds. The tool allows the user to select the enzymes and chemicals to be used, and offers display options for the predicted cleavage sites and the corresponding enzymes and chemicals. Simply paste the Glycophorin A sequence in the input text box and click the button to run the search. The results for the selected enzyme(s), trypsin in the present case, will then be displayed.
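As a rough illustration of such a cleavage-site search, the Python sketch below applies a simplified trypsin rule: cleave on the C-terminal side of Lys (K) or Arg (R) unless the next residue is Pro (P). This is an assumed textbook rule, not PeptideCutter's own model, which may treat exceptions differently; the function names are hypothetical.

```python
def trypsin_sites(sequence):
    """1-based positions after which the simplified trypsin rule cleaves.

    Rule used here: cleave after K or R, except when the next residue is P.
    """
    seq = sequence.upper()
    sites = []
    for i, residue in enumerate(seq[:-1]):   # no cleavage after the last residue
        if residue in "KR" and seq[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(sequence, sites):
    """Split the sequence into peptides at the given 1-based cleavage positions."""
    peptides, previous = [], 0
    for site in sites:
        peptides.append(sequence[previous:site])
        previous = site
    peptides.append(sequence[previous:])
    return peptides


GLPA = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAH"
    "EVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFGVMAGVIGTILLISYGIRRLIKK"
    "SPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
)

sites = trypsin_sites(GLPA)
print(sites)
for peptide in digest(GLPA, sites):
    print(peptide)
```

The resulting peptide list is the kind of theoretical digest that can be compared with peptide masses observed in a mass spectrometry experiment.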


4. Summary


In this module, students:

- Understood protein sequence analysis for Biochemistry and Molecular biology experiments
- Learnt to download annotated protein sequences from UniProtKB
- Computed various Biochemical parameters for a given protein sequence
- Learnt prediction of post-translational modifications
- Learnt prediction of signal peptide and transmembrane helices in a given protein sequence
- Learnt to download raw protein sequences using a genome browser
- Analysed a given protein sequence to find repeats using RADAR and visualized the presence of direct repeats using DotPlot
- Used InterProScan to find the family, domains, repeats and sites in a given protein sequence
- Used PeptideCutter to find the sites in an input protein sequence at which enzymes and/or chemicals cleave peptide bonds
