Biostatistics and Biochemistry – A Practical Example

Description of Module

Subject Name: Biochemistry

Paper Name: 13 Biostatistics and Bioinformatics

Module Name/Title: 10 Sequence Alignment – A Practical Example

Dr. Vijaya Khader Dr. MC Varadaraj

Biostatistics and Bioinformatics Biochemistry Sequence Alignment – A Practical Example

1. Objectives: This module aims to enable students:

1. To use different computer programs to create global and local sequence alignments for a pair of sequences
2. To analyse a real alignment of the phosphocarrier protein from Escherichia coli (a Gram-negative bacterium) and Enterococcus faecalis (a Gram-positive bacterium) to infer homology

2. Concept Map

Practical sequence alignment
    Pair-wise alignment programs
        Global alignment programs: EMBOSS Needle, EMBOSS Stretcher
        Local alignment programs: SSEARCH, BLAST
    A practical example of pairwise alignment of Hpr from E. coli and E. faecalis for inference of homology

3. Practical sequence alignment

To create a sequence alignment of real proteins, we need to download a pair of protein sequences. Several programs are available for aligning a pair of sequences, and we need to choose the most appropriate one. In addition, we need to decide on the scoring matrix and the gap penalty scheme to be used. Then we need to draw an inference of homology. All this is discussed below, first for a hypothetical pair of sequences and then for a real practical example.

3.1. Pair-wise sequence alignment programs

There are several sequence alignment programs for global and local alignments. Global alignment tools create an end-to-end alignment of the sequences to be aligned. EMBOSS Needle and EMBOSS Stretcher are freely available online for global alignments. Local alignment tools align only the best-matching regions of the two sequences, so the alignment may be shorter than either sequence. SSEARCH and BLAST are freely available online for local alignments.

3.1.1. EMBOSS Needle: http://www.ebi.ac.uk/Tools/psa/emboss_needle/ : EMBOSS Needle creates an optimal global alignment of two sequences using the Needleman-Wunsch algorithm with affine gap penalties only. Linear gap scoring is not implemented: it is useful for working through the basic dynamic programming by hand, whereas evolutionarily related sequences are aligned better with an affine gap penalty scheme. There are separate interfaces for aligning protein and nucleotide sequences; they differ only in the scoring scheme offered for the alignment.
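To make the difference between the two gap schemes concrete, the short sketch below compares the cost of a gap under linear and affine scoring. It assumes the common convention in which the opening penalty is charged for the first gap position and the extension penalty for every further position; the function names are illustrative only and are not part of EMBOSS.

```python
def linear_gap_cost(length: int, per_position: float) -> float:
    """Linear scheme: every gap position costs the same amount."""
    return length * per_position


def affine_gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Affine scheme (assumed convention): the first gap position costs
    gap_open and each additional position costs gap_extend."""
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    # One gap of five positions versus five separate gaps of one position.
    print("affine, one gap of 5  :", affine_gap_cost(5, gap_open=10, gap_extend=1))
    print("affine, five gaps of 1:", 5 * affine_gap_cost(1, gap_open=10, gap_extend=1))
    # Under linear scoring the two situations cost the same.
    print("linear, 5 positions   :", linear_gap_cost(5, per_position=10))
```

Under the affine scheme a single long gap is much cheaper than several short ones, which is closer to how insertions and deletions arise during evolution.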

STEP 1 is to enter the input sequences for alignment in the following interface:

STEP 2 is to set the parameters for scoring. The BLOSUM matrices from BLOSUM30 to BLOSUM90 can be selected in increments of 5; the default is BLOSUM62. Alternatively, the PAM matrices from PAM10 to PAM500 can be selected in increments of 10. The allowed gap open and gap extend penalties are also shown below:

Finally, in STEP 3, submit your job. Optionally, choose to be notified by email; otherwise the results of the alignment will appear in your web browser.

We will use EMBOSS Needle with a practical example at the end of this module.

3.1.2. EMBOSS Stretcher: http://www.ebi.ac.uk/Tools/psa/emboss_stretcher/ : EMBOSS Stretcher uses a modification of the Needleman-Wunsch algorithm that allows larger sequences to be globally aligned.
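The idea that makes large sequences manageable is that the dynamic programming score can be computed while keeping only two rows of the matrix in memory. The sketch below illustrates this for the global alignment score only; it assumes simple match/mismatch scores with a linear gap penalty and is not EMBOSS Stretcher's actual implementation (recovering the alignment itself in linear space needs an additional divide-and-conquer step).

```python
def nw_score_linear_space(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch global alignment score using two rows of memory."""
    # Row 0: prefixes of b aligned entirely against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: prefix of a aligned entirely against gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr  # the older row is no longer needed
    return prev[-1]


if __name__ == "__main__":
    print(nw_score_linear_space("GCATGCG", "GATTACA"))
```

Memory use grows with the length of one sequence rather than with the product of both lengths.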

STEP 1 is to enter the input sequences for alignment in an interface similar to that of EMBOSS Needle. STEP 2 is to set the parameters for scoring, as shown below:

EMBOSS Stretcher allows only a small range of gap opening (1 to 25) and gap extension (1 to 8) penalties, whereas EMBOSS Needle allows a wide range of gap opening (1 to 100) and gap extension (0.0005 to 10) penalties. The available output formats are the same as for EMBOSS Needle.

We will use EMBOSS Stretcher with a practical example at the end of this module.

3.1.3. SSEARCH: http://pir.georgetown.edu/pirwww/search/pairwise.shtml : SSEARCH gives local alignments between the full lengths of two sequences using the Smith-Waterman dynamic programming algorithm. It requires only that the two sequences be pasted in; no other input is needed. We will use SSEARCH with a practical example at the end of this module.
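The key difference from global alignment is that a local alignment score is never allowed to fall below zero, so the best-scoring subregion can start and end anywhere. The following minimal sketch of the Smith-Waterman score assumes simple match/mismatch scores and a linear gap penalty for readability; SSEARCH itself uses a substitution matrix and affine gaps, so this is an illustration rather than a reimplementation.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (score only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets a new local alignment start at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared HARPA core contributes to the local score.
    print(smith_waterman_score("XXHARPAYY", "ZZHARPAWW"))
```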

3.1.4. Align nucleotide sequences using BLAST: Blast2Seq, available at http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&BLAST_SPEC=blast2seq, allows the alignment of two sequences, the query sequence and the subject sequence.

It allows the scoring parameters to be set using the following match/mismatch scores and gap costs:

3.2. Interpretation of pairwise alignment for inferring homology

Let us align the following two short sequences: THISISAPRRTEINSEQVENCE and ITISANNTHERSEQVENCE

EMBOSS Needle at http://www.ebi.ac.uk/Tools/psa/emboss_needle/, used to align these two sequences with BLOSUM62 and an affine gap penalty (gap opening = 1 and gap extension = 1), produced the following alignment:

The EMBOSS Needle alignment (Needleman-Wunsch dynamic programming with BLOSUM62, gap opening penalty 1 and gap extension penalty 1) has 14 conserved positions (identities) and one conservative substitution (15 similarities in all), 3 semi-conservative substitutions and 5 gaps, over an alignment length of 23 with a score of 67.0.

Sequence identity, i.e. the presence of identical residues at corresponding positions in the alignment, is quantitative, and the percentage of identical residues over the length of the pairwise alignment can be used to infer homology. The figure below is an x-y graph showing three regions. In the green region, the combination of sequence identity and alignment length is high enough that it is safe to infer homology. In the yellow zone we need to be careful before inferring homology, and we may resort to statistical assessment. In the red zone it is not safe at all to infer homology.

In the present case, with 60.9% identity (14/23) over an alignment length of 23 residues, the alignment falls in the yellow twilight zone; we therefore turn to statistical assessment, which is frequently done by calculating a P-value. The P-value is given by $Ke^{-\lambda s}$, where $s$ is the alignment score. The parameters $K$ and $\lambda$ can be thought of simply as natural scales for the search space size and the scoring system respectively. The following table gives the ranges of P-values used to draw an inference of homology between two sequences.

P-value range                  Inference
$< 10^{-100}$                  Two sequences are identical
$10^{-100}$ - $10^{-50}$       Two sequences are nearly identical
$10^{-50}$ - $10^{-5}$         Two sequences have clear homology
$10^{-5}$ - $10^{-1}$          Two sequences have possible distant homology
$> 10^{-1}$                    Two sequences are randomly related; therefore, no homology
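The table of P-value ranges can be written as a small lookup function, sketched below. The function name is illustrative, and the handling of values that fall exactly on a boundary is an assumption, since the table does not say to which range a boundary value belongs.

```python
def infer_homology(p_value: float) -> str:
    """Map a P-value to the inference given in the table of P-value ranges.

    Boundary values are assigned to the weaker inference (an assumption).
    """
    if p_value < 1e-100:
        return "identical"
    if p_value < 1e-50:
        return "nearly identical"
    if p_value < 1e-5:
        return "clear homology"
    if p_value < 1e-1:
        return "possible distant homology"
    return "randomly related, no homology"


if __name__ == "__main__":
    for p in (1e-120, 1e-60, 9.68189e-07, 1e-3, 0.5):
        print(p, "->", infer_homology(p))
```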

Sequence alignment programs report the similarity score obtained with a particular scoring system and affine gap penalties. The P-value is given by $Ke^{-\lambda s}$. However, $K$ and $\lambda$ are not reported by sequence alignment programs. Therefore, we need to calculate $K$ and $\lambda$ for the specific scoring system with the specific affine gap penalties. $K$ and $\lambda$ can be calculated using the online program PRSS, http://www.ch.embnet.org/software/PRSS_form.html.

The result of pressing Run PRSS is shown below:

The P-value can then be calculated from the alignment score, $K$ and $\lambda$ to determine the statistical significance of the alignment and to infer homology. We can calculate the P-value using Microsoft Office Excel as shown below:

The result will appear as you enter the data and formula:

The value of $9.68189 \times 10^{-7}$ is between $10^{-50}$ and $10^{-5}$; therefore, the two sequences have clear homology, and the alignment can be represented as a homologous sequence alignment using symbols, as shown below:

3.3. A practical example

Phosphocarrier protein (Hpr) with a phosphorylated histidine is used in the group translocation of sugars through the cell membrane by the PEP-dependent phosphotransferase system (PTS). The histidine residue of phosphocarrier protein (HPr) accepts a phosphate group from PEP, which is ultimately transferred to the incoming sugar. In addition, HPr can be phosphorylated at a serine residue using ATP, to act as an intermediate in the signalling cascade that regulates transcription of genes related to the carbohydrate-response system. Both functions involve phosphorylation/dephosphorylation reactions, but at different sites. For sequence alignment in the present module, let us use two orthologous Hpr proteins: one from Escherichia coli K12 and the other from Enterococcus faecalis. Download each of these protein sequences as described in “module 03 Molecular Sequence Databases” and save them in separate files named EColiHpr.FA and EfaecalisHpr.FA respectively. These sequences in FASTA format are shown next:

>E. coli length=90
MTVKQTVEITNKLGMHARPAMKLFELMQGFDAEVLLRNDEGTEAEANSVI
ALLMLDSAKGRQIEVEATGPQEEEALAAVIALFNSGFDED
>E. faecalis length=88
MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV
MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
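Before aligning, it is worth checking that the saved sequences have the lengths stated in their header lines. The sketch below parses FASTA text with a small hand-written reader (an illustrative helper, not a library function) and reports each sequence length; the text is embedded as a string here, whereas in practice it would be read from EColiHpr.FA and EfaecalisHpr.FA.

```python
def parse_fasta(text: str) -> dict:
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped or contain stray spaces.
            records[name].append(line.replace(" ", ""))
    return {n: "".join(parts) for n, parts in records.items()}


FASTA_TEXT = """\
>E. coli length=90
MTVKQTVEITNKLGMHARPAMKLFELMQGFDAEVLLRNDEGTEAEANSVI
ALLMLDSAKGRQIEVEATGPQEEEALAAVIALFNSGFDED
>E. faecalis length=88
MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV
MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE
"""

if __name__ == "__main__":
    for header, seq in parse_fasta(FASTA_TEXT).items():
        print(header, "-> parsed length", len(seq))
```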

To visualize the extent of sequence identity between these two sequences, use the Dotter software. See “Dot plot analysis for visualizing repeats in protein sequences” in the chapter on “Protein Sequence Analysis”. To use Dotter, run it from the DOS prompt by typing “dotter EColiHpr.FA EfaecalisHpr.FA” and pressing Enter.

The DotPlot main window will appear, showing the plot with a lot of background noise. Use the context menu, which appears with a right mouse click (MS Windows), to display the window size tool and the GreyRamp tool. Set the window size to 30 and adjust the GreyRamp tool values to between 29 and 36, as shown next:

This plot shows that the two sequences have identities in the N-terminal and C-terminal regions, with a non-identical region in the middle of the sequence. The DotPlot alignment window shows the same pattern, with the identities highlighted in cyan.

The histidine at position 16, involved in the function of Hpr of E. coli, is aligned with the histidine at position 15 of Hpr of E. faecalis. However, this alignment has only 17 identities out of 88 aligned positions. This is because the alignment is continuous, i.e. it does not allow for insertions and deletions in the two sequences. Therefore, we need to align these proteins using specialized sequence alignment programs.
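The count of identities in a continuous (ungapped) alignment can be reproduced by walking along a single diagonal of the dot plot. The sketch below is an illustration of that idea and not Dotter's own algorithm; the function name and the `offset` parameter are assumptions. An offset of 1 places position 16 of the E. coli sequence opposite position 15 of the E. faecalis sequence.

```python
ECOLI = ("MTVKQTVEITNKLGMHARPAMKLFELMQGFDAEVLLRNDEGTEAEANSVI"
         "ALLMLDSAKGRQIEVEATGPQEEEALAAVIALFNSGFDED")
EFAECALIS = ("MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEYKGKSVNLKSIMGV"
             "MSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE")


def diagonal_identities(a: str, b: str, offset: int = 0):
    """Compare a[i + offset] with b[i] without gaps.

    Returns (number of identities, number of positions compared).
    The offset must be zero or positive.
    """
    n = min(len(a) - offset, len(b))
    matches = sum(1 for i in range(n) if a[i + offset] == b[i])
    return matches, n


if __name__ == "__main__":
    identities, compared = diagonal_identities(ECOLI, EFAECALIS, offset=1)
    print(f"{identities} identities over {compared} ungapped positions")
```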

Let us align these two sequences using the online EMBOSS Needle and EMBOSS Stretcher tools to compare global alignments. EMBOSS Needle at http://www.ebi.ac.uk/Tools/psa/emboss_needle/ produced the following output:

The alignment using EMBOSS Needle with BLOSUM62, opening gap 10 and extending gap 1 has 23 identities over 91 aligned positions (25.3%) with a score of 93. The 25.3% identity over an alignment length of 91 positions falls in the yellow twilight zone, as shown in the figure. Therefore, it is not safe to infer from the identity alone that the two sequences are homologous. Consequently, we need to evaluate the alignment by statistical testing for BLOSUM62 with opening gap 10 and extending gap 1.

Similarly, EMBOSS Stretcher at http://www.ebi.ac.uk/Tools/psa/emboss_stretcher/ with default parameters produced the following alignment:

The 23.3% identity over an alignment length of 90 positions in this case also falls in the yellow twilight zone. Therefore, it is not safe to infer from the identity alone that the two sequences are homologous. Consequently, we need to evaluate the alignment by statistical testing for BLOSUM62 with opening gap 10 and extending gap 1.

Statistical testing is undertaken by calculating the P-value given by $Ke^{-\lambda s}$. However, $K$ and $\lambda$ are not reported by some sequence alignment programs. Therefore, we need to calculate $K$ and $\lambda$ for the specific scoring system with the specific affine gap penalties. $K$ and $\lambda$ can be calculated using the online program PRSS, http://www.ch.embnet.org/software/PRSS_form.html. For BLOSUM62 with opening gap 10 and extending gap 1, Lambda = 0.2393 and K = 0.04886 were calculated. Using $Ke^{-\lambda s}$, the P-values for EMBOSS Needle with score 93 and EMBOSS Stretcher with score 80 were found to be $2.16161 \times 10^{-12}$ and $8.10861 \times 10^{-11}$, respectively. Since these P-values fall within the $10^{-50}$ to $10^{-5}$ range, the two sequences have clear homology. Consequently, it is statistically safe to infer that the two sequences are homologous.

The alignments produced above show approximately 25% sequence identity. The table given next shows that, for approximately 25% sequence identity, PAM200 may be appropriate.

PAM Number    % Sequence Identity    % Observed Mutations
1             99                     1
30            75                     25
40            69                     31
80            50                     50
110           40                     60
120           38                     62
200           25                     75
250           20                     80
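Choosing a matrix from this table amounts to finding the row whose sequence identity is closest to the identity observed in a first alignment. A minimal sketch of that lookup follows; the nearest-row rule and the function name are assumptions made for illustration, and the table values are copied from the table above.

```python
# (PAM number, % sequence identity) pairs from the table above.
PAM_TABLE = [
    (1, 99), (30, 75), (40, 69), (80, 50),
    (110, 40), (120, 38), (200, 25), (250, 20),
]


def closest_pam(percent_identity: float) -> int:
    """Return the PAM number whose tabulated identity is nearest."""
    pam, _identity = min(PAM_TABLE, key=lambda row: abs(row[1] - percent_identity))
    return pam


if __name__ == "__main__":
    # Identities observed with EMBOSS Needle (25.3%) and EMBOSS Stretcher (23.3%).
    for identity in (25.3, 23.3):
        print(identity, "% identity -> PAM", closest_pam(identity))
```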

Therefore, let us use PAM200 with open gap = 10 and extend gap = 1. EMBOSS Needle produced the following alignment:

With the same parameters (PAM200, open gap = 10, extend gap = 1), EMBOSS Stretcher produced the following alignment:

For PAM200 with open gap = 10 and extend gap = 1, PRSS (http://www.ch.embnet.org/software/PRSS_form.html) produced Lambda = 0.1311 and K = 0.03651, which are entered in an MS Excel worksheet with the formula as shown below:

Using $Ke^{-\lambda s}$, the P-value for EMBOSS Needle with score 134 was 8.57012E-10, and for EMBOSS Stretcher with score 124 it was 3.17941E-09.

Since these P-values fall within the $10^{-50}$ to $10^{-5}$ range, the two sequences have clear homology. Consequently, it is statistically safe to infer that the two sequences are homologous.
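The spreadsheet calculation can equally be done in a few lines of code. The sketch below evaluates $Ke^{-\lambda s}$ with the PAM200 values of Lambda and K quoted above; the function name is illustrative.

```python
import math


def p_value(score: float, k: float, lam: float) -> float:
    """P-value of an alignment score as used in this module: K * exp(-lambda * s)."""
    return k * math.exp(-lam * score)


if __name__ == "__main__":
    K_PAM200 = 0.03651     # from PRSS, PAM200, open gap 10, extend gap 1
    LAMBDA_PAM200 = 0.1311
    for program, score in (("EMBOSS Needle", 134), ("EMBOSS Stretcher", 124)):
        print(f"{program}: score {score}, P = {p_value(score, K_PAM200, LAMBDA_PAM200):.5E}")
```

The printed values can be compared directly with the Excel results quoted for the two programs.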

Now, let us change the parameters for EMBOSS Needle, using BLOSUM62 with open gap = 1 and extend gap = 1. It produced the following alignment of the Hpr proteins from E. coli and E. faecalis, with a score of 153.0 and an alignment length of 108, having introduced 18 gaps in Hpr from E. coli. The number of identities = 34/108 (31.5%).

E.coli_Hpr      1 MTVKQ--TV-EITNKLGMHARPAMKLFELMQ-G--FDAEVLLRN-D-EGT 42
                  |..|:  .| | |   |:||||| .|  |:| .  |::::   | : :|
E.faecalis_Hpr  1 MEKKEFHIVAE-T---GIHARPA-TL--LVQTASKFNSDI---NLEYKG- 39

E.coli_Hpr     43 EAEAN--SVIAL-LM-LDSAKGRQ-IEVEAT--GPQEE-EALAAVI-ALF 83
                  :: .|  | | : :| | .. | | .:|..|  | .:| |.:||:: .|
E.faecalis_Hpr 40 KS-VNLKS-I-MGVMSL-GV-G-QGSDVTITVDG-ADEAEGMAAIVETL- 81

E.coli_Hpr     84 -NSGFDED 90
                   ..|..|
E.faecalis_Hpr 82 QKEGLAE- 88
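The summary figures for this EMBOSS Needle alignment (alignment length, identities, gaps) can be recomputed directly from the two aligned rows. The sketch below does so with an illustrative helper; the aligned rows are copied from the alignment above, with '-' marking a gap.

```python
# Aligned rows from the EMBOSS Needle alignment (BLOSUM62, open gap 1, extend gap 1).
NEEDLE_EC = ("MTVKQ--TV-EITNKLGMHARPAMKLFELMQ-G--FDAEVLLRN-D-EGT"
             "EAEAN--SVIAL-LM-LDSAKGRQ-IEVEAT--GPQEE-EALAAVI-ALF"
             "-NSGFDED")
NEEDLE_EF = ("MEKKEFHIVAE-T---GIHARPA-TL--LVQTASKFNSDI---NLEYKG-"
             "KS-VNLKS-I-MGVMSL-GV-G-QGSDVTITVDG-ADEAEGMAAIVETL-"
             "QKEGLAE-")


def alignment_stats(row_a: str, row_b: str) -> dict:
    """Count columns, identities and gaps in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identities = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    return {
        "length": len(row_a),
        "identities": identities,
        "gaps_a": row_a.count("-"),
        "gaps_b": row_b.count("-"),
        "percent_identity": round(100 * identities / len(row_a), 1),
    }


if __name__ == "__main__":
    print(alignment_stats(NEEDLE_EC, NEEDLE_EF))
```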

Now, let us change the parameters for EMBOSS Stretcher, using BLOSUM62 with open gap = 1 and extend gap = 1. It produced the following alignment, with a score of 152.0 and an alignment length of 113, having introduced 23 gaps in Hpr from E. coli. The number of identities = 35/113 (31.0%).

E.coli_Hpr      1 MT-VKQ--TV-EITNKLGMHARPAMKLFELMQ-G--FDAEVLLRN-DE-- 40
                  |.  |:  .| | |   |:|||||. |  |:| .  |::::   |  |
E.faecalis_Hpr  1 MEK-KEFHIVAE-T---GIHARPAT-L--LVQTASKFNSDI---NL-EYK 38

E.coli_Hpr     41 GTEAEAN--SVIAL-LM-LDSAKGRQ--IEVEA-T--GPQEE-EALAAVI 80
                  | :: .|  | | : :| | .. | |   :| . |  |. :| |.:||::
E.faecalis_Hpr 39 G-KS-VNLKS-I-MGVMSL-GV-G-QGS-DV-TITVDGA-DEAEGMAAIV 78

E.coli_Hpr     81 -ALFNS-GF-DED 90
                   .| .. |.  |
E.faecalis_Hpr 79 ETL-QKEGLA-E- 88

The alignment between Hpr from E. coli and E. faecalis using EMBOSS Stretcher with BLOSUM62, opening gap 1 and extending gap 1 has 35 identities over 113 aligned positions (31.0%) with a score of 152. The 31% identity over an alignment length of 113 falls in the green safe zone, as shown in the figure. Therefore, it is safe to infer that the two sequences are homologous. Consequently, the functional knowledge available for Hpr from E. coli can be applied in understanding the structure and function of Hpr from E. faecalis.

In addition, the EMBOSS Needle (BLOSUM62, gap open 1, gap extend 1) alignment gives a P-value of $2.44533 \times 10^{-8}$ and the EMBOSS Stretcher (BLOSUM62, gap open 1, gap extend 1) alignment gives a P-value of $2.78815 \times 10^{-8}$. Since these P-values fall within the $10^{-50}$ to $10^{-5}$ range, the two sequences have clear homology, and it is safe to infer that the two sequences are homologous. In addition, the functional residues His-16 and Ser-48 in E. coli are aligned with His-15 and Ser-46 in E. faecalis. Therefore, the alignment appears to be a functional alignment and can be used for function assignment. However, the increase in the overlap length of the global alignment from 91 to 108 and 113 residues may be misleading for structure assignment. One of the major goals of sequence alignment is to find sequence identities and similarities, so as to provide a basis for functional conservation between the sequences based on structure conservation. Homology refers to similarity due to descent from a common ancestor, and indicates that what is known about the structure and function of one sequence can be applied to infer the structure and function of the other. Therefore, it is necessary to know whether the alignment itself is biochemically significant enough for assigning structure as well.
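Whether a functional residue of one protein sits opposite the corresponding residue of the other can be checked by counting residues along the aligned rows. The sketch below does this for the EMBOSS Stretcher alignment (BLOSUM62, open gap 1, extend gap 1), looking up the partners of His-16 and Ser-48 of the E. coli sequence; the helper name is an assumption made for illustration.

```python
# Aligned rows from the EMBOSS Stretcher alignment (BLOSUM62, open gap 1, extend gap 1).
STRETCHER_EC = ("MT-VKQ--TV-EITNKLGMHARPAMKLFELMQ-G--FDAEVLLRN-DE--"
                "GTEAEAN--SVIAL-LM-LDSAKGRQ--IEVEA-T--GPQEE-EALAAVI"
                "-ALFNS-GF-DED")
STRETCHER_EF = ("MEK-KEFHIVAE-T---GIHARPAT-L--LVQTASKFNSDI---NL-EYK"
                "G-KS-VNLKS-I-MGVMSL-GV-G-QGS-DV-TITVDGA-DEAEGMAAIV"
                "ETL-QKEGLA-E-")


def map_position(row_a: str, row_b: str, pos_a: int):
    """Find the column holding residue number pos_a (1-based) of row_a.

    Returns (residue_a, residue_b, pos_b); pos_b is None when the
    residue of row_a is aligned with a gap.
    """
    count_a = count_b = 0
    for x, y in zip(row_a, row_b):
        if x != "-":
            count_a += 1
        if y != "-":
            count_b += 1
        if x != "-" and count_a == pos_a:
            return x, y, (count_b if y != "-" else None)
    raise ValueError("position not found in first row")


if __name__ == "__main__":
    for pos in (16, 48):
        res_a, res_b, pos_b = map_position(STRETCHER_EC, STRETCHER_EF, pos)
        print(f"E. coli {res_a}{pos} is aligned with E. faecalis {res_b}{pos_b}")
```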

SSEARCH, available through the web interface at http://pir.georgetown.edu/pirwww/search/pairwise.shtml, reported a full-length alignment between these two sequences with an 87 amino acid overlap:

Therefore, the E-value of 2.1e-08 (i.e. $2.1 \times 10^{-8}$), which represents the P-value, is significant. The local alignment, with an overlapping length of 87 residues, was compared with the global alignment from EMBOSS Stretcher (BLOSUM62, open gap 10, extend gap 1), with an overlapping alignment length of 90. The sequences in the local alignment are labelled SW_EC for the Hpr sequence from E. coli and SW_EF for the Hpr sequence from E. faecalis. The sequences in the global alignment are labelled NW_EC for the Hpr sequence from E. coli and NW_EF for the Hpr sequence from E. faecalis.

              10        20        30        40        50
SW_EC MTVKQTVEITNKLGMHARPAMKLFELMQGFDAEVLLRNDEGTEAEANSVI
         |.  .|. . |.|||||  | .  . |.... |.  .|  .. .|..
SW_EF  MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEY-KGKSVNLKSIM

NW_EC MTVKQTVEITNKLGMHARPAMKLFELMQGFDAEVLLRNDEGTEAEANSVI
       ..|:...|..:.|:|||||..|.:....|::::.|.. :|......|::
NW_EF  MEKKEFHIVAETGIHARPATLLVQTASKFNSDINLEY-KGKSVNLKSIM

              60        70        80         90
SW_EC ALLMLDSAKGRQIEVEATGPQEEEALAAVI-ALFNSGFDED
      ... |  ..| .. . . | .| |..||.. .| . |. |
SW_EF GVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE

NW_EC ALLMLDSAKGRQIEVEATGPQEEEALAAVI-ALFNSGFDED 90
      .::.|...:|..:.:...|..|.|.:||:: .|...|..|
NW_EF GVMSLGVGQGSDVTITVDGADEAEGMAAIVETLQKEGLAE- 88

As shown above, the alignments keep the residues ‘H’ and ‘S’, which are important for function in the two bacteria, in aligned positions. In these alignments (using EMBOSS Needle with a high affine gap penalty and local alignment using SSEARCH), the shorter overlap length appears to indicate a biochemically relevant structure and, therefore, function. Therefore, the structural and functional knowledge of the Hpr protein in E. coli, obtained using experimental methods, can be applied to the Hpr protein in E. faecalis.

4. Summary

In this module we learnt:

- Using the online programs EMBOSS Needle and EMBOSS Stretcher for global sequence alignment, as well as SSEARCH for local alignment, between a pair of sequences.
- Inferring homology between two sequences from the percentage of identical residues over the length of the pairwise sequence alignment.
- Using PRSS to calculate the K and Lambda values required to calculate the P-value of a pairwise alignment.
- Deriving statistical significance from the calculated P-value to infer homology between two sequences.
- Deriving an inference of structural homology for a biochemically relevant structure and, therefore, function.
