Biochemistry Biostatistics and Bioinformatics Sequence Alignment

Sequence Alignment – A Practical Example

Description of Module

Subject Name: Biochemistry
Paper Name: 13 Biostatistics and Bioinformatics
Module Name/Title: 10 Sequence Alignment – a Practical Example

Dr. Vijaya Khader
Dr. MC Varadaraj

1. Objectives

This module aims to enable students:

1. To use different computer programs for creating global and local sequence alignments for a pair of sequences
2. To analyse a real alignment of the phosphocarrier protein from Escherichia coli (a Gram-negative bacterium) and Enterococcus faecalis (a Gram-positive bacterium) to infer homology

2. Concept Map

Practical sequence alignment
    Pair-wise alignment programs
        Global alignment programs: EMBOSS Needle, EMBOSS Stretcher
        Local alignment programs: SSEARCH, BLAST
    A practical example of pairwise alignment of HPr from E. coli and E. faecalis for inference of homology

3. Practical sequence alignment

To create a sequence alignment of real proteins, we first need to download a pair of protein sequences. Several programs are available for aligning a pair of sequences, and we need to choose the most appropriate one. In addition, we need to decide on the scoring matrix and the gap penalty scheme to be used. Finally, we need to draw an inference of homology from the alignment. All this is discussed below, first for a hypothetical pair of sequences and then for a real practical example.

3.1. Pair-wise sequence alignment programs

There are several sequence alignment programs for global and local alignments. Global alignment tools create an end-to-end alignment of the sequences to be aligned. EMBOSS Needle and EMBOSS Stretcher are freely available online for global alignments. Local alignment tools align only the most similar regions of the two sequences, so the resulting alignment is shorter.
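The difference between the two kinds of alignment can be illustrated with a small Python sketch. It is a minimal illustration only, not the code used by any of the programs named in this module. It assumes a simple match/mismatch score and a linear gap penalty (the values +1, -1, -1 and the function names are chosen here for illustration), and it returns only the optimal score, not the alignment itself. The global score is computed end to end in the manner of the Needleman-Wunsch algorithm; the local score may restart from zero at any cell, in the manner of the Smith-Waterman algorithm.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal end-to-end (global) alignment score of strings a and b."""
    # Row 0: aligning a prefix of b against nothing costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * gap]  # column 0: a prefix of a aligned against nothing
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]  # bottom-right cell: both sequences fully consumed


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over all pairs of substrings of a and b (local alignment)."""
    best = 0
    prev = [0] * (len(b) + 1)  # a local alignment may start anywhere
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best  # highest-scoring cell anywhere in the matrix


if __name__ == "__main__":
    s1, s2 = "TTACGTAA", "GGACGTCC"  # toy sequences sharing the core ACGT
    print("global:", global_score(s1, s2))
    print("local: ", local_score(s1, s2))
```

For these toy sequences the global score is 0, because the four matches in the shared core are cancelled by the four mismatched flanking positions, whereas the local score is 4, because only the shared core ACGT is reported.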
SSEARCH and BLAST are freely available online for local alignments.

3.1.1. EMBOSS Needle: http://www.ebi.ac.uk/Tools/psa/emboss_needle/

EMBOSS Needle creates an optimal global alignment of two sequences using the Needleman-Wunsch algorithm with an affine gap penalty only. Linear gap scoring is not implemented: a linear scheme is useful for working through basic dynamic programming by hand, whereas the actual alignment of evolutionarily related sequences is better achieved with an affine gap penalty scheme. There are separate interfaces for aligning protein and nucleotide sequences; the two differ only in the scoring scheme offered for aligning the sequences.

STEP 1: Enter the input sequences for alignment in the submission form.

STEP 2: Set the parameters for scoring. A BLOSUM matrix can be selected from BLOSUM30 to BLOSUM90 in increments of 5; the default is BLOSUM62. Alternatively, a PAM matrix can be selected from PAM10 to PAM500 in increments of 10. The gap open and gap extend penalties are also chosen in this step from the allowed values.

STEP 3: Submit your job. You may opt to be informed by email; otherwise the results of the alignment appear in your web browser.

We will use EMBOSS Needle with a practical example at the end of this module.

3.1.2. EMBOSS Stretcher: http://www.ebi.ac.uk/Tools/psa/emboss_stretcher/

EMBOSS Stretcher uses a modification of the Needleman-Wunsch algorithm that allows larger sequences to be globally aligned.

STEP 1: Enter the input sequences for alignment in an interface similar to the one used for EMBOSS Needle.
STEP 2: Set the parameters for scoring. EMBOSS Stretcher allows only a small range of gap opening (1 to 25) and gap extension (1 to 8) penalties, whereas EMBOSS Needle allows a wide range of gap opening (1 to 100) and gap extension (0.0005 to 10) penalties. The available output formats are the same as for EMBOSS Needle.

We will use EMBOSS Stretcher with a practical example at the end of this module.

3.1.3. SSEARCH: http://pir.georgetown.edu/pirwww/search/pairwise.shtml

The SSEARCH program gives local alignments between the full lengths of two sequences using the Smith-Waterman dynamic programming algorithm. It requires only that the two sequences be pasted in; no other input is needed.

We will use SSEARCH with a practical example at the end of this module.

3.1.4. Align nucleotide sequences using BLAST

Blast2Seq, available at http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&BLAST_SPEC=blast2seq, allows the alignment of two sequences, the query sequence and the subject sequence. Its scoring parameters, namely the match/mismatch scores and the gap costs, can be set by the user.

3.2. Interpretation of pairwise alignment for inferring homology

Let us align the following two small sequences:

THISISAPRRTEINSEQVENCE
ITISANNTHERSEQVENCE

The two sequences were aligned with EMBOSS Needle (http://www.ebi.ac.uk/Tools/psa/emboss_needle/) using BLOSUM62 and an affine gap penalty (gap opening = 1, gap extension = 1). The resulting Needleman-Wunsch alignment has 14 conserved positions (identities), one conservative mutation (giving 15 similarities), 3 semi-conservative mutations and 5 gaps, over an alignment length of 23, with a score of 67.0.

The identity between two sequences, that is, the presence of identical residues at corresponding positions of the alignment, is a quantitative measure. Homology can therefore be inferred from the percentage of identical residues over the length of the pairwise alignment.

The figure for this section is an x-y graph showing three regions. In the green region, alignments with the given sequence identity over the given alignment length are safe for inferring homology. In the yellow zone we need to be careful before inferring homology, and we may follow up with a statistical assessment. In the red zone it is not safe at all to infer homology.

In the present case, with 60.9% identity (14 of 23 positions) over an alignment length of 23 residues, the alignment falls in the yellow twilight zone. We therefore follow up with a statistical assessment, which is frequently done by calculating the P-value. The P-value is given by $P = K e^{-\lambda s}$, where $s$ is the alignment score. The parameters $K$ and $\lambda$ can be thought of simply as natural scales for the search space size and the scoring system respectively. The table in this section gives the ranges of P-values used to draw an inference of homology between two sequences.
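The arithmetic in this section can be sketched in Python. The sketch below is a minimal illustration under stated assumptions: the function names are invented here, the P-value uses the form $P = K e^{-\lambda s}$ as stated in this module, the category boundaries are those tabulated in this section (which side each boundary value belongs to is an assumption), and the values of K and lambda in the demonstration are placeholders that do not correspond to any real scoring system.

```python
import math


def percent_identity(identities, alignment_length):
    """Identical positions as a percentage of the alignment length."""
    return 100.0 * identities / alignment_length


def p_value(score, k, lam):
    """P = K * exp(-lambda * s), the form given in this module."""
    return k * math.exp(-lam * score)


def infer_homology(p):
    """Map a P-value onto the inference categories used in this module."""
    if p < 1e-100:
        return "identical"
    if p < 1e-50:
        return "nearly identical"
    if p < 1e-5:
        return "clear homology"
    if p < 1e-1:
        return "possible distant homology"
    return "randomly related, no homology"


if __name__ == "__main__":
    # 14 identities over an alignment length of 23, as reported in this section.
    print(f"identity: {percent_identity(14, 23):.1f}%")
    # K and lambda are PLACEHOLDERS, not values estimated for a real scoring system.
    k_demo, lam_demo = 0.1, 0.25
    p = p_value(67.0, k_demo, lam_demo)
    print(f"P = {p:.3g} -> {infer_homology(p)}")
```

Because K and lambda are placeholders, the printed P-value is not expected to reproduce the value obtained in this module; only the percent identity (60.9%) is taken directly from the alignment statistics.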
P-value range                      Inference
< $10^{-100}$                      Two sequences are identical
$10^{-100}$ to $10^{-50}$          Two sequences are nearly identical
$10^{-50}$ to $10^{-5}$            Two sequences have clear homology
$10^{-5}$ to $10^{-1}$             Two sequences have possible distant homology
> $10^{-1}$                        Two sequences are randomly related; therefore, no homology

Sequence alignment programs report the similarity score obtained with a particular scoring system and the affine gap penalties imposed. The P-value is given by $P = K e^{-\lambda s}$. However, $K$ and $\lambda$ are not reported by sequence alignment programs. We therefore need to calculate $K$ and $\lambda$ for the specific scoring system and the specific affine gap penalties used. $K$ and $\lambda$ can be calculated using the online program PRSS (http://www.ch.embnet.org/software/PRSS_form.html); pressing Run PRSS returns the result.

With the alignment score, $K$ and $\lambda$, the P-value can be calculated to assess the statistical significance of the alignment and so to infer homology. We can calculate the P-value using Microsoft Office Excel; the result appears as you enter the data and the formula.

The value of $9.68189 \times 10^{-7}$ lies between $10^{-50}$ and $10^{-5}$. Therefore, the two sequences have clear homology, and the alignment can be represented as a homologous sequence alignment using symbols.

3.3. A practical example

Phosphocarrier protein (HPr) with a phosphorylated histidine is used in the group translocation of sugars through the cell membrane by the PEP-dependent phosphotransferase system (PTS). The histidine residue of HPr accepts a phosphate group from PEP, which is ultimately transferred to the incoming sugar.
In addition, HPr can be phosphorylated at a serine residue using ATP, to act as an intermediate in the signalling cascade that regulates transcription of genes related to the carbohydrate-response system. Both functions involve phosphorylation/dephosphorylation reactions, but at different sites.

For sequence alignment in the present module, let us use two orthologous HPr proteins: one from Escherichia coli K12 and the other from Enterococcus faecalis. Download each of these protein sequences as described in "module 03 Molecular Sequence Databases" and save them in separate files named EColiHpr.FA and EfaecalisHpr.FA respectively. These sequences in FASTA format are shown next:
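Once the two files are saved, their contents can also be checked programmatically before they are submitted to an alignment program. The sketch below is a minimal FASTA reader that uses only the Python standard library. The function name read_fasta is invented here, and the record in the demonstration is a made-up placeholder, not the actual HPr sequence.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # header line without the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    # Join the wrapped sequence lines of each record into one string.
    return {h: "".join(parts) for h, parts in records.items()}


if __name__ == "__main__":
    # Placeholder record for illustration only; NOT a real HPr sequence.
    demo = ">demo_record hypothetical\nMKTAYIAK\nQRQISFVK\n"
    for name, seq in read_fasta(demo).items():
        print(name, len(seq), seq)
```

To apply it to the saved files, read the file contents first, for example with open("EColiHpr.FA") followed by read_fasta on the text that was read; the file names are those given in this module.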
