Multiple Sequence Alignment

Bioinformatics Algorithms
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Outline
• Motivation
• Algorithms
  • Scoring functions
  • exhaustive
    • multidimensional dynamic programming
  • heuristics
    • progressive alignment
    • iterative alignment/refinement
    • block(local)-based alignment

Multiple sequence alignment (MSA)
• Goal of MSA is to find an “optimal” mapping of a set of sequences
• Homologous residues (originating in the same position in a common ancestor) among a set of sequences are aligned together in columns
• Usually employs multiple pairwise alignment (PA) computations to reveal the evolutionarily equivalent positions across all sequences

Motivation
• Distant homologues
  • faint similarity can become apparent when present in many sequences
  • motifs might not be apparent from pairwise alignment only
• Detection of key functional residues
  • amino acids critical for function tend to be conserved during evolution and therefore can be revealed by inspecting sequences within a given family
• Prediction of secondary/tertiary structure
• Inferring evolutionary history

Representation of MSA
• Column-based representation
• Profile representation (position specific scoring matrix)
• Sequence logo

Manual MSA
• High quality MSA can be carried out by hand using expert knowledge
  • specific columns
  • highly conserved residues
  • buried hydrophobic residues
  • secondary structure (especially in RNA alignment)
  • expected patterns of insertions and deletions
• Tedious, but
  • high-quality source of family information
  • a benchmark for evaluation of automatic MSA algorithms
• BAliBASE
  • https://lbgi.fr/balibase/
• PROSITE
  • http://prosite.expasy.org/
• Pfam
  • http://pfam.sanger.ac.uk/
• TIGRFAM
  • http://www.jcvi.org/cgi-bin/tigrfams/index.cgi
• … (some databases are semi-automatic and many of the databases construct the MSA from the structure information)

Scoring
• How to score an MSA?
$$S(A) = G + \sum_i CS(A_i)$$
• $A_i$ … $i$-th column
• $CS(A_i)$ … score of the $i$-th column
• $G$ … gap function (assumes linear or constant gap penalty)
• the score assumes independent columns
• Two score types are usually considered
  • minimum entropy (ME)
  • sum of pairs (SP)

Minimum entropy (1)
• ME aims to minimize the entropy of each column
  • columns with low entropy (expressible with only a few bits) are good for the alignment
  • the more bits we need to express a column, the more diverse the column is

Minimum entropy (2)
• Probability of a column
  • assumes independence between columns and between residues within a column
$$P(A_i) = \prod_a p_{ia}^{c_{ia}}$$
• $c_{ia}$ … observed count of residue $a$ in the $i$-th column
$$c_{ia} = \sum_j \begin{cases} 1 & A_i[j] = a \\ 0 & A_i[j] \neq a \end{cases}$$
• $A_i[j]$ … $j$-th symbol in the $i$-th column
• $p_{ia}$ … probability of residue $a$ in column $i$
$$C_{ME}(A_i) = -\sum_a c_{ia} \log p_{ia} \qquad ME = \sum_i C_{ME}(A_i)$$
• a completely conserved column would score 0

Sum of pairs
• Sum of the scores of all possible pairs in a multiple alignment $A$ for a particular scoring matrix
• The score of each column is computed as the sum over all pairs of positions in that column
• Column scores are then summed to get the SP-score
$$SP(A) = \sum_{i=1}^{|A|} C_{SP}(A_i) = \sum_{i=1}^{|A|} \sum_{k<l} \sigma(A_i[k], A_i[l])$$
• $A_i[k]$ … $k$-th symbol in the $i$-th column
• $\sigma(x, y)$ … PAM or BLOSUM value for the residues $x$ and $y$

SP - Example
• BLOSUM 62 scoring matrix

  sequence 1       G    K    N
  sequence 2       T    R    N
  sequence 3       S    H    E
  column score    -1   +1   +6     SP = 6

SP score drawback
• Alignment of $N$ sequences, all containing leucine at a given position for functional reasons
• BLOSUM62 matrix: $\sigma(L, L) = 4 \rightarrow SP(A_i) = 4 \times N(N-1)/2$
• Let us replace one of the leucines with glycine (incorrect alignment): $\sigma(L, G) = -4 \rightarrow$ the score decreases by $8 \times (N-1)$
• $SP(A_i)$ is worse by a fraction of
$$\frac{8 \times (N-1)}{4 \times N(N-1)/2} = \frac{4}{N}$$
• The relative difference in score between the correct and the incorrect alignment decreases with the number of sequences in the alignment
• BUT increasing the number of sequences (evidence) should give us a larger relative difference

Multidimensional dynamic programming (1)
• Generalization of pairwise dynamic programming
• 3 sequences: ATGC, AATC, TTGC

  x coordinate   0 1 1 2 3 4     A - T G C
  y coordinate   0 1 2 3 3 4     A A T - C
  z coordinate   0 0 1 2 3 4     - T T G C

• Resulting path
  • (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

Multidimensional dynamic programming (2)
• Let us assume a linear gap penalty model (not affine)
  • $\gamma(g) = gd$ for a gap of length $g$ and gap cost $d$
• initialization and backtracking are analogous to the 2D case

Multidimensional dynamic programming (3)
• 2 sequences: 3 edges enter each cell
• 3 sequences: 7 edges enter each cell ($2^N - 1$ in general)

Computational complexity of MDP
• Computation of each cell of the DP matrix takes $2^N - 1$ steps (all possible combinations of gaps in a column)
• Let us assume all the sequences have approximately the same length $L$
• Memory complexity $O(L^N)$
• Time complexity $O(2^N L^N)$

MDP - exercise
• Let’s have sequences of length 50
• Comparison of a pair of sequences using DP takes 0.1 s
• What is the time needed to compare 4 sequences?
• Let’s say we have 1000 years and the average sequence length is 50.
• How many sequences can we afford to compare?
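One way to approach the exercise above is to assume that the running time is exactly proportional to the $2^N L^N$ bound from the complexity slide and to calibrate the constant on the pairwise case (0.1 s for N = 2, L = 50). The Python sketch below does that. The function names are hypothetical, and the proportionality assumption ignores constant factors and memory limits, so the results are order-of-magnitude estimates only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def mdp_work(n_seqs: int, length: int) -> int:
    """Work estimate 2^N * L^N taken from the time-complexity bound."""
    return (2 ** n_seqs) * (length ** n_seqs)


def estimated_seconds(n_seqs: int, length: int = 50,
                      pair_seconds: float = 0.1) -> float:
    """Scale the measured pairwise time by the ratio of work estimates.

    Assumption: running time is exactly proportional to 2^N * L^N.
    """
    ratio = mdp_work(n_seqs, length) / mdp_work(2, length)
    return pair_seconds * ratio


def max_sequences(budget_seconds: float, length: int = 50,
                  pair_seconds: float = 0.1) -> int:
    """Largest N (at least 2) whose estimated time fits into the budget."""
    n = 2
    while estimated_seconds(n + 1, length, pair_seconds) <= budget_seconds:
        n += 1
    return n


if __name__ == "__main__":
    print("4 sequences:", estimated_seconds(4), "s")
    print("max N in 1000 years:", max_sequences(1000 * SECONDS_PER_YEAR))
```

Under these assumptions, four sequences take on the order of 1000 s (about 17 minutes), and a 1000-year budget covers at most 7 sequences of length 50, since 8 sequences would already need roughly 3000 years.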
Heuristic Algorithms
• Progressive alignment methods
  • iterative building of the alignment
  • Feng & Doolittle
  • ClustalW, Clustal Omega
• Consistency-based methods
  • T-Coffee
• Iterative refinement
  • alignment built and then refined by realigning the constituent sequences
  • Barton & Sternberg
• Block-based alignment
  • local alignment built by identifying and assembling blocks of ungapped MSA
  • DIALIGN
• Mix of approaches
  • MAFFT, MUSCLE

Progressive alignment
• Framework
  • First, two sequences are aligned using standard pairwise alignment
  • The remaining sequences are taken one by one and aligned to the previous ones
  • Repeated until all sequences are aligned
• Parameters
  • The order in which the sequences are aligned
  • Whether only one alignment is kept and sequences are added to it, or whether an alignment can also be aligned to another alignment (as if a tree were being built)
  • The process used to align and score sequences or alignments against the existing ones

Star alignment
• N sequences $s_1, \dots, s_N$ to be aligned
1. Pick $s_i$ as a starting sequence (the center)
2. Compute all optimal global alignments between $s_i$ and $s_j$, $j \neq i$
3. Successively merge sequences into the arising MSA
  • once a gap always a gap rule
  • if a gap is introduced into the MSA it stays there forever

SA - example (1)
  S1: ATTGCCATT
  S2: ATGGCCATT
  S3: ATCCAATTTT
  S4: ATCTTCTT
  S5: ATTGCCGATT

  Pairwise alignments with S1 as the center:
  ATTGCCATT      ATTGCCATT--      ATTGCCATT      ATTGCC-ATT
  ATGGCCATT      ATC-CAATTTT      ATCTTC-TT      ATTGCCGATT

  Resulting MSA:
  ATTGCC-ATT--
  ATGGCC-ATT--
  ATTGCCGATT--
  ATCTTC--TT--
  ATC-CA-ATTTT

credit: Xingquan Zhu, Florida Atlantic University

SA - example (2)
      pairwise alignment     multiple alignment
  1.  ATTGCCATT              ATTGCCATT
      ATGGCCATT              ATGGCCATT

  2.  ATTGCCATT--            ATTGCCATT--
      ATC-CAATTTT            ATGGCCATT--
                             ATC-CAATTTT

  3.  ATTGCCATT              ATTGCCATT--
      ATCTTC-TT              ATGGCCATT--
                             ATC-CAATTTT
                             ATCTTC-TT--

  4.  ATTGCC-ATT             ATTGCC-ATT--
      ATTGCCGATT             ATGGCC-ATT--
                             ATC-CA-ATTTT
                             ATCTTC--TT--
                             ATTGCCGATT--

SA - choosing the center
• Compute all pairwise alignments and pick the sequence $s_i$ with maximum $\sum_{j \neq i} s(s_i, s_j)$
  • i.e. choose the sequence which is most similar to all the rest
• Or compute all pairwise alignments, compute an MSA for every $s_i$ and pick the best one

SA - time complexity
• Average sequence length $L$
• One global alignment computation in $O(L^2)$
• $k$ sequences → $O(k^2 L^2)$ for the pairwise computations
• $l$ … upper bound on the MSA length → $O(lk)$ for the MSA construction
$$O(k^2 L^2 + lk) = O(k^2 L^2)$$

SA - exercise
• Compute SP for the constructed MSA
• Compute SA for the previous example but add the sequences to the MSA in a different order. Does the order of addition impact the score?
• Compute the MSA starting with S5. Does the score change?

  Pairwise alignments with S5 as the center:
  ATTGCC-ATT     ATGGCC-ATT     AT--CCAATTTT     AT--CTTCTT
  ATTGCCGATT     ATTGCCGATT     ATTGCCGATT--     ATTGCCGATT

Feng & Doolittle (1)
1. Calculate a distance matrix from all-to-all pairwise alignments ($N(N-1)/2$)
2. Convert raw alignment scores into (evolutionary) distances
$$D = -\log S_{eff} \times 100 = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}} \times 100$$
  • $S_{max}(a, b) = \frac{S(a,a) + S(b,b)}{2}$
  • $S_{rand}$ is an expected score obtained by randomization
  • $S_{eff}$ can be viewed as a normalized percentage similarity which decreases roughly exponentially to 0 with increasing evolutionary distance
  • $-\log$ makes the measure linear with evolutionary distance
3. Construct a guide tree from the distance matrix using the Fitch & Margoliash algorithm
4. Align child nodes of each parent (can be sequence-sequence, sequence-MSA, MSA-MSA) in the order they were added to the tree

source: Feng, Da-Fei, and Russell F. Doolittle. "Progressive sequence alignment as a prerequisite to correct phylogenetic trees." Journal of molecular evolution 25.4 (1987): 351-360.
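The score-to-distance conversion in step 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the natural logarithm, reads the formula as $(-\log S_{eff}) \times 100$, and takes $S_{max}$ as the mean of the two self-alignment scores. The function names and the example scores are hypothetical.

```python
import math


def s_max(score_aa: float, score_bb: float) -> float:
    """Mean of the two self-alignment scores S(a,a) and S(b,b)."""
    return (score_aa + score_bb) / 2


def feng_doolittle_distance(s_obs: float, s_rand: float,
                            score_aa: float, score_bb: float) -> float:
    """D = -log(S_eff) * 100 with S_eff = (S_obs - S_rand) / (S_max - S_rand).

    Assumptions of this sketch: natural logarithm, factor 100 applied
    outside the logarithm.
    """
    s_eff = (s_obs - s_rand) / (s_max(score_aa, score_bb) - s_rand)
    if s_eff <= 0:
        # An observed score no better than random has no finite distance;
        # raising an error is a choice of this sketch.
        raise ValueError("S_obs must exceed S_rand")
    return -math.log(s_eff) * 100


if __name__ == "__main__":
    # Hypothetical scores: self-alignments 50 and 54, observed 30, random 8.
    print(feng_doolittle_distance(30, 8, 50, 54))
```

Identical sequences give $S_{obs} = S_{max}$, so $S_{eff} = 1$ and the distance is 0. As $S_{obs}$ approaches $S_{rand}$ the distance grows without bound, which is why the sketch rejects $S_{eff} \le 0$.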
Feng & Doolittle (2)
• Sequence-sequence: aligned using classical dynamic programming
• Sequence-MSA: the sequence is aligned with each sequence in the group, and the highest scoring alignment defines how the sequence is added to the group
• MSA-MSA: as in the previous case, but all pairs of sequences are tested
• When a sequence is added to a group, a neutral symbol X is introduced in place of each gap position
  • allows gap positions to be aligned
  • neutral: anything aligned with X scores 0
  • side effect: the gaps in two MSAs tend to come together in the resulting MSA

Profile/MSA Alignment
• When adding a sequence to a group, it is desirable to take into account the MSA built so far
  • mismatches at highly conserved positions should be penalized more
• 2 MSAs (profiles) of $N$ sequences in total, one with sequences $1 \dots n$, the second with sequences $n+1 \dots N$
• $A_i^k$ … symbol of the $k$-th sequence in the $i$-th column (i.e. $A_i[k]$)
$$\sum_i S(A_i) = \sum_i \sum_{k<l\le N} \sigma(A_i^k, A_i^l)$$
$$= \sum_i \sum_{k<l\le n} \sigma(A_i^k, A_i^l) + \sum_i \sum_{n<k<l\le N} \sigma(A_i^k, A_i^l) + \sum_i \sum_{k\le n,\ n<l\le N} \sigma(A_i^k, A_i^l)$$
• The score $\sum_i \sum_{k<l\le N} \sigma(A_i^k, A_i^l)$ consists of the in-group scores plus the between-group scores
  • when aligning the profiles we can use standard dynamic programming where columns are aligned against columns using the between-group scores
  • → uses position-specific information from the group’s multiple alignment

ClustalW
• Similar to Feng & Doolittle but uses profile-based building
1.