Practical Use of HMMs and Protein Sequence Databases

Temple University, Structural Bioinformatics II
Allan Haldane, Ron Levy Group

Outline
Part 1: HMMs for proteins
● Practical HMM usage: HMMER, E/p-values and limitations
● HMMs for family alignment and for secondary structure prediction
Part 2: Protein Sequence Variation
● Tour of the protein sequence space: Pfam, Interpro, S*"P
● Protein Evolution: Observations
References:
● Biological sequence analysis
● HMMER User guide
● Fundamentals of Molecular Evolution (Graur/Li)
● The Neutral Theory of Molecular Evolution (Kimura)

Review: HMMs (Hidden Markov Models)
● System moves through a series of "hidden" states with specified transition probabilities
● System "emits" observables, with different emission probabilities for each state.
[Figure: hidden states and observables. Examples: the occasionally dishonest casino; weather/behavior (Wikipedia)]

Important Algorithms:
● Forward algorithm: Compute the probability of a sequence of observables, summing over all possible paths through hidden states
● Viterbi algorithm: Compute the most likely path through hidden states, given a sequence of observables
● Baum-Welch algorithm: Given a set of observed sequences and a topology, estimate emission and transition probabilities (build an HMM).

There are many possible choices of "topology" for HMMs, depending on the application. We will focus on two types:
1. Secondary-Structure HMMs (SAM, OSS-HMM, TMHMM)
2. Profile HMMs (HMMER) (for multiple sequence alignments)

Simple Transmembrane Helix HMM
Want to model which parts of a protein sequence are transmembrane helices.
Simple Two-State Model: > = Everything else, M = TM helix
[Figure: emission probabilities and transition probabilities of the two-state model]
Can improve with more complex topologies.
Hidden Markov Models for Prediction of Protein Features. Christopher Bystroff and Anders Krogh. Methods in Molecular Biology, vol. 413

TMHMM
"Real" high performance HMMs can be much more complicated.
"Correctly predicts 97-98 % of transmembrane helices." Used to estimate that 20-30 % of all genes are TM proteins.
[Figure: posterior probabilities for a single sequence]
Predicting Transmembrane Protein Topology with a Hidden Markov Model: Application to Complete Genomes. Krogh et al. JMB 2001

HMMs for Secondary Structure Prediction
● Simplest idea: 3 state topology: Helix, Coil, Strand
● Real World: 36 hidden state HMM (OSS-HMM)
● Topology not chosen by hand: An algorithm tries topologies to find the one with the best match with DSSP predictions
Similar performance to PSIPRED.
[Figure: Insights into secondary structure patterns; axis: % correct predictions]
Analysis of an optimal hidden Markov model for secondary structure prediction. Martin et al, BMC Structural Biology 2006

Profile HMMs (HMMER)
HMMs used to model and align Multiple Sequence Alignments
● Topology of a profile HMM
● Building the HMM
● Using the HMM to align sequences
● Using the HMM to search for sequences in a database
● Tool: hmmer
● Practical challenges to building a profile HMM given an MSA

Profile HMMs (HMMER): Review of Topology
● Delete states (gap)
● Insert states (emit aa, with unbiased probabilities)
● Match states (emit aa, with specific probabilities)

Profile HMMs (HMMER): Example of a profile HMM with all values illustrated
[Figure: weight of line = transition probability; unaligned (insert) residues marked]

What you can do with profile HMMs:
Basic (Given a Model):
● Align sequences to a known model:
  ● Find the most probable alignment of an unaligned sequence (Viterbi)
● Score sequences:
  ● Compute the probability that a sequence was generated by the HMM
● Search for matching sequences:
  ● Score all sequences in a database, pick only the high scoring ones
● Compute site posterior probability:
  ● Reflects the degree of confidence in each individual aligned residue
Higher-level (No initial model):
● Build a model given an aligned MSA
  ● Baum-Welch algorithm (iterates forward/backward algorithms)
● Build a model given unaligned sequences
  ● Iteratively run Baum-Welch, and re-align sequences with it
● Iterative model building/search
  ● Iteratively run Baum-Welch, and collect new sequences from a database using it.

Scoring Sequences
Given an observed sequence, what is the probability that it was generated by the model?
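As a minimal sketch of this question on a toy model, the Python below computes both the probability summed over all hidden paths (forward algorithm) and the probability of the single best path (Viterbi algorithm). It uses a two-state "occasionally dishonest casino"; every probability value and every function name here is an illustrative assumption, not something taken from the slides or from HMMER.

```python
# Toy "occasionally dishonest casino": hidden states Fair (F) / Loaded (L),
# observables are die faces 1-6. All numbers are illustrative assumptions.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},
}


def forward_probability(obs):
    """P(obs | model), summed over all hidden-state paths."""
    f = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        f = {s: EMIT[s][x] * sum(f[r] * TRANS[r][s] for r in STATES)
             for s in STATES}
    return sum(f.values())


def viterbi(obs):
    """Most likely hidden path and its joint probability P(obs, path | model)."""
    v = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        ptr, nxt = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            ptr[s] = best
            nxt[s] = v[best] * TRANS[best][s] * EMIT[s][x]
        back.append(ptr)
        v = nxt
    last = max(STATES, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), v[last]


rolls = [1, 3, 6, 6, 6, 6, 2, 6, 6, 4]
p_forward = forward_probability(rolls)
best_path, p_viterbi = viterbi(rolls)
print(best_path, p_viterbi, p_forward)
```

The Viterbi probability can never exceed the forward probability, because the best path is one term of the sum over all paths. For long sequences a real implementation would work in log space or rescale at each step to avoid numerical underflow; this sketch skips that for brevity.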
We have two types of probabilities:
[Figure: observed sequence, model, Viterbi path]
● We can calculate the probability of a sequence through the most likely path through hidden states (Viterbi algorithm)
● Or, we can calculate the probability of a sequence averaging over all paths (forward algorithm)
In practice we use the second: "To maximize power to discriminate true homologs from nonhomologs in a database search, statistical inference theory says you ought to be scoring sequences by integrating over alignment uncertainty, not just scoring the single best alignment."
We can compare this to the probability that it would be generated by a "random" model:
● Null model (random aa, mean residue frequencies in Swiss-Prot 50.8)

Scoring Sequences: Log Odds Ratio
$$\text{log-odds ratio} = \log \frac{P(\text{sequence} \mid \text{HMM})}{P(\text{sequence} \mid \text{null})}$$
Numerator: the probability of a sequence averaging over all paths (forward algorithm). Denominator: the probability of the sequence in the null model (random aa).
The "Log odds ratio"
● Standard way of testing model discriminatory power
● Values >> 1 imply the sequence was likely generated by the HMM
● We take the log to avoid numerical difficulties (numbers can be very small)
● When the log is base 2, the value has units called "bits".
● Log-odds ratio is also called the "sequence bit score"
● Can be efficiently computed for HMMs using a modified Forward algorithm

Scoring Sequences: E-Values
The "Log odds ratio" ("sequence bit score")
This gives us a score for each individual sequence.
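The bit score for one sequence can be sketched in a few lines of Python. This is a deliberately simplified illustration, not HMMER's scoring code: the "profile" has only match states (no insert or delete states, so there is exactly one path and forward and Viterbi coincide), the null model is assumed uniform rather than Swiss-Prot residue frequencies, and the names `column`, `PROFILE`, and `bit_score` as well as all probabilities are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: uniform background instead of Swiss-Prot mean residue frequencies.
NULL = {aa: 1 / 20 for aa in AMINO_ACIDS}


def column(favored):
    """One match-state emission distribution: favored residues get the stated
    probability, all other residues share the remaining mass equally."""
    rest = (1.0 - sum(favored.values())) / (len(AMINO_ACIDS) - len(favored))
    return {aa: favored.get(aa, rest) for aa in AMINO_ACIDS}


# Hypothetical three-column, match-only profile.
PROFILE = [column({"W": 0.8}), column({"H": 0.5, "I": 0.3}), column({"G": 0.9})]


def bit_score(seq):
    """log2 [ P(seq | profile) / P(seq | null) ] for an ungapped full-length match."""
    if len(seq) != len(PROFILE):
        raise ValueError("sequence length must equal the number of profile columns")
    # Summing per-column log ratios avoids multiplying many small probabilities.
    return sum(math.log2(col[aa] / NULL[aa]) for col, aa in zip(PROFILE, seq))


print(bit_score("WHG"))  # positive: better explained by the profile
print(bit_score("AAA"))  # negative: better explained by the null model
```

Working with sums of logarithms rather than products of probabilities is what keeps the computation numerically stable, which is the reason the log is taken in the first place.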
But if you are testing many, many sequences to find matches, some will still have high scores by chance.
E-values: Expected number of sequences having this score or more, when looking in a dataset of size N.
Theory: One can work out that the "null" distribution of bit scores is exponential. This is a way to estimate whether a high score appeared by chance due to having a large dataset. It is necessary to calibrate mu/tau.

Scoring Sequences: Summary
Given an HMM, and a sequence database of size N to search through, we can calculate:
● A log-odds ratio (bit score) for each sequence. An odds ratio > 1 (log-odds > 0) means the sequence is more likely to be generated by the HMM than by the null model.
● An E-value: The number of times we expect a sequence with this score by chance. E-value < 1 means that on average we don't expect this strong a score by chance given our dataset size, so it is likely a homolog. Commonly E < 0.01 is used as a significance cutoff. By setting a threshold on the E-value, we can filter the database to pick out the homologous sequences.
● Bonus computation: Per-site posterior probability annotation. Probability that emitted observable i came from state k, averaging over all possible paths. Computable from forward/backward equations (see Biological sequence analysis, equation 3.16).

Iterative HMM Algorithms

Building an HMM given an aligned MSA
You already know this: The Baum-Welch Algorithm
Idea: Given a trial HMM, use forward/backward algorithms to average over all possible alignments of all sequences in the MSA. Add up the observed emissions and transitions seen in the data with these probabilities, and update the model with these observed values. Repeat.

Building an HMM given an unaligned MSA
Idea: Similar to the Baum-Welch algorithm, except that we re-align the MSA using the current trial HMM each round of iteration. This will change which positions in each sequence count as match vs insert states. Have to be careful not to get stuck in local optima.

Building an HMM given a seed MSA and a database
Idea: Given a seed alignment, construct an initial HMM. Use this HMM to search the database for more homologous sequences, giving a bigger, more diverse MSA. Fit a new HMM to this MSA. Repeat.

HMMER
● Free software for using/building profile HMMs
● HMMER3 is highly optimized. Fast!
[Figure: HMMER example run]

Profile HMMs: Practical issues
1. Local vs Global Alignments
2. Choosing Match States
3. Phylogenetic Weighting
4. Pseudocounts
5. Filtering Sequences

1. Local vs Global Alignments
HMM for global alignment: Try to match all residues in the query sequence to the model.
WIGDEVAVKAARHDDEDIS
WHGD-VAVKILKVVDPTPE
HMM for local alignment: Allows unmatched states at the beginning and end, so it can find a match within a longer sequence, or a partial sequence.
EVAVKAARHDDEDISvykig
WHGD-VAVKILKVVDPTPE
[Figure: flanking model states and switching points]

2. Choosing Match States
When building a new HMM given a seed MSA, you have to make a decision: Which columns will count as match states, and which as insert states? (model construction)
● manual construction: The user marks alignment columns by hand.
● heuristic construction: A rule is used to decide whether a column should be marked.
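The slides do not say which rule the heuristic uses. One common choice, assumed here purely for illustration, is to mark a column as a match state when fewer than half of the sequences have a gap in it. The sketch below implements that rule; the function name `choose_match_columns`, the `max_gap_fraction` parameter, and the four-row toy alignment (built from prefixes of the sequences shown above plus one invented row) are all hypothetical.

```python
def choose_match_columns(msa, max_gap_fraction=0.5):
    """Return the indices of alignment columns to treat as match states.

    A column is a match column when its fraction of gap characters is
    below max_gap_fraction (an assumed threshold); the remaining columns
    would be modeled by insert states.
    """
    if not msa:
        raise ValueError("alignment is empty")
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    match_columns = []
    for j in range(width):
        gaps = sum(row[j] in "-." for row in msa)
        if gaps / len(msa) < max_gap_fraction:
            match_columns.append(j)
    return match_columns


# Toy alignment; the third row is invented to make the example interesting.
msa = [
    "WIGDEVAVK",
    "WHGD-VAVK",
    "W-GD-VA-K",
    "WHGD-VAVK",
]
print(choose_match_columns(msa))
```

Here column 4 (0-based) is gapped in three of four sequences, so it would be treated as an insert position, while columns with a single gap remain match columns. Lowering `max_gap_fraction` makes the rule stricter and yields a shorter model.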