Bioinformatics Applications of HMMs

BIOINFORMATICS APPLICATIONS OF HMMS

CSE/BIMM/BENG 181 MAY 17, 2011 SERGEI L KOSAKOVSKY POND [[email protected]] WWW.HYPHY.ORG/PUBS/181/LECTURES

OUTLINE

Definitions and terms
Training approaches
Sequence feature selection
Secondary structure prediction
Probabilistic alignment using HMMs: PFAM, HMMER
Gene finding [next major topic]
Prokaryotic genes and generalized HMMs
Eukaryotic genes

DEFINITIONS AND REVIEW

A hidden Markov model (HMM) is a generative stochastic model that assigns probabilities to finite-length strings over an alphabet A.

A four-tuple $(A, Q, P_e, P_t)$ defines a hidden Markov model H:

A: the finite alphabet over which the observed strings are defined.
Q: the finite collection of hidden states of the model.
$P_e(a_i \mid q_k)$: the probability of emitting character $a_i$ when the hidden state is $q_k$.
$P_t(q_m \mid q_k)$: the probability of a transition from hidden state $q_k$ to hidden state $q_m$ in one step.

[Figure: example HMM diagram; extracted labels: 0, 1, H, T]

ALGORITHMS

Forward or backward (sum peeling): compute the probability of an observed string $a_1 a_2 \ldots a_n$ given the emission and transition probabilities. Runs in time $O(|Q|^2 n)$, or $O(|Q| n)$ for sparse models.

Decoding (Viterbi): compute the sequence of hidden states $q_1 q_2 \ldots q_n$ that is most likely to have given rise to an observed sequence $a_1 a_2 \ldots a_n$. Runs in time $O(|Q|^2 n)$, or $O(|Q| n)$ for sparse models.
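The forward and Viterbi recursions described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's own code: the two-state coin model, the names STATES, START, TRANS and EMIT, and the explicit start distribution (which the four-tuple definition does not list) are all hypothetical. Both functions do $O(|Q|^2)$ work per observed character, which gives the $O(|Q|^2 n)$ running time quoted above.

```python
# Toy two-state model (an illustrative assumption, not taken from the lecture):
# a fair coin "F" and a biased coin "B", each emitting heads "H" or tails "T".
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}  # assumed initial state distribution
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}


def forward(obs, states, start, trans, emit):
    """Return Pr(obs | model), summed over all hidden paths."""
    # f[q] = probability of the prefix read so far, with the last state being q
    f = {q: start[q] * emit[q][obs[0]] for q in states}
    for a in obs[1:]:
        f = {m: emit[m][a] * sum(f[k] * trans[k][m] for k in states)
             for m in states}
    return sum(f.values())


def viterbi(obs, states, start, trans, emit):
    """Return (most likely hidden path, joint probability of path and obs)."""
    # v[q] = probability of the best path for the prefix that ends in state q
    v = {q: start[q] * emit[q][obs[0]] for q in states}
    backpointers = []
    for a in obs[1:]:
        ptr, nxt = {}, {}
        for m in states:
            best = max(states, key=lambda k: v[k] * trans[k][m])
            ptr[m] = best
            nxt[m] = v[best] * trans[best][m] * emit[m][a]
        backpointers.append(ptr)
        v = nxt
    # Trace back from the best final state
    last = max(states, key=lambda q: v[q])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, v[last]


if __name__ == "__main__":
    observed = "HHTHTTHH"
    print(forward(observed, STATES, START, TRANS, EMIT))
    print(viterbi(observed, STATES, START, TRANS, EMIT))
```

The sketch multiplies raw probabilities, which underflows on long sequences; practical implementations usually work with log probabilities or rescale the forward values at each step.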
Training: estimate transition and/or emission probabilities given a set of
labeled observed sequences (the corresponding hidden states are known): use frequency counts, possibly corrected;
observed sequences only: use Baum-Welch, or another non-linear optimization procedure.

TRAINING HMMS FROM LABELED SEQUENCES

CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
011111112222222111111222211111112222111110

The label string has one state per character, flanked by state 0 at the start and at the end.

TRANSITIONS (counts $A_{i,j}$, with row percentages $a_{i,j}$)

              to state
from state    0          1           2
0             0 (0%)     1 (100%)    0 (0%)
1             1 (4%)     21 (84%)    3 (12%)
2             0 (0%)     3 (20%)     12 (80%)

$a_{i,j} = \frac{A_{i,j}}{\sum_{h=0}^{|Q|-1} A_{i,h}}$

EMISSIONS (counts $E_{i,k}$, with row percentages $e_{i,k}$)

              symbol
in state      A          C          G          T
1             6 (24%)    7 (28%)    5 (20%)    7 (28%)
2             3 (20%)    3 (20%)    2 (13%)    7 (47%)

$e_{i,k} = \frac{E_{i,k}}{\sum_{h=0}^{|\Sigma|-1} E_{i,h}}$, where the sum in the denominator runs over all symbols of the alphabet.

EXAMPLE FROM: HTTP://WWW.GENEPREDICTION.ORG/BOOK/HMM-PART1.PPT

PROTEIN STRUCTURE PREDICTION

A simple model states that each residue in a folded protein can be assigned to one of three structural features:

An α-helix (hydrogen bonds between residues at offset 4)
A β-strand/sheet
Other (a loop, L)

[Figure: protein 1DZOA. Source: Cheng and Baldi, BMC Bioinformatics 2007, 8:113]

EMISSION AND TRANSITION FREQUENCIES

The frequency distribution of amino-acid residues differs between structural classes, so these frequencies can be used to estimate emission probabilities.

To estimate transition probabilities, we simply tabulate how frequently each transition happens in a large reference dataset with known structure.

[Figure: stationary frequencies of the hidden Markov chain. Source: Goldman, Thorne and Jones, JMB 1996]

TRAINING CAVEATS

The probabilities of rare transition events are difficult to estimate from count data.

Some state k may not appear in any of the training sequences.
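The count-based estimates from the labeled-sequence example can be sketched as follows. This is an illustrative sketch, not the code behind the slide: the function name train_labeled and the dictionary layout are hypothetical, and it assumes the label string carries one extra flanking state at each end, as in the example. Counts are divided by their row totals, so a state with no observations gets no row at all.

```python
from collections import Counter, defaultdict

# Sequence and state labels copied from the slide example.
SEQ = "CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC"
LABELS = "011111112222222111111222211111112222111110"


def train_labeled(seq, labels):
    """Estimate HMM parameters from one labeled sequence by frequency counts.

    labels must hold one state per character of seq plus one flanking
    state at each end (an assumption matching the slide example).
    Returns (transition counts, emission counts,
             transition probabilities, emission probabilities).
    """
    if len(labels) != len(seq) + 2:
        raise ValueError("expected len(labels) == len(seq) + 2")

    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)

    # Every adjacent pair of labels is one observed transition.
    for prev, nxt in zip(labels, labels[1:]):
        trans_counts[prev][nxt] += 1

    # The inner labels line up with the characters they emitted.
    for state, symbol in zip(labels[1:-1], seq):
        emit_counts[state][symbol] += 1

    def normalize(counts):
        # Divide each count by its row total.
        probs = {}
        for row_key, row in counts.items():
            total = sum(row.values())
            probs[row_key] = {col: c / total for col, c in row.items()}
        return probs

    return (trans_counts, emit_counts,
            normalize(trans_counts), normalize(emit_counts))


if __name__ == "__main__":
    t_counts, e_counts, a, e = train_labeled(SEQ, LABELS)
    for state in sorted(a):
        print("from", state, dict(t_counts[state]), a[state])
    for state in sorted(e):
        print("in", state, dict(e_counts[state]), e[state])
```

On the slide's data, 21 of the 25 transitions out of state 1 stay in state 1, which is the 84% shown in the transition table.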
If a state k never appears in the training sequences, then #(k→l) = 0 for every state l, and $P_t(k,l)$ cannot be computed from counts because the row total in the denominator is zero.

One can 'pad' the observed counts with pseudocounts r that reflect our prior beliefs:

$A_{k,l} = \#(k \to l \text{ transitions}) + r_{k,l}$
$E_{b,k} = \#(\text{emissions of symbol } b \text{ from state } k) + r_k(b)$

STRUCTURE INFERENCE

Given a trained HMM H and a sequence S, we can:

Run Viterbi decoding to infer the most likely hidden path of α, β and L states for the sequence.

Use the forward-backward algorithm to compute the posterior probabilities that a given position i in the amino-acid sequence is in an α-helix, a β-sheet or a loop:

$p_{i,\alpha} = \Pr\{q_i = \alpha \mid S, H\} = \frac{\Pr\{q_i = \alpha, S \mid H\}}{\Pr\{S \mid H\}}$
$p_{i,\beta} = \Pr\{q_i = \beta \mid S, H\} = \frac{\Pr\{q_i = \beta, S \mid H\}}{\Pr\{S \mid H\}}$
$p_{i,L} = \Pr\{q_i = L \mid S, H\} = \frac{\Pr\{q_i = L, S \mid H\}}{\Pr\{S \mid H\}}$

[Figure: per-position prediction profiles (vertical axis 0.0 to 0.8, positions 0 to 120) for a query sequence (Q3 = 68.5%) and for weighted sequences 1 to 7 (weights between 0.0963 and 0.150). Source: HTTP://WWW.BIOMEDCENTRAL.COM/1472-6807/6/25]
[Figure, continued: consensus prediction for protein d1jyoa (Q3 = 79.2%) and the true secondary structure (elements h1, s1, s2), positions 0 to 120]

Q3: a standard measure of structural prediction accuracy, defined as the proportion of residues assigned to the correct class.

ALHEASGPSVILFGSDVTVPPASNAEQAK   (sequence)
hhhhhoooossssooosssooooohhhhh   (true)
ohhhoooossssooooosssooohhhhhh   (22/29 = 76% - useful)
hhhhhoooohhhhooohhhooooohhhhh   (22/29 = 76% - terrible)

Both predictions place 22 of 29 residues in the correct class, but the second predicts both strands as helix, so equal Q3 values do not imply equally useful predictions.

Random assignment: Q3 = 33%
State-of-the-art prediction: Q3 ~ 80%

HTTP://NOOK.CS.UCDAVIS.EDU/~KOEHL/CLASSES/CSB/CSB_LECTURE11.PPT

HMMS ACTUALLY USEFUL FOR STRUCTURE PREDICTION...

[Figure: HMM topology for secondary structure prediction with helix states H1 to H15, coil states c1 to c12 and strand states b1 to b9. Legend: transition probability ranges p in [0.1, 0.25[, [0.25, 0.5[, [0.5, 0.75[ and p >= 0.75; hydrophilic or hydrophobic preference; secondary structure entry and exit states. A second panel shows per-residue log-odds scores for helix, coil and strand states, $\log_2(p_{iq}/P_i)$, where $p_{iq}$ is the frequency of residue i in state q and $P_i$ is the frequency of residue i in all training sequences. Source: HTTP://WWW.BIOMEDCENTRAL.COM/1472-6807/6/25]

PROFILE HMM ALIGNMENT/MATCHING

A distant relative of the functionally related sequences in a protein family may have only weak pairwise similarity to each member of the family and thus fail a significance test.

However, it may have weak similarities to many members of the family.
The goal is to align a sequence to all members of the family at once.

A family of related proteins can be represented by its multiple alignment and the corresponding profile.

PROFILE REPRESENTATION OF SEQUENCE FAMILIES

Aligned DNA sequences can be represented by a 4 × N profile matrix reflecting the frequencies of nucleotides at every aligned position.

A protein family can be represented by a 20 × N profile of amino-acid frequencies.

These frequencies can be used to estimate the emission probabilities of an HMM.

[Figure: nucleotide frequency profile for seven aligned positions of HIV protease (vertical axis 0 to 1). Seven frequency vectors survive extraction; their assignment to positions and nucleotides is not recoverable:
0.000455373 0.000819672 9.10747e-05 0.998634
0.0512143 0.119885 0.000273224 0.828628
0.000335008 0.000167504 0 0.999497
8.37521e-05 8.37521e-05 0.999749 8.37521e-05
0.000167504 0.0274707 0.000167504 0.972194
0.957377 0.0021062 0.0332003 0.00731626
0.0100599 0.981792 0.00108081 0.00706684]

MULTIPLE ALIGNMENTS AND PROTEIN FAMILY CLASSIFICATION

A multiple alignment of a protein family shows how conservation varies along the length of the protein.

Example: after aligning many globin proteins, biologists recognized that the helical regions of globins are more conserved than the other regions.

One way to visualize this variation: entropy plots.

[Figure: entropy plot for influenza A hemagglutinin (entropy 0 to 1.5, positions 0 to 300), with antigenic sites marked]

WHAT ARE PROFILE HMMS

A profile HMM is a probabilistic representation of a multiple alignment.
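The profile matrix and the per-column entropy behind an entropy plot can be sketched as below. This is a minimal illustration under stated assumptions: the toy alignment, the function names profile and entropy, and the choice of base-2 logarithms (bits) are hypothetical and not taken from the lecture, and the sketch assumes every column contains at least one symbol from the alphabet.

```python
from collections import Counter
from math import log2

# Toy DNA alignment (hypothetical data for illustration only).
ALIGNMENT = ["ACGT", "ACGA", "ACCT", "AGGT"]


def profile(alignment, alphabet="ACGT"):
    """Return one {symbol: frequency} dict per aligned column.

    Characters outside the alphabet (for example gaps) are ignored.
    Assumes each column has at least one symbol from the alphabet.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for j in range(n_cols):
        counts = Counter(row[j] for row in alignment)
        total = sum(counts[a] for a in alphabet)
        columns.append({a: counts[a] / total for a in alphabet})
    return columns


def entropy(column):
    """Shannon entropy of one profile column, in bits."""
    return -sum(p * log2(p) for p in column.values() if p > 0)


if __name__ == "__main__":
    for position, column in enumerate(profile(ALIGNMENT), start=1):
        print(position, column, round(entropy(column), 3))
```

A fully conserved column has entropy 0 and a uniform DNA column has entropy 2 bits, so low values along the alignment mark conserved positions. For proteins the same functions apply with a 20-letter alphabet string.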