PSSMs and HMMs, Models That Improve Search Sensitivity and Phenotype Prediction


Mount, Chapter 4, pp. 185-192
Durbin et al., Chapter 5 (also earlier chapters)
with help from Sean Eddy and Alex Bateman

PSSMs and HMMs, improving sensitivity and phenotype prediction

1. Review: position-independent scoring matrices (PAM250)
2. Position-specific/dependent scoring matrices
3. PSSMs and PSI-BLAST
4. Using PSSMs for phenotype prediction
5. HMMs: PSSMs with position-specific gap penalties

Scoring matrix basics: DNA transition probabilities, 1 PAM

(Diagram: the four bases a, c, g, t; each stays unchanged with probability 0.99, changes by a transition, a<->g or c<->t, with probability 0.008, and by each transversion with probability 0.001.)

        a       c       g       t      row sum
a     0.99    0.001   0.008   0.001    = 1.0
c     0.001   0.99    0.001   0.008    = 1.0
g     0.008   0.001   0.99    0.001    = 1.0
t     0.001   0.008   0.001   0.99     = 1.0

Scoring matrix basics: matrix multiples

M^2 = PAM 2
{{0.980, 0.002, 0.016, 0.002},
 {0.002, 0.980, 0.002, 0.016},
 {0.016, 0.002, 0.980, 0.002},
 {0.002, 0.016, 0.002, 0.980}}

M^5 = PAM 5
{{0.952, 0.005, 0.038, 0.005},
 {0.005, 0.951, 0.005, 0.038},
 {0.038, 0.005, 0.952, 0.005},
 {0.005, 0.038, 0.005, 0.952}}

M^10 = PAM 10
{{0.907, 0.010, 0.073, 0.010},
 {0.010, 0.907, 0.010, 0.073},
 {0.073, 0.010, 0.907, 0.010},
 {0.010, 0.073, 0.010, 0.907}}

M^100 = PAM 100
{{0.499, 0.083, 0.336, 0.083},
 {0.083, 0.499, 0.083, 0.336},
 {0.336, 0.083, 0.499, 0.083},
 {0.083, 0.336, 0.083, 0.499}}

M^1000 = PAM 1000
{{0.255, 0.245, 0.255, 0.245},
 {0.245, 0.255, 0.245, 0.255},
 {0.255, 0.245, 0.255, 0.245},
 {0.245, 0.255, 0.245, 0.255}}

Where do scoring matrices come from?
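One concrete way to answer that question, as a minimal sketch in Python: raise the 1 PAM DNA matrix to the 20th power, then turn a matrix entry into a log-odds score against a uniform background of 0.25. The helper names (mat_mul, mat_pow, log_odds) are illustrative and not from the lecture; the 10 x log10 scaling is assumed from the worked numbers in these slides.

```python
import math

# 1 PAM DNA mutation probabilities from the slide; rows and columns are a, c, g, t.
PAM1 = [
    [0.990, 0.001, 0.008, 0.001],
    [0.001, 0.990, 0.001, 0.008],
    [0.008, 0.001, 0.990, 0.001],
    [0.001, 0.008, 0.001, 0.990],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Return m raised to a non-negative integer power (repeated multiplication)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


def log_odds(q, p=0.25, scale=10.0, base=10.0):
    """Scaled log-odds score: scale * log_base(q / p)."""
    return scale * math.log(q / p, base)


pam20 = mat_pow(PAM1, 20)
match_score = log_odds(pam20[0][0])         # a aligned with a
transversion_score = log_odds(pam20[0][1])  # a aligned with c

print([round(x, 3) for x in pam20[0]])
print(round(match_score, 1), round(transversion_score, 1))
```

The first row of pam20 should agree with the PAM20 row used in the worked example to three decimals. Because the code keeps the unrounded probabilities, the transversion score can differ in the last digit from a value computed by hand from the rounded entry 0.019.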
$$\lambda S_{ij} = \log\left(\frac{q_{ij}}{p_j}\right)$$

where $q_{ij}$ is the probability of mutation and $p_j$ is the probability of alignment by chance.

Worked example with $q_{ij}$ taken from M^20 = PAM20 and $p_i(a,c,g,t) = p_j = 0.25$:

M^20 = PAM20
{{0.828, 0.019, 0.133, 0.019},
 {0.019, 0.828, 0.019, 0.133},
 {0.133, 0.019, 0.828, 0.019},
 {0.019, 0.133, 0.019, 0.828}}

$$\lambda S_{a,a} = 10\log_{10}\left(\frac{q_{a,a}}{p_a}\right) = 10\log_{10}\left(\frac{0.828}{0.25}\right) = 5.2$$

$$\lambda S_{a,c} = 10\log_{10}\left(\frac{q_{a,c}}{p_c}\right) = 10\log_{10}\left(\frac{0.019}{0.25}\right) = -11.2$$

$\lambda = 0.33$, which matches $\log_2(10)/10$, the factor that converts $10\log_{10}$ units to bits ($5.2 \times 0.33 \approx 1.72$, the PAM20 match score in the matrices below).

Scoring matrices at DNA PAMs: ratios

PAM1   ratio = 1/3.13 = +1/-3   H = 1.90   blastn (DNA)
{{ 1.99, -6.23, -6.23, -6.22},
 {-6.23,  1.99, -6.23, -6.23},
 {-6.23, -6.23,  1.99, -6.23},
 {-6.23, -6.23, -6.23,  1.99}}

PAM2   ratio = 1/2.65 = +2/-5   H = 1.82
{{ 1.97, -5.24, -5.24, -5.24},
 {-5.24,  1.98, -5.24, -5.24},
 {-5.24, -5.24,  1.98, -5.24},
 {-5.24, -5.24, -5.24,  1.98}}

PAM10  ratio = 1/1.61 = +2/-3   H = 1.40
{{ 1.86, -3.00, -3.00, -3.00},
 {-3.00,  1.86, -3.00, -3.00},
 {-3.00, -3.00,  1.86, -3.00},
 {-3.00, -3.00, -3.00,  1.86}}

PAM20  ratio = 1/1.21 = +4/-5   H = 1.05
{{ 1.72, -2.09, -2.09, -2.09},
 {-2.09,  1.72, -2.09, -2.09},
 {-2.09, -2.09,  1.72, -2.09},
 {-2.09, -2.09, -2.09,  1.72}}

PAM30  ratio = 1/1 = +1/-1   H = 0.80
{{ 1.59, -1.59, -1.59, -1.59},
 {-1.59,  1.59, -1.59, -1.59},
 {-1.59, -1.59,  1.59, -1.59},
 {-1.59, -1.59, -1.59,  1.59}}

PAM45  ratio = 1.23/1 = +5/-4   H = 0.54   fasta (DNA)
{{ 1.40, -1.14, -1.14, -1.14},
 {-1.14,  1.40, -1.14, -1.14},
 {-1.14, -1.14,  1.40, -1.14},
 {-1.14, -1.14, -1.14,  1.40}}

Mutation probability matrix for 1 PAM (values x 10,000)

        A     R     N     D     C     Q     E     G     H     I     L
A    9867     2     9    10     3     8    17    21     2     6     4
R       1  9913     1     0     1    10     0     0    10     3     1
N       4     1  9822    36     0     4     6     6    21     3     1
D       6     0    42  9859     0     6    53     6     4     1     0
C       1     1     0     0  9973     0     0     0     1     1     0
Q       3     9     4     5     0  9876    27     1    23     1     3
E      10     0     7    56     0    35  9865     4     2     3     1
G      21     1    12    11     1     3     7  9935     1     0     1
H       1     8    18     3     1    20     1     0  9912     0     1
I       2     2     3     1     2     1     2     0     0  9872     9
L       3     1     3     0     0     6     1     1     4    22  9947

Mutation probability matrix for 250 PAMs (values x 100)

        A     R     N     D     C     Q     E     G     H     I     L
A      13     6     9     9     5     8     9    12     6     8     6
R       3    17     4     3     2     5     3     2     6     3     2
N       4     4     6     7     2     5     6     4     6     3     2
D       5     4     8    11     1     7    10     5     6     3     2
C       2     1     1     1    52     1     1     2     2     2     1
Q       3     5     5     6     1    10     5     6     3     2     5
E       5     4     7    11     1     9    12     5     6     3     2
G      21     1    12    11     1     3     7  9935     1     0     1
H       1     8    18     3     1    20     1     0  9912     0     1
I       2     2     3     1     2     1     2     0     0  9872     9
L       3     1     3     0     0     6     1     1     4    22  9947

(Rows G, H, I and L of the 250 PAM table, as given in the source, repeat the 1 PAM values.)

Two expressions for S_ij

Transition frequency (probability), Durbin et al.:

$$\lambda S_{ij} = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$$

Alignment frequency (probability), Altschul:

$$\lambda S_{ij} = \log\left(\frac{q^{a}_{ij}}{p_i\,p_j}\right)$$

The two agree because the Altschul $q_{ij}$ is $p_i$ times the Durbin $q_{ij}$:

$$q^{a}_{ij} = p_i\,q^{t}_{ij} \quad\Rightarrow\quad \lambda S_{ij} = \log\left(\frac{q^{a}_{ij}}{p_i\,p_j}\right) = \log\left(\frac{q^{t}_{ij}}{p_j}\right)$$

Improved sensitivity with Position Specific Scoring Matrices

sxl_drome (1sxl) vs ru1a_human (u1a), PSI-BLAST E():
  iteration 1: <7
  iteration 2: 10^-8

Alignments show conservation

Pairwise alignment

RU1A_HUMAN rrm2   VQAGAAR
PABP_DROME rrm3   EAAEAAV
scores:           +2 +2 +2

Score matrices: 20x20, 210 parameters, position-independent

Cys   12
Ser    0   2
Thr   -2   1   3
Pro   -1   1   0   6
Ala   -2   1   1   1   2
Gly   -3   1   0  -1   1   5
Asn   -4   1   0  -1   0   0   2
Asp   -5   0   0  -1   0   1   2   4
Glu   -5   0   0  -1   0   0   1   3   4
Gln   -5  -1  -1   0   0  -1   1   2   2   4
       C   S   T   P   A   G   N   D   E   Q

Profile (position-specific) alignment

RU1A_HUMAN rrm1   SSATNAL
RU1A_HUMAN rrm2   VQAGAAR   query
SFR1_HUMAN rrm1   RDAEDAV
SXLF_DROME rrm1   MDSQRAI
PABP_DROME rrm3   EAAEAAV   target
scores:           +3 +4 0

Profile: 20 scores per column, position-dependent

Where pairwise scores come from

$$\mathrm{score}(AA) = \log\frac{P(A \mid A)}{f(A)}$$

P(A|A), the "probability of A given an A": the observed probability of seeing an A aligned to an A in real alignments.
f(A), the "frequency of A": the expected frequency of A in any sequence.

$$Sc(AA) = \log_2\frac{0.64}{0.04} = +4 \qquad Sc(AE) = \log_2\frac{0.01}{0.04} = -2$$

Where profile scores (should) come from

$$\mathrm{score}(A \mid x) = \log\frac{P(A \mid \text{position } x)}{f(A)}$$

P(A|position x), the "probability of A at position x": the observed probability of seeing an A in the consensus column x.

$$Sc(A \mid 6) = \log_2\frac{1.00}{0.04} = +4.6 \qquad Sc(A \mid 5) = \log_2\frac{0.04}{0.04} = 0$$

$$Sc(N \mid 6) = \log_2\frac{0.00}{0.06} = -\infty \qquad Sc(N \mid 5) = \log_2\frac{0.06}{0.06} = 0$$

1. What about position-specific gap penalties?
2. How to estimate parameters from small numbers of observations?
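The -inf score for N at column 6 is the small-sample problem raised in question 2: with five sequences, any residue not yet observed is scored as impossible. A common remedy is to blend the observed counts with background-weighted pseudocounts. The following is a minimal sketch of that idea, not the lecture's method; the function name column_score, the parameter pseudo_weight and its value 1.0 are assumptions chosen for illustration. The background frequencies 0.04 (A) and 0.06 (N) are the ones used on the slide.

```python
import math
from collections import Counter

# The five aligned RRM segments from the profile alignment slide.
ALIGNMENT = ["SSATNAL", "VQAGAAR", "RDAEDAV", "MDSQRAI", "EAAEAAV"]


def column_score(count, n_seqs, background, pseudo_weight=0.0):
    """log2 odds of a residue in one column.

    count         -- times the residue is observed in the column
    n_seqs        -- number of sequences in the alignment
    background    -- expected frequency f(residue) in any sequence
    pseudo_weight -- hypothetical weight given to the background prior
    """
    p = (count + pseudo_weight * background) / (n_seqs + pseudo_weight)
    if p == 0.0:
        return float("-inf")
    return math.log2(p / background)


# Column 6 (index 5) holds an A in every sequence.
col6 = Counter(seq[5] for seq in ALIGNMENT)
n = len(ALIGNMENT)

raw_A = column_score(col6["A"], n, 0.04)              # log2(1.00 / 0.04), about +4.6
raw_N = column_score(col6["N"], n, 0.06)              # unobserved residue: -inf
smoothed_N = column_score(col6["N"], n, 0.06, 1.0)    # finite once pseudocounts are added

print(raw_A, raw_N, smoothed_N)
```

With pseudo_weight set to 0 the function reproduces the slide's scores; any positive weight makes every score finite, at the cost of pulling well-supported scores slightly toward zero.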
Query: atp6_human.aa ATP synthase a chain - 226 aa
Library: 5190103 residues in 13351 sequences

The best scores are:                          ( len)  s-w  bits E(13351)  %_id %_sim alen
sp|P00846|ATP6_HUMAN ATP synthase a chain (AT ( 226) 1400 325.8  5.8e-90 1.000 1.000  226
sp|P00847|ATP6_BOVIN ATP synthase a chain (AT ( 226) 1157 270.5  2.5e-73 0.779 0.951  226
sp|P00848|ATP6_MOUSE ATP synthase a chain (AT ( 226) 1118 261.7  1.2e-70 0.757 0.916  226
sp|P00849|ATP6_XENLA ATP synthase a chain (AT ( 226)  745 176.8  4.0e-45 0.533 0.847  229
sp|P00851|ATP6_DROYA ATP synthase a chain (AT ( 224)  473 115.0  1.7e-26 0.378 0.721  222
sp|P00854|ATP6_YEAST ATP synthase a chain pre ( 259)  428 104.7  2.3e-23 0.353 0.694  232
sp|P00852|ATP6_EMENI ATP synthase a chain pre ( 256)  365  90.4  4.8e-19 0.304 0.691  230
sp|P14862|ATP6_COCHE ATP synthase a chain (AT ( 257)  353  87.7  3.2e-18 0.313 0.650  214
sp|P68526|ATP6_TRITI ATP synthase a chain (AT ( 386)  309  77.6  5.1e-15 0.289 0.651  235
sp|P05499|ATP6_TOBAC ATP synthase a chain (AT ( 395)  309  77.6  5.2e-15 0.283 0.635  233
sp|P07925|ATP6_MAIZE ATP synthase a chain (AT ( 291)  283  71.7  2.3e-13 0.311 0.667  180
sp|P0AB98|ATP6_ECOLI ATP synthase a chain (AT ( 271)  178  47.9  3.2e-06 0.233 0.585  236
sp|P0C2Y5|ATPI_ORYSA Chloroplast ATP synth (A ( 247)  144  40.1  0.00062 0.242 0.580  231
sp|P06452|ATPI_PEA Chloroplast ATP synthase a ( 247)  143  39.9  0.00072 0.250 0.586  232
sp|P27178|ATP6_SYNY3 ATP synthase a chain (AT ( 276)  142  39.7  0.00095 0.265 0.571  170
sp|P06451|ATPI_SPIOL Chloroplast ATP synthase ( 247)  138  38.8   0.0016 0.242 0.580  231
sp|P08444|ATP6_SYNP6 ATP synthase a chain (AT ( 261)  127  36.3   0.0095 0.263 0.557  167
sp|P69371|ATPI_ATRBE Chloroplast ATP synthase ( 247)  126  36.0     0.01 0.221 0.571  231
sp|P06289|ATPI_MARPO Chloroplast ATP synthase ( 248)  126  36.0    0.011 0.240 0.575  167
sp|P30391|ATPI_EUGGR Chloroplast ATP synthase ( 251)  123  35.4    0.017 0.257 0.579  214
sp|P19568|TLCA_RICPR ADP,ATP carrier protein  ( 498)  122  35.0    0.043 0.243 0.579  152
sp|P24966|CYB_TAYTA Cytochrome b              ( 379)  113  33.0     0.13 0.234 0.532  158
sp|P03892|NU2M_BOVIN NADH-ubiquinone oxidored ( 347)  107  31.7     0.31 0.261 0.479  211
sp|P68092|CYB_STEAT Cytochrome b              ( 379)  104  31.0     0.54 0.277 0.547  137
sp|P03891|NU2M_HUMAN NADH-ubiquinone oxidored ( 347)  103  30.8     0.58 0.201 0.537  149
sp|P00156|CYB_HUMAN Cytochrome b              ( 380)  102  30.5     0.74 0.268 0.585  205
sp|P15993|AROP_ECOLI Aromatic amino acid tr   ( 457)  103  30.7     0.78 0.234 0.622  111
sp|P24965|CYB_TRANA Cytochrome b              ( 379)  101  30.3     0.87 0.234 0.563  158
sp|P29631|CYB_POMTE Cytochrome b              ( 308)   99  29.9     0.95 0.274 0.584  113
sp|P24953|CYB_CAPHI Cytochrome b              ( 379)   99  29.8      1.2 0.236 0.564  140

16 8 vs human/E.