An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement in Accuracy
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Improving the Prediction of Transcription Factor Binding Sites To
Improving the prediction of transcription factor binding sites to aid the interpretation of non-coding single nucleotide variants. Narayan Jayaram Research Department of Structural and Molecular Biology University College London A thesis submitted to University College London for the degree of Doctor of Philosophy 1 Declaration I, Narayan Jayaram confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. Narayan Jayaram 2 Abstract Single nucleotide variants (SNVs) that occur in transcription factor binding sites (TFBSs) can disrupt the binding of transcription factors and alter gene expression which can cause inherited diseases and act as driver SNVs in cancer. The identification of SNVs in TFBSs has historically been challenging given the limited number of experimentally characterised TFBSs. The recent ENCODE project has resulted in the availability of ChIP-Seq data that provides genome wide sets of regions bound by transcription factors. These data have the potential to improve the identification of SNVs in TFBSs. However, as the ChIP-Seq data identify a broader range of DNA in which a transcription factor binds, computational prediction is required to identify the precise TFBS. Prediction of TFBSs involves scanning a DNA sequence with a Position Weight Matrix (PWM) using a pattern matching tool. This thesis focusses on the prediction of TFBSs by: (a) evaluating a set of locally-installable pattern-matching tools and identifying the best performing tool (FIMO), (b) using the ENCODE ChIP-Seq data to evaluate a set of de novo motif discovery tools that are used to derive PWMs which can handle large volumes of data, (c) identifying the best performing tool (rGADEM), (d) using rGADEM to generate a set of PWMs from the ENCODE ChIP-Seq data and (e) by finally checking that the selection of the best pattern matching tool is not unduly influenced by the choice of PWMs. -
Applied Category Theory for Genomics – an Initiative
Applied Category Theory for Genomics { An Initiative Yanying Wu1,2 1Centre for Neural Circuits and Behaviour, University of Oxford, UK 2Department of Physiology, Anatomy and Genetics, University of Oxford, UK 06 Sept, 2020 Abstract The ultimate secret of all lives on earth is hidden in their genomes { a totality of DNA sequences. We currently know the whole genome sequence of many organisms, while our understanding of the genome architecture on a systematic level remains rudimentary. Applied category theory opens a promising way to integrate the humongous amount of heterogeneous informations in genomics, to advance our knowledge regarding genome organization, and to provide us with a deep and holistic view of our own genomes. In this work we explain why applied category theory carries such a hope, and we move on to show how it could actually do so, albeit in baby steps. The manuscript intends to be readable to both mathematicians and biologists, therefore no prior knowledge is required from either side. arXiv:2009.02822v1 [q-bio.GN] 6 Sep 2020 1 Introduction DNA, the genetic material of all living beings on this planet, holds the secret of life. The complete set of DNA sequences in an organism constitutes its genome { the blueprint and instruction manual of that organism, be it a human or fly [1]. Therefore, genomics, which studies the contents and meaning of genomes, has been standing in the central stage of scientific research since its birth. The twentieth century witnessed three milestones of genomics research [1]. It began with the discovery of Mendel's laws of inheritance [2], sparked a climax in the middle with the reveal of DNA double helix structure [3], and ended with the accomplishment of a first draft of complete human genome sequences [4]. -
A Community Proposal to Integrate Structural
F1000Research 2020, 9(ELIXIR):278 Last updated: 11 JUN 2020 OPINION ARTICLE A community proposal to integrate structural bioinformatics activities in ELIXIR (3D-Bioinfo Community) [version 1; peer review: 1 approved, 3 approved with reservations] Christine Orengo1, Sameer Velankar2, Shoshana Wodak3, Vincent Zoete4, Alexandre M.J.J. Bonvin 5, Arne Elofsson 6, K. Anton Feenstra 7, Dietland L. Gerloff8, Thomas Hamelryck9, John M. Hancock 10, Manuela Helmer-Citterich11, Adam Hospital12, Modesto Orozco12, Anastassis Perrakis 13, Matthias Rarey14, Claudio Soares15, Joel L. Sussman16, Janet M. Thornton17, Pierre Tuffery 18, Gabor Tusnady19, Rikkert Wierenga20, Tiina Salminen21, Bohdan Schneider 22 1Structural and Molecular Biology Department, University College, London, UK 2Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, CB10 1SD, UK 3VIB-VUB Center for Structural Biology, Brussels, Belgium 4Department of Oncology, Lausanne University, Swiss Institute of Bioinformatics, Lausanne, Switzerland 5Bijvoet Center, Faculty of Science – Chemistry, Utrecht University, Utrecht, 3584CH, The Netherlands 6Science for Life Laboratory, Stockholm University, Solna, S-17121, Sweden 7Dept. Computer Science, Center for Integrative Bioinformatics VU (IBIVU), Vrije Universiteit, Amsterdam, 1081 HV, The Netherlands 8Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, L-4367, Luxembourg 9Bioinformatics center, Department of Biology, University of Copenhagen, Copenhagen, DK-2200, -
SD Gross BFI0403
Janet Thornton Bioinformatician avant la lettre Michael Gross B ioinformatics is very much a buzzword of our time, with new courses and institutes dedicated to it sprouting up almost everywhere. Most significantly, the flood of genome data has raised the gen- eral awareness of the need to deve-lop new computational approaches to make sense of all the raw information collected. Professor Janet Thornton, the current director of the European Bioinformatics Institute (EBI), an EMBL outpost based at the Hinxton campus near Cambridge, has been in the field even before there was a word for it. Coming to structural biology with a physics degree from the University of Nottingham, she was already involved with computer-generated structural im- ages in the 1970s, when personal comput- ers and user-friendly programs had yet to be invented. The Early Years larities. Within 15 minutes, the software From there to the EBI, her remarkable Janet Thornton can check all 2.4 billion possible re- career appears to be organised in lationships and pick the ones relevant decades. During the 1970s, she did doc- software to compare structures to each to the question at hand. In comparison toral and post-doctoral research at other, recognise known folds and spot to publicly available bioinformatics the Molecular Biophysics Laboratory in new ones. Such work provides both packages such as Blast or Psiblast, Oxford and at the National Institute for fundamental insights into the workings Biopendium can provide an additional Medical Research in Mill Hill, near Lon- of evolution on a molecular level, and 30 % of annotation, according to Inphar- don. -
Functional Effects Detailed Research Plan
GeCIP Detailed Research Plan Form Background The Genomics England Clinical Interpretation Partnership (GeCIP) brings together researchers, clinicians and trainees from both academia and the NHS to analyse, refine and make new discoveries from the data from the 100,000 Genomes Project. The aims of the partnerships are: 1. To optimise: • clinical data and sample collection • clinical reporting • data validation and interpretation. 2. To improve understanding of the implications of genomic findings and improve the accuracy and reliability of information fed back to patients. To add to knowledge of the genetic basis of disease. 3. To provide a sustainable thriving training environment. The initial wave of GeCIP domains was announced in June 2015 following a first round of applications in January 2015. On the 18th June 2015 we invited the inaugurated GeCIP domains to develop more detailed research plans working closely with Genomics England. These will be used to ensure that the plans are complimentary and add real value across the GeCIP portfolio and address the aims and objectives of the 100,000 Genomes Project. They will be shared with the MRC, Wellcome Trust, NIHR and Cancer Research UK as existing members of the GeCIP Board to give advance warning and manage funding requests to maximise the funds available to each domain. However, formal applications will then be required to be submitted to individual funders. They will allow Genomics England to plan shared core analyses and the required research and computing infrastructure to support the proposed research. They will also form the basis of assessment by the Project’s Access Review Committee, to permit access to data. -
UNIVERSITY of CALIFORNIA SAN DIEGO Making Sense of Microbial
UNIVERSITY OF CALIFORNIA SAN DIEGO Making sense of microbial populations from representative samples A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by James T. Morton Committee in charge: Professor Rob Knight, Chair Professor Pieter Dorrestein Professor Rachel Dutton Professor Yoav Freund Professor Siavash Mirarab 2018 Copyright James T. Morton, 2018 All rights reserved. The dissertation of James T. Morton is approved, and it is acceptable in quality and form for publication on microfilm and electronically: Chair University of California San Diego 2018 iii DEDICATION To my friends and family who paved the road and lit the journey. iv EPIGRAPH The ‘paradox’ is only a conflict between reality and your feeling of what reality ‘ought to be’ —Richard Feynman v TABLE OF CONTENTS Signature Page . iii Dedication . iv Epigraph . .v Table of Contents . vi List of Abbreviations . ix List of Figures . .x List of Tables . xi Acknowledgements . xii Vita ............................................. xiv Abstract of the Dissertation . xvii Chapter 1 Methods for phylogenetic analysis of microbiome data . .1 1.1 Introduction . .2 1.2 Phylogenetic Inference . .4 1.3 Phylogenetic Comparative Methods . .6 1.4 Ancestral State Reconstruction . .9 1.5 Analysis of phylogenetic variables . 11 1.6 Using Phylogeny-Aware Distances . 15 1.7 Challenges of phylogenetic analysis . 18 1.8 Discussion . 19 1.9 Acknowledgements . 21 Chapter 2 Uncovering the horseshoe effect in microbial analyses . 23 2.1 Introduction . 24 2.2 Materials and Methods . 34 2.3 Acknowledgements . 35 Chapter 3 Balance trees reveal microbial niche differentiation . 36 3.1 Introduction . -
ISCB Ebola Award for Important Future Research on the Computational Biology of Ebola Virus
ISCB Ebola Award for Important Future Research on the Computational Biology of Ebola Virus Journal Information Article/Issue Information Journal ID (nlm-ta): F1000Res Self URI: f1000research-4-6464.pdf Journal ID (iso-abbrev): F1000Res Date accepted: 13 January 2015 Journal ID (pmc): F1000Research Publication date (electronic): 15 January 2015 Title: F1000Research Publication date (collection): 2015 ISSN (electronic): 2046-1402 Volume: 4 Publisher: F1000Research (London, UK) Electronic Location Identifier: 12 Article Id (accession): PMC4457108 Article Id (pmcid): PMC4457108 Article Id (pmc-uid): 4457108 PubMed ID: 26097686 DOI: 10.12688/f1000research.6038.1 Funding: The author(s) declared that no grants were involved in supporting this work. Categories Subject: Editorial Categories Subject: Articles Subject: Bioinformatics Subject: Theory & Simulation Subject: Tropical & Travel-Associated Diseases Subject: Viral Infections (without HIV) Subject: Virology ISCB Ebola Award for Important Future Research on the Computational Biology of Ebola Virus v1; ref status: not peer reviewed Peter D. Karp Bonnie Berger Diane Kovats Thomas Lengauer Michal Linial Pardis Sabeti Winston Hide Burkhard Rost 1 1International Society for Computational Biology, La Jolla, CA, USA 2 2SRI International, Menlo Park, CA, USA 3 3Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA 4 4Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbruecken, Germany 5 5Hebrew University & Institute of Advanced -
Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform
Universidad Nacional de Colombia Sede Manizales Master’s Thesis Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform Author: Supervisor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez A thesis submitted in fulfillment of the requirements for the degree of Master’s on Engineering - Industrial Automation in the Department of Electronic, Electric Engineering and Computation Signal Processing and Recognition Group June 2014 Universidad Nacional de Colombia Sede Manizales Tesis de Maestr´ıa Metodolog´ıapara predecir la anotaci´on sem´antica de prote´ınaspor medio de extracci´on de caracter´ısticas derivadas de potenciales de contacto y transformada wavelet continua Autor: Tutor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez Tesis presentada en cumplimiento a los requerimientos necesarios para obtener el grado de Maestr´ıaen Ingenier´ıaen Automatizaci´onIndustrial en el Departamento de Ingenier´ıaEl´ectrica,Electr´onicay Computaci´on Grupo de Procesamiento Digital de Senales Enero 2014 UNIVERSIDAD NACIONAL DE COLOMBIA Abstract Faculty of Engineering and Architecture Department of Electronic, Electric Engineering and Computation Master’s on Engineering - Industrial Automation Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform by Gustavo Alonso Arango Argoty In this thesis, a method to predict semantic annotations of the proteins from its primary structure is proposed. The main contribution of this thesis lies in the implementation of a novel protein feature representation, which makes use of the pairwise statistical contact potentials describing the protein interactions and geometry at the atomic level. -
Standardised Benchmarking in the Quest for Orthologs
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Harvard University - DASH Standardised Benchmarking in the Quest for Orthologs The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Altenhoff, A. M., B. Boeckmann, S. Capella-Gutierrez, D. A. Dalquen, T. DeLuca, K. Forslund, J. Huerta-Cepas, et al. 2016. “Standardised Benchmarking in the Quest for Orthologs.” Nature methods 13 (5): 425-430. doi:10.1038/nmeth.3830. http://dx.doi.org/10.1038/ nmeth.3830. Published Version doi:10.1038/nmeth.3830 Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:29408292 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA HHS Public Access Author manuscript Author ManuscriptAuthor Manuscript Author Nat Methods Manuscript Author . Author manuscript; Manuscript Author available in PMC 2016 October 04. Published in final edited form as: Nat Methods. 2016 May ; 13(5): 425–430. doi:10.1038/nmeth.3830. Standardised Benchmarking in the Quest for Orthologs Adrian M. Altenhoff1,2, Brigitte Boeckmann3, Salvador Capella-Gutierrez4,5,6, Daniel A. Dalquen7, Todd DeLuca8, Kristoffer Forslund9, Jaime Huerta-Cepas9, Benjamin Linard10, Cécile Pereira11,12, Leszek P. Pryszcz4, Fabian Schreiber13, Alan Sousa da Silva13, Damian Szklarczyk14,15, Clément-Marie Train1, Peer Bork9,16,17, Odile Lecompte18, Christian von Mering14,15, Ioannis Xenarios3,19,20, Kimmen Sjölander21, Lars Juhl Jensen22, Maria J. -
Multiscale Modeling in Systems Biology
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 2051 Multiscale Modeling in Systems Biology Methods and Perspectives ADRIEN COULIER ACTA UNIVERSITATIS UPSALIENSIS ISSN 1651-6214 ISBN 978-91-513-1225-5 UPPSALA URN urn:nbn:se:uu:diva-442412 2021 Dissertation presented at Uppsala University to be publicly examined in 2446 ITC, Lägerhyddsvägen 2, Uppsala, Friday, 10 September 2021 at 10:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Mark Chaplain (University of St Andrews). Abstract Coulier, A. 2021. Multiscale Modeling in Systems Biology. Methods and Perspectives. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 2051. 60 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-513-1225-5. In the last decades, mathematical and computational models have become ubiquitous to the field of systems biology. Specifically, the multiscale nature of biological processes makes the design and simulation of such models challenging. In this thesis we offer a perspective on available methods to study and simulate such models and how they can be combined to handle biological processes evolving at different scales. The contribution of this thesis is threefold. First, we introduce Orchestral, a multiscale modular framework to simulate multicellular models. By decoupling intracellular chemical kinetics, cell-cell signaling, and cellular mechanics by means of operator-splitting, it is able to combine existing software into one massively parallel simulation. Its modular structure makes it easy to replace its components, e.g. to adjust the level of modeling details. We demonstrate the scalability of our framework on both high performance clusters and in a cloud environment. -
OMA, a Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements
OMA, A Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements Christophe Dessimoz, Gina Cannarozzi, Manuel Gil, Daniel Margadant, Alexander Roth, Adrian Schneider, and Gaston H. Gonnet ETH Zurich, Institute of Computational Science, CH-8092 Z¨urich [email protected] Abstract. The OMA project is a large-scale effort to identify groups of orthologs from complete genome data, currently 150 species. The algo- rithm relies solely on protein sequence information and does not require any human supervision. It has several original features, in particular a verification step that detects paralogs and prevents them from being clustered together. Consistency checks and verification are performed throughout the process. The resulting groups, whenever a comparison could be made, are highly consistent both with EC assignments, and with assignments from the manually curated database HAMAP. A highly ac- curate set of orthologous sequences constitutes the basis for several other investigations, including phylogenetic analysis and protein classification. Complete genomes give scientists a valuable resource to assign functions to se- quences and to analyze their evolutionary history. These analyses rely heavily on gene comparison through sequence alignment algorithms that output the level of similarity, which gives an indication of homology. When homologous sequences are of interest, care must often be taken to distinguish between orthologous and paralogous proteins [1]. Both orthologs and paralogs come from the same ancestral sequence, and therefore are homologous, but they differ in the way they arise: paralogous se- quences are the product of gene duplication, while orthologous sequences are the product of speciation. -
Bioinformatic Methods in Applied Genomic Research
Alma mater studiorum - Università di Bologna Dottorato in Biotecnologie Cellulari e Molecolari: XXIII ciclo Settore scientifico disciplinare di afferenza: BIO11 BIOINFORMATIC METHODS IN APPLIED GENOMIC RESEARCH Presentato da Raffaele Fronza Coordinatore Dottorato: Supervisore: Prof. Santi Mario Spampinato Prof. Rita Casadio Esame finale anno 2011 ii Abstract Here I will focus on three main topics that best address and include the projects I have been working in during my three year PhD period that I have spent in different research laboratories addressing both computationally and practically important problems all related to modern molecular genomics. The first topic is the use of livestock species (pigs) as a model of obesity, a complex hu- man dysfunction. My efforts here concern the detection and annotation of Single Nucleotide Polymorphisms. I developed a pipeline for mining human and porcine sequences. Starting from a set of human genes related with obesity the platform returns a list of annotated porcine SNPs extracted from a new set of potential obesity-genes. 565 of these SNPs were analyzed on an Illumina chip to test the involvement in obesity on a population composed by more than 500 pigs. Results will be discussed. All the computational analysis and experiments were done in collaboration with the Biocomputing group and Dr.Luca Fontanesi, respectively, under the direction of prof. Rita Casadio at the Bologna University, Italy. The second topic concerns developing a methodology, based on Factor Analysis, to simul- taneously mine information from different levels of biological organization. With specific test cases we develop models of the complexity of the mRNA-miRNA molecular interaction in brain tumors measured indirectly by microarray and quantitative PCR.