From Existing Data to Novel Hypotheses
Total Page:16
File Type:pdf, Size:1020Kb
From existing data to novel hypotheses Design and application of structure-based Molecular Class Specific Information Systems Remko Kuipers Thesis committee Thesis supervisors Prof. dr. ir. V.A.P. Martins dos Santos Professor of Systems and Synthetic Biology Wageningen University Prof. dr. G. Vriend Professor of Bioinformatics of Macromolecular Structures CMBI, Radboud University Nijmegen, Medical Centre Thesis co-supervisor Dr. P.J. Schaap Assistant professor, Laboratory of Systems and Synthetic Biology Wageningen University Other members Prof. dr. S.C. de Vries, Wageningen University Prof. dr. T.R.J.M. Desmet, Ghent University, Belgium Dr. S.A.F.T. van Hijum, Radboud University Nijmegen Dr. R. de Jong, DSM Biotechnology Center, Delft This research was conducted under the auspices of the Graduate School VLAG (Advanced studies in Food Technology, Agrobiotechnology, Nutrition and Health Sciences). From existing data to novel hypotheses Design and application of structure-based Molecular Class Specific Information Systems Remko Kuipers Thesis submitted in fulfilment of the requirements for the decree of doctor at Wageningen University by the authority of the Rector Magnificus Prof. dr. M.J. Kropff, in the presence of the Thesis Committee appointed by the Academic Board to be defended in public on Wednesday December 12th, 2012 at 11 a.m. in the Aula. Remko K.P. Kuipers From existing data to novel hypotheses. Design and application of structure- based Molecular Class Specific Information Systems. 232 pages Thesis, Wageningen University, Wageningen, The Netherlands (2012) With references, with summaries in Dutch and English ISBN: 978-94-6173-350-4 Table of Contents Chapter 1 General Introduction 7 Chapter 2 Technical Background 49 Chapter 3 3DM: systematic analysis of heterogeneous super- 95 family data to discover protein functionalities Chapter 4 Protein structure analysis of mutations causing 119 inheritable diseases. An e-Science approach with life scientist friendly interfaces Chapter 5 Correlated mutation analyses on super-family 137 alignments reveal functionally important residues Chapter 6 Novel tools for extraction and validation of 157 disease related mutations applied to Fabry disease Chapter 7 The α/β-Hydrolase Fold 3DM Database 175 (ABHDB) as a Tool for Protein Engineering Chapter 8 Increasing the thermostability of sucrose 195 phosphorylase by a combination of sequence- and structure-based mutagenesis Chapter 9 General discussion 211 Chapter 10 Summary 219 Acknowledgements 227 Curriculum Vitae 229 List of publications 230 Overview of completed training activities 231 General Introduction Chapter 1 1.1 Life Sciences The life sciences encompass a wide range of research fields aiming to understand biological systems on all levels. Today, four billion years after life first arose on earth, it has diversified into three domains: eukaryotes, prokaryotes, and archaea, each consisting of innumerable species. Several million species are currently found in nature and described by scientists and a multitude of this number has yet to be categorized [1]. An even larger number of species have evolved, thrived and become extinct again in the four billion year history of live on earth [2]. Life can be studied in many different ways, and while biology remains at the heart of the life sciences, many more specialized fields have sprung up. Life science research projects are involved in the study of organisms, understanding of biological processes and pathways, interactions within cells, transport of compounds within and between cells, the production of useful products, defence versus external threats, reactions to stimuli and stress, and many other areas of interest. To be able to study life from all these different angles scientists produce and consume large amounts of data. One data driven classification for research fields in the life sciences is directly linked to the data aspect of biological systems that is studied. Nomenclature for these classifications commonly uses the subject of interest appended with the suffix omics, a neologism referring to obtaining a class of bio-data in high throughput. A large number of different omics techniques and methodologies have been developed of which a few that we encountered while working on this thesis are listed here. 1) Genomics; 2) Transcriptomics; 3) Proteomics; 4) Secretomics; 5) Interactomics; 6) Variomics; 7) Metabolomics; 8) Fluxomics; Different aspects of biological systems are thus studied by different high-throughput omics techniques. Genomics projects, for example, focus on the genome of an organism. Transcriptomics focuses on the subset of those genes that are transcribed under specific conditions such as a disease or external stimuli. Proteomics studies the expression of proteins and secretomics how a subset of these proteins is secreted which is important for commercial protein engineering. Interactomics aims to understand the interactions among the many biological components in a system, for instance, the full protein-protein interaction network of a cell. Variomics studies the variation in DNA and protein sequences between individuals and populations for applications in healthcare. Metabolomics studies metabolites and fluxomics their fluxes in, out, and through cells or cell organelles. 8 Introduction The overall goal of research projects in all these fields is to improve the understanding of life in all its varieties. A broader viewpoint encompassing data from multiple fields and in- depth analyses are often required to study biological components and mechanisms. Small perturbations in biological processes may have huge consequences requiring studying the entire system instead of just a single component. Systems biology studies how higher-level properties emerge from complex networks of interactions among the many individual components of biological systems. Systems biology projects include an iterative cycle of experimental generation of heterogeneous data sets, the analyses of these data sets, and their subsequent integration into mathematical models. Several cycles are usually required to generated and fine tune a model for a specific purpose. The work described in this thesis mainly focusses on data integration. In addition, several tools are discussed for data analyses and in some projects predictions were experimentally verified. Systems biology projects require high quality data to accurately predict biological phenomena. Nowadays, a substantial part of life science research is data centric requiring a wide variety of equipment, methodologies, and tools used to produce those bio- data. Biological data is thus produced by lots of different researchers, laboratories, and organizations, and made available to the research community in lots of different ways. Due to the size and complexity of these data sets, structured storage methods are required to allow research projects to access and utilize the full potential of these data sets. On-line databases are nowadays commonly used to offer public access to data from research projects. For example, institutes such as NCBI, EBI, or CMBI offer public access to databases that contain many different data types such as protein sequences [3–5], structures [6,7], annotations [3,8] or publications [9]. Frequently used data types, such as sequences and structures, have largely been internationally standardized. These data types tend to get deposited in publicly accessible databases hosted by large universities or governmental organizations that have the required financial backing to maintain these repositories for years on end. These data types, for example protein sequences or gene expression data, are generally of a low information content, and relatively easy to produce with currently available high throughput techniques. Higher quality, experimentally derived data sets such as pathway fluxes and ligand binding studies are more detailed. These datasets are, however, more difficult and expensive to produce because much more human expertise and more expensive equipment is required to generate the data. Public databases are often not available for these highly specialized data types due to the smaller number of producers and consumers. These data sets are therefore usually offered by the producers themselves, despite the costs and burden of maintaining in-house data repositories. New developments and improvements in equipment and software tools will lead to easier and more cost efficient methods of producing these datasets. Additionally, increased storage efficiency and an improved ability to store diverse data types in generic databases will lead to easier and cheaper hosting of data sets. Combined, these developments will lead to a continuous increase in the amount and quality of data that will become publicly available [10–12]. 9 Chapter 1 The availability of many diverse data sets for scientist is important as novel insights into systems of interest can be gained not only by novel research and experiments but also by combining and integrating existing data. Even data originally generated for a completely different purpose might be relevant and useful for a novel question when properly integrated with all other existing knowledge. Research projects thus produce and consume lots of different data types that have to be interlinked and integrated into a database to be of any use. This thesis describes how molecular class specific information systems (MCSISes) can be used to handle a vast amount of highly heterogeneous data