BIOMOLECULAR SIMULATION DATA MANAGEMENT IN HETEROGENEOUS ENVIRONMENTS

Julien Charles Victor Thibault

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Biomedical Informatics
The University of Utah
December 2014

Copyright © Julien Charles Victor Thibault 2014
All Rights Reserved

The University of Utah Graduate School

STATEMENT OF DISSERTATION APPROVAL

The dissertation of Julien Charles Victor Thibault has been approved by the following supervisory committee members:

Julio Cesar Facelli, Chair (date approved: 4/2/2014)
Thomas E. Cheatham, Member (date approved: 3/31/2014)
Karen Eilbeck, Member (date approved: 4/3/2014)
Lewis J. Frey, Member (date approved: 4/2/2014)
Scott P. Narus, Member (date approved: 4/4/2014)

and by Wendy W. Chapman, Chair of the Department of Biomedical Informatics, and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

Over 40 years ago, the first computer simulation of a protein was reported: the atomic motions of a 58-amino-acid protein were simulated for a few picoseconds. With today’s supercomputers, simulations of large biomolecular systems with hundreds of thousands of atoms can reach biologically significant timescales. Through dynamics information, biomolecular simulations can provide new insights into molecular structure and function to support the development of new drugs or therapies. While the recent advances in high-performance computing hardware and computational methods have enabled scientists to run longer simulations, they have also created new challenges for data management. Investigators need to use local and national resources to run these simulations and store their output, which can reach terabytes of data on disk.
Because of the wide variety of computational methods and software packages available to the community, no standard data representation has been established to describe the computational protocol and the output of these simulations, preventing data sharing and collaboration. Data exchange is also limited due to the lack of repositories and tools to summarize, index, and search biomolecular simulation datasets. In this dissertation a common data model for biomolecular simulations is proposed to guide the design of future databases and APIs. The data model was then extended to a controlled vocabulary that can be used in the context of the semantic web. Two different approaches to data management are also proposed. The iBIOMES repository offers a distributed environment where input and output files are indexed via common data elements. The repository includes a dynamic web interface to summarize, visualize, search, and download published data. A simpler tool, iBIOMES Lite, was developed to generate summaries of datasets hosted at remote sites where user privileges and/or IT resources might be limited. These two informatics-based approaches to data management offer new means for the community to keep track of distributed and heterogeneous biomolecular simulation data and create collaborative networks.

To My Family

TABLE OF CONTENTS

ABSTRACT
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ACKNOWLEDGMENTS

Chapter

1. INTRODUCTION
  Problem statement
  Main objectives
    Aim 1
    Aim 2
    Aim 3
  Dissertation outline
  References

2. BACKGROUND
  Biomolecular simulations
    Introduction
    Molecular dynamics
    Quantum chemistry
  Computing environment
    Software packages
    High-performance computing hardware
    Data storage
  Data sharing
    Experimental data
    Molecular simulations
  Dissertation
  References

3. DATA MODEL, DICTIONARIES, AND DESIDERATA FOR BIOMOLECULAR SIMULATION DATA INDEXING AND SHARING
  Abstract
    Background
    Results
    Conclusions
  Background
    Introduction
    Motivation for a common data representation: examples
  Experimental
    Identification of core data elements
    Logical model
    Dictionaries
  Results
    Identification of core data elements
    Logical model
    Dictionaries
  Discussion
  Conclusion
  Methods
  References

4. THESAURUS AND ONTOLOGY DEVELOPMENTS FOR BIOMOLECULAR SIMULATION DATA EXCHANGE
  Abstract
  Introduction
  Methods