Proquest Dissertations

©Copyright 2011 Andrew M. Simms Mining Mountains of Data: Organizing All Atom Molecular Dynamics Protein Simulation Data into SQL and OLAP Cubes Andrew M. Simms A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2011 Program Authorized to Offer Degree: Medical Education and Biomedical Informatics UMI Number: 3472310 All rights reserved INFORMATION TO ALL USERS The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. UMI' Dissertation Publishing UMI 3472310 Copyright 2011 by ProQuest LLC. All rights reserved. This edition of the work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC 789 East Eisenhower Parkway P.O. Box 1346 Ann Arbor, Ml 48106-1346 University of Washington Graduate School This is to certify that I have examined this copy of a doctoral dissertation by Andrew M. Simms and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made. Chair of the Supervisory Committee: Valerie Daggett' Reading Committee: ^J CtJ..JLA^~JL Valerie Daggett (sA^d^/^y ^PUl« IraP,. Kalet JMer Myler Date: ^'/^v-ZO/r*y. / In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of the dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform." Signature Date m^y 2,2,011 University of Washington Abstract Mining Mountains of Data: Organizing All Atom Molecular Dynamics Protein Simulation Data into SQL and OLAP Cubes Andrew M. Simms Chair of the Supervisory Committee: Professor Valerie Daggett Bioengineering Across scientific disciplines, the ability to generate, collect, and store data has outpaced the ability to make sense of them. Methods and technology exist today for working with ex tremely large data sets, yet the most common data organization paradigm in science is to cre ate files, in some cases millions of files, and store them in file systems. Despite the best intentions, these repositories quickly become disorganized, fragile, and difficult to manage; hindering mining and exploitation of the data they contain. This is fundamentally an infor matics problem, and here I present the design of a very large scale repository to organize and mine molecular dynamics simulation data. TABLE OF CONTENTS Page List of Figures iii List of Tables iv Chapter 1: Introduction 1 Chapter 2: Protein Simulation Data in the Relational Model 3 Introduction 3 A Dimensional Model for MD Simulation Data 5 Relational Design and Implementation 9 SQL Server Implementation 18 Conclusions and Future Directions 30 Chapter 3: Augmenting the Relational Model using Online Analytical Processing 31 Introduction 31 SQL Server Analysis Services 33 Dynameomics OLAP Database Design and Implementation 39 Storage and Calculation Performance Analysis 48 Discussion 51 Conclusions 53 Chapter 4: Beyond the Relational Model: 3D Spatial Hashing 54 Introduction 55 Results 57 Conclusions 64 Methods 65 Chapter 5: Generation of a Consensus Domain Dictionary 73 Introduction 73 Methods 76 Results 82 Discussion 92 Chapter 6: The Molecular Mechanics Parameter Markup Language 94 Introduction 94 Force Field Parameters 95 The MMPL Data Model 96 Validation of elements and relationships 102 MMPL Components and Extending the Parameter Library 104 Conclusions 106 l Chapter 7: Conclusions and Future Directions 108 Paying for Storage Infrastructure 108 Cloud Computing 109 Moving to the Cloud 110 Conclusions 112 Bibliography 113 li LIST OF FIGURES Figure Number Page 1. Star and Snowflake Schemas 4 2. Dimensional Hiearchies and Groups 6 3. Structure Dimension Links 7 4. Directory Schema Diagram 11 5. Simulation and Simulation Group Dimension Tables 13 6. High level view of an Analysis Services Database 34 7. Example MDX Statement 38 8. Dimensions and hierarchies 41 9. MDX lookup query execution times 51 10. MDX calculation query execution times 52 11. Illustration of spatial binning within a periodic box 57 12. 11 metafolds representative of sequence length in Dynameomics 58 13. Contacts query execution times 58 14. Compression and execution times 62 15. Comparison of total execution times and table sizes 63 16. Comparison of compression 64 17. Heavy atom contacts query % 67 18. Cache clearing commands 71 19. Target Selection and Preparation ('Prep') database schema 77 20. Overview of the consensus domain dictionary (CDD) generation process 78 21. Overview of the mapping and target selection process 81 22. Distribution of domain populations between folds and metafolds 84 23. Example metafolds rejected for not being autonomous units 90 24. Structure representatives 91 26. MMPL schema 97 27. Illustration of relationships between structural elements 100 28: Parameter types 102 29. Parameter mask matching algorithm 103 30. A minimal parameter library 105 31. Cloud services and repositories Ill in LIST OF TABLES Table Number Page 1. Unique Simulation Attributes 8 2. Structure Group Type 12 3. Simulation Dimension Attributes and Relational Columns 14 4. Supported and Planned Fact Types 16 5. Shared Identifiers 17 6. Common SQL Server Data Types 21 7. Dimensional Key Column Usage 28 8. Secondary Dimensions for dihedral angles, secondary structure, and OAP state 29 9. Naming rules for coordinate and analysis tables 30 10. Dimensions and attributes 42 11. Measure group definitions 44 12. Measure groups and relationships to cube dimensions 48 13. Test server configuration 49 14. Storage analysis 50 15. Test set definition 59 16. Comparison of average execution times by protein 59 17. Compression comparison 65 18. Test server hardware configuration, hardware and software 70 19. SCOP, CATH, and Dali 83 20. Justifications for rejection for 888 metafolds in the v2009 CDD 90 21. MMPL Elements 98 22. MMPL File Manifest 106 IV ACKNOWLEDGEMENTS I would like to give thanks to my wife, Merianne White, for all her encouragement and support that has made this journey possible. 1 Chapter 1: Introduction Across scientific disciplines, the ability to generate, collect, and store data has outpaced the ability to make sense of them. Methods and technology exist today for working with ex tremely large data sets, yet the most common data organization paradigm in science is to create files, in some cases millions of files, and store them in file systems. Despite the best intentions, these repositories quickly become disorganized, fragile, and difficult to manage; hindering min ing and exploitation of the data they contain. This is fundamentally an informatics problem, and the following chapters detail the design of a very large scale repository to organize and mine molecular dynamics simulation data. The design of large-scale informatics infrastructure begins with a thorough analysis of the underlying data. The goal of this analysis is to discover and establish the interrelationships and boundaries of the data being captured or generated, as the data do not organize themselves. This process is by no means static, and will evolve as hypotheses are generated, tested, and re fined. The urge to rush toward implementing persistence for internal data structures of specific algorithms must be avoided, as this will only result in needless data transformation and code refactoring as new algorithms are developed. Instead, data storage should be designed around conceptual structure of the data and intended paths of analysis. It should then be implemented using the primary objects of the chosen storage engine. Dimensional modeling is an approach to database design that focuses analysis as a pri mary consideration. Originally pioneered as a method to organize large volumes of financial data, it is well suited to scientific data. Chapter 2 describes the dimensional model for protein simulation data and its implementation using a relational database. On-line Analysis Processing (OLAP) is type of database developed specifically to ad dress the needs of data analysis as opposed to managing transactions. In contrast to relational databases, OLAP databases fundamentally store and operate on multi-dimensional data. Chap ter 3 explores OLAP and details the implementation of the Dynameomics data model using the 2 multi-dimensional OLAP feature of SQL Server Analysis Services. Relational databases are general purpose tools and are largely agnostic to the semantics of the data they house. By coding inherent data features into relational primitives, huge per formance gains are possible. Chapter 4 describes spatial indexing, where three-dimensional coordinates are placed into a 1-dimensional index and implemented as a simple foreign key, enabling rapid calculation of contact distances. A single data model is unlikely

Load more