The Genomic Search Engine Toolkit

GENESTATION: THE GENOMIC SEARCH ENGINE TOOLKIT By Mara Kim Dissertation Submitted to the Faculty of the Graduate School of Vanderbilt University In partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY In Biological Sciences October 31, 2018 Approved: Douglas K. Abbot, Ph.D. Antonis Rokas, Ph.D. Ge Zhang, M.D., Ph.D. John A. Capra, Ph.D. Thomas A. Lasko, M.D., Ph.D. Copyright © 2018 by Mara Kim All Rights Reserved ii DEDICATION To my mother, who showed me the wonders of art. To my father, who taught me the beauty of science. To my sisters that made me who I am. To Andrea, who inspires me to be my best. I love you. iii ACKNOWLEDGEMENTS This work was supported by Vanderbilt University, the Gisela Mosig Travel Fund, and the March of Dimes Ohio Collaborative. I would like to thank my collaborators Kris McGary, Antonis Rokas, Patrick Abbot, Ken Petren, Tony Capra, Lou Muglia, Scott Williams, Sashank Nutakki, Jibril Hirbo, Haley Eidem, Julie Phillips, Rohit Venkat, and Brian Cooper. I would also like to recognize Abigail Lind, David Rinker, Jen Wisecaver, Corinne Simonti, and Ling Chen for their help in developing ideas and techniques for this work. I would like to thank Katherine Friedman and Steve Baskauf for their mentorship during my time as both an undergraduate and graduate student at Vanderbilt, as well as Alicia Goostree, Leslie Maxwell, Angela Titus, Christopher Patterson, and LaDonna Smith for their patience and assistance with the logistics of graduate school. I am grateful to my committee, Patrick Abbot, Tony Capra, Tom Lasko, Ge Zhang, and Antonis Rokas, for the chance to do this work and pushing me to accomplish as much as possible. Their ideas and criticism were instrumental in ensuring the vision behind the project became a reality. Finally, I would like to thank my advisor, Antonis Rokas, a fantastic scientist and teacher, for taking me in as an undergraduate researcher. He always believed in me and encouraged me in all my efforts. Without his mentorship and support, this work would not have been possible. iv TABLE OF CONTENTS ACKNOWLEDGEMENTS................................................................................................iv LIST OF TABLES...........................................................................................................viii LIST OF FIGURES..........................................................................................................ix CHAPTER...............................................................................................................1 I. Introduction..........................................................................................................1 A BRIEF HISTORY OF GENOMICS...................................................................1 GEneSTATION...................................................................................................5 THE GENOMIC SEARCH ENGINE....................................................................6 SynTHy and GeneViewer...................................................................................6 REFERENCES...................................................................................................7 II. GEneSTATION 1.0: a synthetic resource of diverse evolutionary and functional genomic data for studying the evolution of pregnancy-associated tissues and phenotypes.........................................................................................................11 ABSTRACT......................................................................................................12 INTRODUCTION..............................................................................................12 DATA SOURCES AND DATA ORGANIZATION................................................15 DATA PRESENTATION....................................................................................17 SYNTHESIS AND ANALYSIS...........................................................................25 DATA ACCESS.................................................................................................26 SUMMARY AND FUTURE PERSPECTIVES...................................................26 MATERIALS AND METHODS..........................................................................27 ACKNOWLEDGEMENTS.................................................................................30 FUNDING.........................................................................................................30 REFERENCES.................................................................................................31 III. GEneSTATION 2.0: the integrative encyclopedia of evolutionary and functional data of human coding and non-coding elements for the study of pregnancy- associated diseases...........................................................................................35 ABSTRACT......................................................................................................35 INTRODUCTION..............................................................................................35 DATA SOURCES AND ORGANIZATION..........................................................37 v DATA PRESENTATION....................................................................................38 SYNTHESIS AND ANALYSIS...........................................................................41 SUMMARY.......................................................................................................41 MATERIALS AND METHODS..........................................................................41 REFERENCES.................................................................................................42 IV. The Genomic Search Engine.............................................................................46 ABSTRACT......................................................................................................46 INTRODUCTION..............................................................................................46 METHODS.......................................................................................................50 DESIGN...........................................................................................................50 PERFORMANCE.............................................................................................54 CONCLUSIONS...............................................................................................56 REFERENCES.................................................................................................57 V. SynTHy: Synthesis and Testing of Hypotheses..................................................59 BACKGROUND................................................................................................59 IMPLEMENTATION..........................................................................................61 RESULTS AND DISCUSSION..........................................................................61 CONCLUSION.................................................................................................62 REFERENCES.................................................................................................62 VI. GeneViewer: Integrative visualisation of the holistic gene..................................64 BACKGROUND................................................................................................64 IMPLEMENTATION..........................................................................................64 RESULTS AND DISCUSSION..........................................................................65 CONCLUSIONS...............................................................................................66 REFERENCES.................................................................................................67 VII. Conclusion.........................................................................................................68 REFERENCES.................................................................................................71 APPENDIX............................................................................................................73 I. Genome Feature Object: Representing Genomic Features in JSON document stores.................................................................................................................73 Introduction......................................................................................................73 vi Mapping...........................................................................................................73 Indexing............................................................................................................78 II. Genestation Command Line Interface................................................................83 Synopsis...........................................................................................................83 Commands.......................................................................................................83 III. Genomic Data Descriptor JSON........................................................................85 INTRODUCTION..............................................................................................85 GENOMIC METADATA KEYS..........................................................................85 GENOMIC DATA KEYS....................................................................................85 vii LIST OF TABLES Time to count the features on human

The Genomic Search Engine Toolkit

Sequence Alignment/Map) Is a Text Format for Storing Sequence Alignment Data in a Series of Tab Delimited ASCII Columns

In Silico Analysis of Single Nucleotide Polymorphisms

Biological Databases

Alternate-Locus Aware Variant Calling in Whole Genome Sequencing Marten Jäger1,2, Max Schubach1, Tomasz Zemojtel1,Knutreinert3, Deanna M

Galaxy Platform for NGS Data Analyses

Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Will a Biological Database Be Different from a Biological Journal? Philip Bourne

COMBINED STUDY on DIGITAL SEQUENCE INFORMATION in PUBLIC and PRIVATE DATABASES and TRACEABILITY Note by the Executive Secretary 1

Protein-Protein Interactions Prediction Based on Iterative Clique Extension with Gene Ontology Filtering

Genomics in the UK an Industry Study for the Office of Life Sciences

Genomic Databases and Softwares: in Pursuit of Biological Relevance