Nucleic Acid High-Throughput Sequencing Studies Present Unique Challenges in Analysis and Interpretation

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Kenji Oman, B.S., M.S.
Graduate Program in Physics
The Ohio State University
2015

Dissertation Committee:
Dr. Ralf Bundschuh, Advisor
Dr. Kurt Fredrick
Dr. Richard Furnstahl
Dr. Michael Poirier

© Copyright by Kenji Oman 2015

Abstract

Since the discovery of nucleic acids and of their significance as information carriers in the cell, and especially with the development of high-throughput sequencing (HTS) techniques, molecular biology has seen ever-increasing advances in our understanding of the mechanisms of life. Here we first present a brief overview of the progression of our understanding of nucleic acids, and of the current HTS techniques used to study them. We then investigate the interaction of the methyl-binding domain (MBD) with methylated DNA, as used in MethylCap-seq (an HTS technique), and present a model for their interaction, together with a Bayesian model that uses this increased understanding to predict methylation levels in samples with an unknown methylation profile. We next introduce an HTS analysis pipeline we have developed, and examine the use of this pipeline in the analysis of 5′-end seq data, ultimately leading to its abandonment. Finally, we present a further application of the pipeline in our investigation of LepA's role in translation initiation and elongation in E. coli.

Dedication

To my parents, who always told me I could.

Acknowledgments

There are many people who have helped me get to the point of a Ph.D. First and foremost, I would like to thank my parents and family for their words of encouragement and support through every step of my education. Their influence continues to be felt. I would also like to thank my many teachers and professors. Through their efforts, I have learned a little something of the world, and my eyes have been opened to the complexities of nature.

Finally, I would like to thank those who have had a direct hand in my training as a scientist. First, I must thank my advisor, Dr. Ralf Bundschuh. Through our many interactions, his guidance, and his patience with me, I have grown from knowing next to nothing about biology, programming, and data analysis to gaining a grasp of each. His example, balancing work and family, is an inspiration to me.

I must also thank my fellow graduate students working with Prof. Bundschuh: Cai Chen, Yi-Hsuan Lin, Billy Baez, Blythe Moreland, and Dengke Zhao. Through our interactions, I have gained a broader appreciation of the variety of biophysics questions and techniques. I would also like to thank Catharine Shipps for her diligent effort in helping with our research, as well as Ryan Mangelson, who helped with another one of our projects. I would also like to thank Dr. Michael Poirier and the students of his group for our weekly group meetings; they have been most informative to me as an introduction to some of the challenges of experimental biophysics, and have been a great means of giving me presentation practice.

There are also our collaborators, without whose work I would have had no data to work with and would have had to do a very different Ph.D. I also appreciate the many conversations we had in our weekly meetings, and their patience with me as I better learned the biology of our projects. We have the PIs, Drs. Pearlly Yan, Kurt Fredrick, and Daniel Schoenberg, as well as the post-docs, Drs. Dan Kiss, Chandrama Mukherjee, and Bappa Roy, and graduate students Rohan Balakrishnan, Jackson Trotman, and David Frankhouser.

Finally, there is my committee, including my advisor Prof. Bundschuh, who read through all parts of my manuscript and provided numerous comments and suggestions for improvement, as well as Drs. Kurt Fredrick, Richard Furnstahl, and Michael Poirier. I thank them for reading through my dissertation and for their patience with me as I have gone through the process of dissertation writing and defense. Despite their considerable help, I am certain errors remain, which are fully my own.

Vita

May, 2009: B.S., Carnegie Mellon University, Pittsburgh, PA
August, 2012: M.S., The Ohio State University, Columbus, OH

Publications

Rohan Balakrishnan, Kenji Oman (co-first author), Shinichiro Shoji, Ralf Bundschuh, Kurt Fredrick. The conserved GTPase LepA contributes mainly to translation initiation in Escherichia coli. Nucl. Acids Res., 42:13370-13383 (2014).

Daniel L. Kiss, Kenji Oman, Ralf Bundschuh, Daniel R. Schoenberg. Uncapped 5′ ends of mRNAs targeted by cytoplasmic capping map to the vicinity of downstream CAGE tags. FEBS Letters 3:279-284 (2015).

Daniel L. Kiss, Kenji Oman, Julie A. Dougherty, Chandrama Mukherjee, Ralf Bundschuh, Daniel R. Schoenberg. Cap homeostasis is independent of poly(A) tail length. Nucl. Acids Res., in review.

Blythe Moreland, Kenji Oman (co-first author), Pearlly Yan, Ralf Bundschuh. Methyl-CpG MBD2 interaction requires minimum separation and exhibits minimal sequence specificity. Biophys. J., in preparation.

Fields of Study

Major Field: Physics
Studies in Nucleic Acids: Dr. Ralf Bundschuh

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Figures
List of Tables
List of Abbreviations

Chapters

1 An Introduction to Nucleic Acids and Modern Sequencing Techniques
  1.1 An overview of nucleic acids
    1.1.1 Discovery of information transfer through DNA
    1.1.2 Structure of DNA and RNA
    1.1.3 DNA replication, the central dogma, and the protein code
  1.2 DNA/RNA Sequencing
    1.2.1 Modern sequencing techniques
    1.2.2 Applications of next-generation sequencing
    1.2.3 Challenges of next-generation sequencing
  1.3 Scientific contributions to the field
  1.4 Conclusions

2 MBD-DNA Interactions as Probed through HTS
  2.1 Introduction
    2.1.1 MBD background
  2.2 Methods
    2.2.1 Pre-Data analysis
    2.2.2 Preliminary priming and questions asked
    2.2.3 Library analysis workflow overview
    2.2.4 Question 1a: Genomic CpG content vs input, examining protocol bias
    2.2.5 Question 1b: Genomic G/C content vs input, examining protocol bias
    2.2.6 Analysis overview for remaining questions
    2.2.7 Model-Building
    2.2.8 Model Predictions
  2.3 Results/Discussion
    2.3.1 Sequencing introduces a G/C content bias
    2.3.2 MBD binding to methylated CpG shows no significant position dependence
    2.3.3 MBD binding to two CpGs simultaneously requires minimum separation, and shows reduced binding at an intermediate level of separation
    2.3.4 MBD binding to 3 CpGs shows similar pairwise separation dependence as for the 2 CpG case
    2.3.5 MBD binding multiple CpGs shows unexpected pulldown behavior
    2.3.6 Model fitting to data
    2.3.7 Utilizing the Bayesian model
  2.4 Conclusion/Future Work

3 An Overview of HTS Analysis Pipeline/Tools Developed with a Case Study in the Analysis of 5′-end Sequencing Data
  3.1 Introduction
  3.2 Pipeline/tools developed
    3.2.1 Read sequencing
    3.2.2 Raw to aligned
    3.2.3 Computational removal of rRNA reads
    3.2.4 BAM file quality controls
    3.2.5 Genomic coverage summary
    3.2.6 Normalizations
    3.2.7 Coverage per position visualization techniques
    3.2.8 Differential expression analysis
    3.2.9 Local expression variability
    3.2.10 Analysis of local expression variability
  3.3 Capping of RNA: a regulator of transcript life cycle
    3.3.1 Cell culture preparation
    3.3.2 5′-end seq workflow overview
    3.3.3 Application of Methods to 5′-end Seq
    3.3.4 Results/Discussion
  3.4 Conclusions

4 An Investigation of LepA's Function in E. coli
  4.1 Introduction: Background on LepA
    4.1.1 LepA is highly conserved, and yet not well understood
    4.1.2 Synthetic phenotypes exhibited by ∆lepA
    4.1.3 Deletion of the active-site histidine or the unique C-terminal domain (CTD) in LepA fails to complement the synthetic phenotypes
    4.1.4 Examining LepA's effect on the transcriptome and translatome
  4.2 Methods
    4.2.1 LepA investigations
  4.3 Results
    4.3.1 Without LepA, many mRNA coding regions exhibit reduced average ribosome density (ARD)
    4.3.2 LepA's effect on ARD is related to the sequence of the translation initiation region (TIR)
    4.3.3 LepA's effect on ribosome distribution along mRNAs
  4.4 Discussion
