Gene Network Inference Via Sequence Alignment And

Gene Network Inference via Sequence Alignment and Rectification by Philippe Christophe Faucon A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Approved August 2017 by the Graduate Supervisory Committee: Huan Liu, Chair Xiao Wang Sharon Crook Yalin Wang Hessam Sarjoughian ARIZONA STATE UNIVERSITY December 2017 ©2017 Philippe Christophe Faucon All Rights Reserved ABSTRACT While techniques for reading DNA in some capacity has been possible for decades, the ability to accurately edit genomes at scale has remained elusive. Novel techniques have been introduced recently to aid in the writing of DNA sequences. While writing DNA is more accessible, it still remains expensive, justifying the increased interest in in silico predictions of cell behavior. In order to accurately predict the behavior of cells it is necessary to extensively model the cell environment, including gene-to-gene interactions as completely as possible. Significant algorithmic advances have been made for identifying these interactions, but despite these improvements current techniques fail to infer some edges, and fail to capture some complexities in the network. Much of this limitation is due to heavily underdetermined problems, whereby tens of thousands of variables are to be inferred using datasets with the power to resolve only a small fraction of the variables. Additionally, failure to correctly resolve gene isoforms using short reads contributes significantly to noise in gene quantification measures. This dissertation introduces novel mathematical models, machine learning techniques, and biological techniques to solve the problems described above. Mathematical models are proposed for simulation of gene network motifs, and raw read simulation. Machine learning techniques are shown for DNA sequence matching, and DNA sequence correction. Results provide novel insights into the low level functionality of gene networks. Also shown is the ability to use normalization techniques to aggregate data for gene network inference leading to larger data sets while minimizing increases in inter- experimental noise. Results also demonstrate that high error rates experienced by third generation sequencing are significantly different than previous error profiles, and i that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for amending this DNA error that preserve the benefits of third generation sequencing. ii I dedicate this work to my parents Arlene and Philippe, who inspire me to be the best version of myself; to my sister Anni and my brother Pierre, who inspire me to pursue what I love; and to my many excellent teachers and mentors throughout the years, who inspire me to learn and to teach. iii ACKNOWLEDGMENTS I would like to thank my advisors, Huan Liu and Xiao Wang for their support and guidance throughout my PhD. Without their mentoring, and at times corralling, I would almost certainly still be working on a million small projects without any significant progress. I especially appreciate their willingness to let me fully explore my research interests, and their forthrightness when I set my mind to a task that they believed would not be fruitful. I would also like to thank the rest of my committee; Sharon Crook who introduced me to bioinformatics during my undergraduate years, and continued to guide me during my graduate research, along with Yalin Wang and Hessam Sarjoughian, who provided valuable suggestions on my research directions. I am remarkably fortunate to have worked with so many amazing colleagues in both the Xiao Lab and Data Mining and Machine Learning(DMML) lab at ASU. Over my graduate studies I’ve had the pleasure of working with many exceptional researchers: Parithi Balachandran, Ghazaleh Beigi Xingwen Chen, Ryan Dougherty, Hao Hu, Isaac Jones, Jundong Li, David Menn, Fred Morstatter, Tahora Nazer, Suhas Ranganath, Justin Sampson, Kai Shu, Kylie Standage-Beier, Ri-Qi Su, Robert Trevino, Lezhi Wang, Suhang Wang, Fuqing Wu, Liang Wu, Tsung-Yen (John) Yu, and Qi Zhang. Above all I would like to thank my sidekick Parithi Balachandran, an excellent research colleague and friend, and without whom many research projects would not have been possible. I am also incredibly grateful to Robert Trevino for his assistance with for his research insights, and for introducing me to conference paper writing. Having a life outside of research, and the support of friends is critical to enjoying and succeeding in grad school. On that note I would especially like to thank Parithi Balachandran, Michael Berg, Mario Giacomazzo, Hao Hu, David Menn, Fred Morstat- ter, James Palazzolo, Jose Perez, Kim Phan, Brianna Rudow, Jeff Semmens, and iv Robert Trevino and the ASU fencing team. Without their support and friendship I would likely have left the university long before completing my degree. I would also like to acknowledge my family, who supported me through the most difficult parts of my studies and provided a constant stream of encouragement. My grandparents Berit and Anker, my parents Arlene and Philippe, and my siblings Anni and Pierre, have all played a monumental role in shaping me into the person I am today. I could not be more grateful for the constant support they have given me, and I hope that I have, do, and will continue to make them proud. Learning to learn is one of the most difficult tasks that we as people, and especially as researchers have to undertake. We have to find techniques for locating relevant and trustworthy sources, sorting through their available knowledge, and distilling it down to what we can remember and make use of. Through my studies I was very fortunate to meet mentors that inspired me to learn, and showed me the joy in learning. Likewise while teaching I had outstanding students that showed me joy in teaching, and inspiring others to learn and achieve great things. In specific I would like to thank Karen Glazier, the first (and only) teacher to fail me, for reminding me that regardless of how much you think you know you can still fail if you don’t apply yourself. v TABLE OF CONTENTS Page LIST OF TABLES . ix LIST OF FIGURES . x CHAPTER 1 GENE NETWORK MOTIFS . 1 1.1 Results . 4 1.1.1 Network Enumeration and Parameter Scanning for Multi- stability . 4 1.1.2 Complete Auto-Activation Contributes to Multistability . 7 1.1.3 Sensitivity and bifurcation of multistability . 8 1.1.4 Stochastic State Transitioning . 12 1.1.5 SSS Analysis . 15 1.1.6 A Regulatory Network for Pluripotency . 17 1.1.7 In Silico Induced Pluripotency . 23 1.2 Discussion . 27 1.3 Materials and methods . 31 1.3.1 Enumerating and Eliminating Redundant Networks . 31 1.3.2 ODE Modeling and High-Throughput Screening . 31 1.3.3 Analysis of the Parameter Space . 33 1.3.4 Bifurcation Analysis . 34 1.3.5 Stochastic Simulations . 34 1.3.6 Statistical Tests . 35 2 GENE NETWORK MODELING . 36 2.1 Network Models . 37 vi CHAPTER Page 2.1.1 Boolean Networks . 38 2.1.2 Bayesian and Neural Networks . 40 2.1.3 Differential Equation Networks . 42 2.2 RNA Sequencing . 44 2.3 Data Aggregation . 44 2.3.1 Acquiring the Data . 47 2.3.2 Noise Minimization . 49 2.3.3 Future Work . 51 3 NANOPORE READ SIMULATION . 52 3.1 Introduction . 52 3.2 Related Work . 55 3.3 Problem Description . 57 3.3.1 Error Source Identification . 59 3.3.2 Error Features . 59 3.3.3 Read Simulation. 60 3.4 Results . 63 3.4.1 Modeling K-mer Bias . 63 3.4.2 Identifying Bias Sources. 65 3.4.3 Simulation Results. 65 3.4.4 Simulator Complexity. 67 3.5 Conclusion . 69 4 READ CORRECTION . 71 4.1 Introduction . 71 4.2 Related Works . 73 vii CHAPTER Page 4.3 Problem Description . 75 4.4 Proposed Method . 76 4.4.1 High-Accuracy K-mers . 76 4.4.2 Utilization of K-means . 79 4.4.3 Synthetic Data Generation . 79 4.5 Results . 80 4.5.1 Clustering on Synthetic Data . 80 4.6 Conclusion . 81 5 CONCLUSIONS . 83 5.1 Methodological Contributions . 83 5.2 Future Work . 84 NOTES .............................................................. 85 REFERENCES . 86 viii LIST OF TABLES Table Page 1. Overview of RNA-Seq Experiments Analyzed with a Listing of Tool and Asset Versions. 45 2. Identified Error Sources and Their Contribution to K-Mer Overall Accuracy with Respect to the Bias Source . ..

Gene Network Inference Via Sequence Alignment And

Introduction to Nutrigenetics, Nutrigenomics and Epigenetics

6615 Sofiva Genomics

A Murine Atlas of Age-Related Changes in Intercellular Communication Inferred with the Package Scdiffcom

Circulating Exosomes Potentiate Tumor Malignant Properties in a Mouse Model of Chronic Sleep Fragmentation

Lifelong Restriction of Dietary Branched-Chain Amino Acids Has Sex-Specific Benefits for Frailty and Life Span in Mice

Dissecting the Genetic Etiology of Lupus at ETS1 Locus

Critical Physiological and Pathological Functions of Forkhead Box O Tumor Suppressors

Annual Scientific Meeting 23 – 27 March 2018, Adelaide Convention and Exhibition Centre

UCLA Electronic Theses and Dissertations

Loss of Tip30 Accelerates Pancreatic Cancer Progression

Linking Human-Genetic and Host-Microbiome Associations to Molecular Mechanisms Using Protein-Protein Interaction Networks

Multivariate Analysis of Prokaryotic Amino Acid Usage Bias: a Computational Method for Understanding Protein Building Block Selection in Primitive Organisms