Gene Network Inference via

Sequence Alignment and Rectification

by

Philippe Christophe Faucon

A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Approved August 2017 by the Graduate Supervisory Committee:

Huan Liu, Chair Xiao Wang Sharon Crook Yalin Wang Hessam Sarjoughian

ARIZONA STATE UNIVERSITY

December 2017 ©2017 Philippe Christophe Faucon All Rights Reserved ABSTRACT While techniques for reading DNA in some capacity has been possible for decades, the ability to accurately edit genomes at scale has remained elusive. Novel techniques have been introduced recently to aid in the writing of DNA sequences. While writing DNA is more accessible, it still remains expensive, justifying the increased interest in in silico predictions of cell behavior. In order to accurately predict the behavior of cells it is necessary to extensively model the cell environment, including gene-to-gene interactions as completely as possible. Significant algorithmic advances have been made for identifying these interactions, but despite these improvements current techniques fail to infer some edges, and fail to capture some complexities in the network. Much of this limitation is due to heavily underdetermined problems, whereby tens of thousands of variables are to be inferred using datasets with the power to resolve only a small fraction of the variables. Additionally, failure to correctly resolve gene isoforms using short reads contributes significantly to noise in gene quantification measures. This dissertation introduces novel mathematical models, machine learning tech- niques, and biological techniques to solve the problems described above. Mathematical models are proposed for simulation of gene network motifs, and raw read simula- tion. Machine learning techniques are shown for DNA sequence matching, and DNA sequence correction. Results provide novel insights into the low level functionality of gene networks. Also shown is the ability to use normalization techniques to aggregate data for gene network inference leading to larger data sets while minimizing increases in inter- experimental noise. Results also demonstrate that high error rates experienced by third generation sequencing are significantly different than previous error profiles, and

i that these errors can be modeled, simulated, and rectified. Finally, techniques are provided for amending this DNA error that preserve the benefits of third generation sequencing.

ii I dedicate this work to my parents Arlene and Philippe, who inspire me to be the best version of myself; to my sister Anni and my brother Pierre, who inspire me to pursue what I love; and to my many excellent teachers and mentors throughout the years, who inspire me to learn and to teach.

iii ACKNOWLEDGMENTS I would like to thank my advisors, Huan Liu and Xiao Wang for their support and guidance throughout my PhD. Without their mentoring, and at times corralling, I would almost certainly still be working on a million small projects without any significant progress. I especially appreciate their willingness to let me fully explore my research interests, and their forthrightness when I set my mind to a task that they believed would not be fruitful. I would also like to thank the rest of my committee; Sharon Crook who introduced me to bioinformatics during my undergraduate years, and continued to guide me during my graduate research, along with Yalin Wang and Hessam Sarjoughian, who provided valuable suggestions on my research directions. I am remarkably fortunate to have worked with so many amazing colleagues in both the Xiao Lab and Data Mining and Machine Learning(DMML) lab at ASU. Over my graduate studies I’ve had the pleasure of working with many exceptional researchers: Parithi Balachandran, Ghazaleh Beigi Xingwen Chen, Ryan Dougherty, Hao Hu, Isaac Jones, Jundong Li, David Menn, Fred Morstatter, Tahora Nazer, Suhas Ranganath, Justin Sampson, Kai Shu, Kylie Standage-Beier, Ri-Qi Su, Robert Trevino, Lezhi Wang, Suhang Wang, Fuqing Wu, Liang Wu, Tsung-Yen (John) Yu, and Qi Zhang. Above all I would like to thank my sidekick Parithi Balachandran, an excellent research colleague and friend, and without whom many research projects would not have been possible. I am also incredibly grateful to Robert Trevino for his assistance with for his research insights, and for introducing me to conference paper writing. Having a life outside of research, and the support of friends is critical to enjoying and succeeding in grad school. On that note I would especially like to thank Parithi Balachandran, Michael Berg, Mario Giacomazzo, Hao Hu, David Menn, Fred Morstat- ter, James Palazzolo, Jose Perez, Kim Phan, Brianna Rudow, Jeff Semmens, and

iv Robert Trevino and the ASU fencing team. Without their support and friendship I would likely have left the university long before completing my degree. I would also like to acknowledge my family, who supported me through the most difficult parts of my studies and provided a constant stream of encouragement. My grandparents Berit and Anker, my parents Arlene and Philippe, and my siblings Anni and Pierre, have all played a monumental role in shaping me into the person I am today. I could not be more grateful for the constant support they have given me, and I hope that I have, do, and will continue to make them proud. Learning to learn is one of the most difficult tasks that we as people, and especially as researchers have to undertake. We have to find techniques for locating relevant and trustworthy sources, sorting through their available knowledge, and distilling it down to what we can remember and make use of. Through my studies I was very fortunate to meet mentors that inspired me to learn, and showed me the joy in learning. Likewise while teaching I had outstanding students that showed me joy in teaching, and inspiring others to learn and achieve great things. In specific I would like to thank Karen Glazier, the first (and only) teacher to fail me, for reminding me that regardless of how much you think you know you can still fail if you don’t apply yourself.

v TABLE OF CONTENTS Page LIST OF TABLES ...... ix LIST OF FIGURES ...... x CHAPTER 1 GENE NETWORK MOTIFS ...... 1 1.1 Results ...... 4 1.1.1 Network Enumeration and Parameter Scanning for Multi- stability ...... 4 1.1.2 Complete Auto-Activation Contributes to Multistability . . . . 7 1.1.3 Sensitivity and bifurcation of multistability ...... 8 1.1.4 Stochastic State Transitioning ...... 12 1.1.5 SSS Analysis ...... 15 1.1.6 A Regulatory Network for Pluripotency ...... 17 1.1.7 In Silico Induced Pluripotency ...... 23 1.2 Discussion ...... 27 1.3 Materials and methods ...... 31 1.3.1 Enumerating and Eliminating Redundant Networks ...... 31 1.3.2 ODE Modeling and High-Throughput Screening ...... 31 1.3.3 Analysis of the Parameter Space ...... 33 1.3.4 Bifurcation Analysis ...... 34 1.3.5 Stochastic Simulations ...... 34 1.3.6 Statistical Tests ...... 35 2 GENE NETWORK MODELING ...... 36 2.1 Network Models ...... 37

vi CHAPTER Page 2.1.1 Boolean Networks ...... 38 2.1.2 Bayesian and Neural Networks ...... 40 2.1.3 Differential Equation Networks ...... 42 2.2 RNA Sequencing ...... 44 2.3 Data Aggregation ...... 44 2.3.1 Acquiring the Data ...... 47 2.3.2 Noise Minimization ...... 49 2.3.3 Future Work ...... 51 3 NANOPORE READ SIMULATION ...... 52 3.1 Introduction ...... 52 3.2 Related Work ...... 55 3.3 Problem Description ...... 57 3.3.1 Error Source Identification ...... 59 3.3.2 Error Features ...... 59 3.3.3 Read Simulation...... 60 3.4 Results ...... 63 3.4.1 Modeling K-mer Bias ...... 63 3.4.2 Identifying Bias Sources...... 65 3.4.3 Simulation Results...... 65 3.4.4 Simulator Complexity...... 67 3.5 Conclusion ...... 69 4 READ CORRECTION ...... 71 4.1 Introduction ...... 71 4.2 Related Works ...... 73

vii CHAPTER Page 4.3 Problem Description ...... 75 4.4 Proposed Method ...... 76 4.4.1 High-Accuracy K-mers ...... 76 4.4.2 Utilization of K-means ...... 79 4.4.3 Synthetic Data Generation ...... 79 4.5 Results ...... 80 4.5.1 Clustering on Synthetic Data ...... 80 4.6 Conclusion ...... 81 5 CONCLUSIONS ...... 83 5.1 Methodological Contributions ...... 83 5.2 Future Work ...... 84 NOTES ...... 85 REFERENCES ...... 86

viii LIST OF TABLES Table Page 1. Overview of RNA-Seq Experiments Analyzed with a Listing of Tool and Asset Versions...... 45 2. Identified Error Sources and Their Contribution to K-Mer Overall Accuracy with Respect to the Bias Source ...... 64 3. Sum of Squares Error (SSE) between K-Mer Simulators and Their Respective Training Sets, and between the Fragmented R9 Data Set and the Complete R9 Data Set ...... 67 4. memory and Wall-Clock Time Consumption of SNaReSim under Various Conditions ...... 68 5. Comparison of MHAP and K-Means Clusters over the Engineered Feature Space, the Number of K-Means Clusters Was Set to the Number of Compo- nents Identified by MHAP ...... 80

ix LIST OF FIGURES Figure Page 1. Multistability Arises from Small Gene Networks and Underlies Cell Differ- entiation. Fully-Connected Triads (FCTs) Are Important, Recurring Tran- scriptional Networks in the Development and Maintenance of Cellular States. Notably, the Oct4-Sox2-Nanog Triad Has Been Implicated in Inducing and Maintaining Pluripotency. In the Waddington Model for Cell Differentiation, the Cell’s Underlying Developmental Landscape Is Governed by the Dynamical Potential of These Small Gene Networks to Realize Multi- stability. Transition from One State to Another Can Be Guided via Altered Topology or Strength of Wiring...... 1 2. High-Throughput Computational Screening for Multistable Networks. (A) FCTs Were Enumerated and Modeled by ODEs (Parameters Denoted as P). (B) The Multistability Probability of 104 FCTs Generated by the Computa- tional Screen Is Plotted, with a High Multistability Probability FCTs Marked with an X with the Color Representing the Multistability Group; the Two Highest Probability Groups Are Labeled by Their Indices. (C) The Fifteen FCTs with the Highest Multistable Potentials Are Represented Visually. . . . . 6

x Figure Page 3. Bifurcation and Stochastic Analysis of Network 84. (A) The Topology of Network 84 Was Predicted to Have the Highest Probability for Multistability. (B) The Bifurcation Diagram of Network 84 Plots Transcription Factor Concentrations of X and Y at Each SSS against α, the Self-Activation Strength. The Mutual Inhibition Parameters Are Also Set Equal and Referred to as β, Here Set to 0.1 and α Is Used as the Bifurcation Parameter Ranging between .01 and 100. SSS Are Color Coded and Listed in the Legend to Distinguish Different SSS, Where ‘+’ Indicates the Gene Is ‘ON’ and ‘- ’ Means ‘OFF’. (C) Simulations of Noise-Induced State Transitioning in Network 84 under Different Levels of Noise. Simulations Were Performed with Auto-Regulation Equal to 1 and Mutual Inhibition Equal to 0.1, with Noise Levels of 0.5, 0.85, and 1 from left to right. The Locations of the Deterministically Calculated States Are Indicated with Red Spheres, with Their Size Correlated to Their Spectral Radius. The Blue Ribbons Indicate the Temporal Trajectories of the Simulations. The Black Arrows Indicate Initial Direction of State Transitions. (D) The Number of States Traveled in the Stochastic Simulations Plotted versus Noise Strength. The Red Line Represents Simulations that Were Initiated from the All-ON State, and the Blue Line Represents Simulations that Were Initiated from the Origin...... 9

xi Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting the Number of Stable Steady States Generated by Combinations of Differ- ent Auto-Regulatory Strengths (Rows) and Mutual-Regulatory Strengths (Columns). The Green and Orange Regions Are Visualized as FCT Diagrams Independently in B. (B) The Green-Boxed and Orange-Boxed Diagrams Corresponds to the Green-Boxed and Orange-Boxed Parameter Combina- tion Regime from (A). (C) Many of the Parameter Combinations Yielding Multi-Stable Systems Are Not Represented by the Matrix, and Are instead Abstracted Here. As an Example, Here We Present a Parameter Combination Regime that Can Support 6 SSS and Auto-Activation Strengths Are Similar to Those in the Orange Box (Thick Red Arrows) but Have One Unrestrained Pair of Mutual Inhibition that Can Exist at a Very High Strength (Blue T-Head Repression Lines)...... 16

xii Figure Page 5. Bifurcation and Stochastic Analysis of Network 1. (A) Network 1 Shares the Topological Architecture of the Oct4-Sox2-Nanog Triad, and Is Believed to Underlie Stem Cell Differentiation and Reprogramming. (B) The Bifurcation Diagram of Network 84 Plots Transcription Factor Concentrations of X and Y at Each SSS against α, the Self-Activation Strength. The Mutual Inhibition Parameters Are Also Set Equal and Referred to as β, Here Set to 0.1 and α Is Used as the Bifurcation Parameter Ranging between .01 and 100. SSS Are Color Coded and Listed in the Legend to Distinguish Different SSS, Where ‘+’ Indicates the Gene Is ‘ON’ and ‘-’ Means ‘OFF’. (C) Simulations of Noise- Induced State Transitioning in Network 1 under Different Levels of Noise. Simulations Were Performed with Auto-Regulation Equal to 1 and Mutual Inhibition Equal to 0.1, with Noise Levels of 0.5, 0.85, and 1 from left to right. The Locations of the Deterministically Calculated States Are Indicated with Red Spheres, with Their Size Correlated to Their Spectral Radius. The Blue Ribbons Indicate the Temporal Trajectories of the Simulations. The Black Arrows Indicate Initial Direction of State Transitions. (D) The Number of States Traveled in the Stochastic Simulations Plotted versus Noise Strength. The Red Line Represents Simulations that Were Initiated from the All-ON State, and the Blue Line Represents Simulations that Were Initiated from the Origin...... 18

xiii Figure Page 6. Stable Steady State Analysis of Network 1. (A) A Matrix Presenting the Num- ber of Stable Steady States Generated by Combinations of Different Auto- Regulatory Strengths (Rows) and Mutual-Regulatory Strengths (Columns). The Green and Orange Regions Are Visualized as FCT Diagrams Indepen- dently in B. (B) The Green-Boxed and Orange-Boxed Diagrams Corresponds to the Green-Boxed and Orange-Boxed Parameter Combination Regime from (A). (C) Many of the Parameter Combinations Yielding Multi-Stable Systems Are Not Represented by the Matrix, and Are instead Abstracted Here. As an Example, Here We Present a Parameter Combination Regime that Can Support 6 SSS and Auto-Activation Strengths Are Similar to Those in the Orange Box (Thick Red Arrows) but Have One Unrestrained Pair of Mutual Inhibition that Can Exist at a Very High Strength (Blue T-Head Repression Lines)...... 19

xiv Figure Page 7. Bifurcation and Stochastic Analysis of Network 1 with Constitutive Basal Expression. (A) The Topology of Network 1 Is Used to Mimic IPSC Exper- iments with Added Constitutive Expression of All Three Genes. (B) The X and Y Protein Concentrations of Stable and Unstable Steady States Are Plotted against Self-Activation Strengths. With the Addition of the Consti- tutive Expression We Can See a Further Decrease in Stabilities, Specifically Two-ON States Are Never Stable. Additionally One-ON States Lose Stability Very Rapidly, Approximately at α=1, and the All-OFF State Destabilizes after α=9 Leaving Only the All-ON State. This Can Also Be Seen in the Sizes of the Spectral Radii, Deterministically These Steady States Exist but in Practice They Have Very Little Influence. (C) Simulations of Network 1 Modeled in the Presence of an Additional Constitutive Overexpression Term Consistently Show a Transition from an All-OFF State to an All-ON State after a Period of Latency...... 24 8. Stochastic Simulations Capture Latency Distributions in Induced Pluripo- tency. (A) Distribution of Simulated Latency for the Modified Network 1 Model Illustrates a Skewed Bell Shape. The Histogram Was Generated from 2000 Simulations. (B) Similar Results Generated from 2000 Simulations with Network84...... 26

xv Figure Page 9. Two Fully Connected(A/B), and Two Partially Connected(C/D) Triads Are Depicted. (A/C) Illustrate Only Positive Network Regulation (Activation), Indicated by Red Arrows, I.e. Increasing Quantities of the Transcription Factor at the Tail of the Arrow Also Increases Quantities of the Transcription Factor at the Head. (B/D) Utilize Both Positive and Negative Network Regulation (Repression), Indicated by the Blue Bars, I.e. Increasing the Quantities of the Transcription Factor at the Tail of the Error Decreases Quantities of the Transcription Factor at the Head...... 38 10.In This Figure We Present the Data Processing Pipeline from 5 Other ScRNA- Seq Experiments Available on NCBI GEO. GSE59127 and GSE59129 (left, Blue), GSE 52583 (left Center, Red), GSE21180(Center, Olive), GSE45284 (right Center, Orange) and Our Pipeline (right)...... 49 11.Scatterplots Showing Ratios of Coefficient of Variation(CV), Where Points Represent CV of a Single Gene within the Original Author’s Pipeline, and Ours. The X=y Line Drawn, Representing the Point Where CV Is Equal between Both Pipelines. Points on the X>y Side Are Drawn in Red, and Point on the Y>x Side Are Drawn in Black. For the Experiments in (A- C), Raw ScRNA-Seq Reads Are Provided by the NCBI GEO Service and Analyzed Using Our Customized Pipeline. In Addition to Raw Reads Each Author Provides Estimates of Gene Activity Levels through FPKM or TPM. In (A-C) We Plot the Coefficient of Variation (CV) of Our Pipeline Results against Those of the Original Authors. In (D) We Group the Experiments from (A-C) to Visualize Inter-Experimental CV...... 50

xvi Figure Page 12.The Accuracies of All 6-Mers with Accuracy on the Vertical Axis, and a Kmer Index on the Horizontal Axis. (A) All Experiments Sorted with Respect to Their Accuracy. (B) Experiments Sorted with Respect to the Escherichia Coli R9 2D Experiment. (C) The R9 E. Coli Template Experiment Divided Randomly into 5 Groups, Sorted with Respect to the First Group. (D) Only the R9 E. Coli Template Experiment, Aligned with Both Graphmap and BWA-MEM, Sorted with Respect to BWA-MEM...... 58 13.A Plot of the Non-Negative Least Squares (Nnls) Fit of the Predicted Features Compared to the Actual Distribution from the R9 E. Coli 2D Experiment, Sorted by the True Experiment. (A) All Model Feautures and the True Accuracy of Neighboring K-Mers as a Feature. (B) Only the Features that Are Fully Predictable from the Model, as Described in Bias Souces Table. . . . 64 14.Comparison of 3 Previously Published Sequencing Simulators with the New Model We Proposed. Fit of R9 Pore Model K-Mer Accuracies Sorted Inter- nally (A), and by 2D E. Coli, the Training Model (B). Fit of R7 Pore Model K-Mer Accuracies Sorted Internally (C), and by 2D E. Coli, the Training Model (D)...... 66

xvii Figure Page 15.(A) Strands of DNA Bases Are Added to the Nanopore Sequencer as an Analog Data Input (B) As Strands Pass through the Nanopores Electrical Current Readings Are Taken. These Readings Are Consequently Grouped into ‘‘Events”, with Each Event Approximately Representing a K-Mer (Sequence of Bases). These Events May Be Improperly Split, Events May Have a Mean Value Higher or Lower than Expected for a DNA K-Mer, and Event Lengths May Be Different. (C) A Base Caller Uses the Sequence of Events to Predict the Original Sequence of Bases...... 73 16.(A) DNA Sequences Can Be Interpreted as a Sequence of K-Mers, Each K- Mer Can Be Replaced by Its Expected Mean Current to Provide a Mapping from DNA Sequences into a High-Dimensional Current Space. (B) Clusters of 2-Dimensional Points (Sequence Length 7), Clusters Are Found Using DBSCAN with Neighborhood Size 1 and Epsilon 2...... 77

xviii Chapter 1

GENE NETWORK MOTIFS

Figure 1. Multistability arises from small gene networks and underlies cell differentiation. Fully-connected triads (FCTs) are important, recurring transcriptional networks in the development and maintenance of cellular states. Notably, the Oct4-Sox2-Nanog triad has been implicated in inducing and maintaining stem cell pluripotency. In the Waddington model for cell differentiation, the cell’s underlying developmental landscape is governed by the dynamical potential of these small gene networks to realize multistability. Transition from one state to another can be guided via altered topology or strength of wiring.

Embryonic stem cells have the capacity to differentiate into more than 200 isogenic progenitor and terminal cell types [1], each of which represents a stable cellular state. The potential for a system to realize multiple states is termed multistability, which has been proposed as the mechanism underlying cell differentiation for over half a century, with stable states represented as either valleys in the developmental landscape

1 [2], [3] or as dynamic attractors in high-dimensional gene expression space [4]–[6]. Bistability is the simplest form of multistability, which has been studied extensively both in natural contexts [7] and in synthetic systems [8]–[13]. It has been shown to be responsible for both normal cell fate determination in Xenopus oocyte maturation [14] and oncogenesis [15]. Additionally, recent work in hematopoietic stem cells (HSC) implies that more complicated cases of multistability help in canalizing HSC lineage determination [16]. The discovery of induced pluripotent stem cells (iPSC) [17]–[20] further demonstrates that cellular state transitioning within a multistable system is a reversible process that can potentially be engineered for therapeutic purposes. Despite evidence for the existence of multiple substates of pluripotent and hematopoietic stem cells [21]–[25], the molecular mechanisms responsible for gen- erating complex multistability are poorly understood. Transitions between stable states may underlie widely observed stem cell population heterogeneity [23], [25]–[34] , random latency in induced pluripotency [34], [35], and even the progression of diseases such as cancer [36], [37]. According to this view, regulatory networks capable of adopting multiple stable states may enable cells to explore adjacent fate possibilities through state transitions, thereby enabling differentiation in the case of stem cells, or generating phenotypic diversity for selection to act upon in the case of tumor evolu- tion. Recent studies have implicated the fully-connected triad (FCT) [38]–[40] as a recurring network motif among the transcriptional regulatory circuits that control the development and maintenance of cellular states. Most notably, the Oct4-Sox2-Nanog triad is believed to be the key functional module in the maintenance of stem cell pluripotency [41]–[43]. The success of induced pluripotency through the overexpression of transcription factors alone further indicates the dominant role of transcription factor networks in maintaining pluripotency and directing cellular reprogramming. All this

2 suggests a strong relationship between biological multistability and the conserved FCT network of transcription factors(Figure 1), and motivates a need for understanding the relationship between FCT regulatory relationships and multistability. Here we developed a high-throughput computational approach to search the network dynamics of all fully-connected transcription triads for their potential to realize multistability. Our computational searches identified complete auto-activation, (i.e.: all three nodes activate themselves), as a topological feature of triads capable of multiple stable steady states (SSS). Detailed analyses were performed on network topologies selected for a high likelihood of multistability. These analyses show that the relative stability of SSS can be tuned by adjusting the network regulatory strengths between nodes. After incorporation of gene expression noise in stochastic simulations, the system was able to spontaneously transit from metastable (SSS with relatively lower stabilities) to differentiated states (SSS with at least one gene OFF and high stabilities), behaving like a three-dimensional Markovian jump system with bias. This is also consistent with experimentally demonstrated stochastic cancer cell state transitions [44], which result in a phenotypic equilibrium in populations of cancer cells. Different levels of noise, designed to mimic degrees of “noisy” transcriptional activity in cellular systems, were found to either promote or disrupt state transitions, with some transitions requiring layovers at one or more intermediate states. Such connections between gene expression noise and cell state transitions have also been demonstrated in many biological processes [45]. Finally, in our simplified theoretical simulation of reprogramming to generate induced pluripotent stem cells, the distribution of random temporal latency could be described by an Inverse Gaussian distribution, a finding consistent with experimental observations [34]. Taken together, our work illustrates

3 that computational screening, analysis, and simulation of network dynamics can serve as valuable tools for connecting conserved regulatory modules with complex biological processes, such as development, disease and stem cell reprogramming.

1.1 Results

1.1.1 Network Enumeration and Parameter Scanning for Multistability

To define the set of FCT networks with a high likelihood for multistability, we first enumerated all possible fully-connected three-node topologies (2A). Here, nodes represent transcription factors and edges represent possible transcriptional regulations. After taking into account topological equivalencies, we were left with 104 unique FCTs as candidate topologies, which fully represent all 29=512 possible FCT topologies. Each FCT candidate was generally modeled by a set of three-variable ordinary differential equations (ODEs), with each variable representing the abundance of a single transcription factor and each of the nine tunable parameters representing a single regulation strength (Figure 2A). The activation and repression of each gene by other factors are modeled additively to account for reported mechanistic independence of transcription regulations [16], [40], [46] (MATERIALS AND METHODS). This simplification can be expanded to include more mechanistic details of specific networks without altering the general conclusion of this screening. A state of the network is then defined as an equilibrium or SSS if all transcription factor abundances do not change over time. An algorithm was developed to automate the search process to identify all possible states. Specifically, to measure the probability of multistability, we scanned N = 421,875 parameter sets for each of the 104 networks to identify all possible states. The values

4 of these parameters are generic but cover a wide range to serve as a proof of principle for diverse biological scenarios. The probability of multistability was then defined as the number of parameter sets giving rise to at least 5 stable steady states divided by N. This probability reflects the robustness of certain FCT’s multistability, and hence quantifies its possible natural presence frequency. The entire FCT multistability screen is an ensemble of many independent computations and, as a result, was run in parallel and performed in high-throughput computer clusters (see MATERIALS AND METHODS). Such large-scale screening is computationally intensive and is only possible with simplified model equations. Therefore the chosen forms of our ODE models omit many biochemical details such as and transcriptional factor co-binding. The models serve as a general abstraction and system specific details could be incorporated later in light of specific experimental evidence.

5 Figure 2. High-throughput computational screening for multistable networks. (A) FCTs were enumerated and modeled by ODEs (parameters denoted as P). (B) The multistability probability of 104 FCTs generated by the computational screen is plotted, with a high multistability probability FCTs marked with an X with the color representing the multistability group; the two highest probability groups are labeled by their indices. (C) The Fifteen FCTs with the highest multistable potentials are represented visually.

6 1.1.2 Complete Auto-Activation Contributes to Multistability

Using our network screening approach, we quantified and ranked the potential for multistability of all 104 possible FCTs (Figure 2A). Of the 104, 15 showed significant probability for multistability (marked in Figure 2B and topologies drawn in Figure 2C). The multistable FCTs were empirically categorized into three groups according to their probability of multistability (color shaded in Figure 2B,C). The first group, which had the highest probability of realizing multistability and with the potential for eight states, consisted of a single network (Network 84) with all positive auto-regulation and all mutual inhibition. The second group of multistable FCTs, consisting of Networks 8, 38, 60, and 62, was also exclusively positive auto-regulating but displayed a mixture of positive and negative mutual regulation. Specifically, these networks were either enriched for mutual inhibition (Networks 60 and 62) or displayed symmetrical topologies (Network 8 and 38). The third group, which was comprised of 10 FCTs that exhibited lower but still significant potential for multistability, once again featured similar positive auto-regulation as a common attribute. The networks of the third group had either asymmetric topologies (Networks 3, 10, 12, 27, 36 and 56) or competing mutual regulation of target genes (Network 5, 25, and 30), except for Network 1, which was entirely void of inhibitory regulation. These results suggest that an appropriate combination of mutual inhibitory and activating regulation helps to increase the potential for a network to realize multistability, a finding that is consistent with previous theoretical studies [47]. From our screen, we were able to identify a set of important, minimum topological features capable of multistability. In particular, of the possible 16 FCTs with com- plete positive auto-regulation, our screen selected 15 as having a high likelihood for

7 multistability. Network 54 was the only FCT displaying complete auto-activation with insignificant potential for multistability, likely because of competing mutual activation and repression of all the nodes (bottom of Figure 2C). Indeed, positive auto-regulation has been shown to be an important feature of bistable systems [7], [9], [10], [14], [48], [49], and recent experimental work also suggests it could be necessary for maintaining multipotency in stem cells [50]. The role of positive feedback in multistability has been well studied [9], [14], [51], and the interconnected autoregulatory loop formed by Oct4, Sox2, and Nanog is thought to underlie the bistable switch between ESC self-renewal and differentiation, and the ability of core ESC regulators to ‘boot up’ the pluripotency transcriptional program by forced expression during reprogramming [52]. Our computational results similarly suggest that, within the limits of the study, complete positive auto-regulation is an important contributor to the generation of high-dimensional multistability across diverse FCT topologies. It is striking to note that one or two instances of auto- activation, no matter how strong, are not sufficient to generate multistability in our simulations. Therefore, it is possible that complete auto-activation of FCTs could be used as an important hallmark in the identification of new multistable gene networks.

1.1.3 Sensitivity and bifurcation of multistability

8 Figure 3. Bifurcation and Stochastic Analysis of Network 84. (A) The topology of Network 84 was predicted to have the highest probability for multistability. (B) The bifurcation diagram of Network 84 plots transcription factor concentrations of X and Y at each SSS against α, the self-activation strength. The mutual inhibition parameters are also set equal and referred to as β, here set to 0.1 and α is used as the bifurcation parameter ranging between .01 and 100. SSS are color coded and listed in the legend to distinguish different SSS, where ‘+’ indicates the gene is ‘ON’ and ‘-’ means ‘OFF’. (C) Simulations of noise-induced state transitioning in Network 84 under different levels of noise. Simulations were performed with auto-regulation equal to 1 and mutual inhibition equal to 0.1, with noise levels of 0.5, 0.85, and 1 from left to right. The locations of the deterministically calculated states are indicated with red spheres, with their size correlated to their spectral radius. The blue ribbons indicate the temporal trajectories of the simulations. The black arrows indicate initial direction of state transitions. (D) The number of states traveled in the stochastic simulations plotted versus noise strength. The red line represents simulations that were initiated from the all-ON state, and the blue line represents simulations that were initiated from the origin.

9 Network 84 (Figure 3A, all positive auto-regulation and all mutual inhibition) yielded the highest potential for multistability in our screen (Figure 2C). To gain a deeper understanding of this FCT, we performed a bifurcation analysis [53], [54] to investigate how the change of one regulation strength affects the number, stability, and location of all states, while other parameters are kept constant (Figure 3B). Bifurcation analysis is a useful analytical tool that can reveal when and where one state can bifurcate into two or more states in response to a changing parameter. This is analogous to a progenitor cell transitioning from a self-renewing phase to a multi-lineage differentiating phase in response to a stimulus [16]. In the bifurcation diagram of Network 84 (Figure 3B), locations of both SSS (thick colored lines) and unstable steady states (thin grayed lines) are plotted. X and Y indicate the abundance of gene product of X and Y. The axis of α represents the strength of all self-activation, while other parameters are kept constant. Gene product Z is omitted because only three quantities can be visualized in the plot. Nevertheless, its abundance can be inferred from that of X and Y. Thus the figure provides the position of SSS as the concentration of the three transcription factors, and their self-activation strength, varies. As can be seen in Figure 3B, Network 84 has complex dynamics. When auto- activation is very weak (α less than 1), the system only has one SSS (the black line), which corresponds to the state in which the abundances of all three gene products are very low and could be roughly categorized as all genes OFF. This finding intuitively makes sense considering the mutual repression of Network 84’s topology, which under weak auto-activation does not support individual genes overcoming degradation to support the ON state. However, as the strengths of auto-activation increase beyond a threshold, the system suddenly demonstrates many more SSS (colored solid lines).

10 All of these SSS branch out from the all-OFF state, which is the typical nature of bifurcation. When α = 1, the system can assume its maximum of eight states (illustrated in Figure 3B as red circles). These states include the following possible binary combinations: all three genes ON (orange); all three genes OFF (black); two genes ON and one gene OFF (blue, cyan, and purple); and, one gene ON and two genes OFF (green, brown, and yellow). The locations of these eight SSS are also plotted in Figure 3C in XYZ coordinate. The sizes of the circles correspond to each state’s stability (details below). Network 84 is able to maintain multiple SSS for a wide parameter range of auto-activation. The all-OFF state (Figure 3B black line) disappears when auto-activation strengths become greater than 1.05 (Figure 3B gray circles illustrate locations of SSS when α =5). As the auto-activations become stronger (e.g., α greater than 7), all three one-gene-ON (green, brown, and yellow) states disappeared, leaving only two-gene-ON and all-ON states to be SSS (blue, cyan, purple, and orange lines in Figure 3B). The potential for this three-node gene network to yield as many as eight stable states underscores the idea that many cell types (e.g., over 200) could be generated from networks comprised of a small number of nodes. Taken further, a fully-connected network with only eight nodes could produce 256 states [47]. While only theoretical, it is an example of the potential for such combinatorial complexity and provides a plausible mechanism for cell lineage determination, where different gene product abundances and ON-OFF combinations could drive the system towards a spectrum of cell fates. Coupled with our bifurcation analysis, where we modeled the abrupt appearance or collapse of SSS in this topology, FCTs could also lead to discrete jumps between states in response to a changing cellular environment or signaling. The phenomenon of hysteresis, when such jumps cannot be reversed by returning to the

11 previous environmental conditions, has been established as the mechanism for many irreversible cell fate determinations [55].

1.1.4 Stochastic State Transitioning

Next, we investigated the effects of gene expression fluctuations on the multistability potential of Network 84. Fluctuations in gene expression under constant environmental conditions may arise from inherent stochastic fluctuations in gene expression, or from regulatory architecture, chromatin structure, or transcriptional bursting. Here, we use the broad term ‘gene expression noise’ to encompass all these possibilities. Gene expression noise has been shown to affect cellular physiology in single-cell organisms [56]–[62], and has also been linked to random developmental patterns and cell fate decisions in multicellular organisms [63]–[67]. The mechanisms underlying the utilization and regulation of gene expression noise in higher organisms, however, are not well understood. We introduced such noise into our model by modifying the algorithm from Adalsteinsson et al. [68] to help understand how inherent stochasticity affects cell differentiation (see MATERIALS AND METHODS). Using our modified algorithm, we simulated the temporal evolution of all three gene product abundances under the influence of different levels of gene expression noise. The system was initialized at the all-ON state to mimic the promiscuous gene expression state that is often observed when various progenitor cells start to differentiate [69]–[73]. The locations of eight SSS when α =1 are indicated by the red spheres in Figure 3C, where the size of the spheres represents the stability of each state as measured by its spectral radius [74], [75]. As shown in Figure 3C, at small noise levels (Figure 3C, left panel), expression of all three genes displayed fluctuations, yet the trajectory remained focused around the initial all-ON state, indicating that

12 small levels of gene expression noise were not sufficient to induce spontaneous state transitioning in Network 84. This also indicates the stability of the network under well-buffered conditions. As the level of noise was increased to an intermediate level (Figure 3C, middle panel), the system began to transition from the initial all-ON state to a two-ON-one-OFF state and later to a two-OFF-one-ON state. Within the time frame simulated, the system transitioned back and forth between these two latter states and spent most of the time in the vicinity of these two states. When the noise level was increased further, the system was able to visit all six states with at least one gene ON and one gene OFF (Figure 3C, right panel). The results of the stochastic state-transition simulations have several interesting biological implications. First, the system transitioned exclusively between states with at least one gene ON and one gene OFF. After jumping out of the initial state, it never returned to the all-ON state within the simulation time frame; moreover, it never reached the all-OFF state. For this topology, the all-ON and all-OFF states are significantly less stable than the other six states, and thus serve as weaker attractors. Specifically, it is the fine balance in the abundance of all gene products that determines the stability of the all-ON and all-OFF states, and this balance is easily disrupted by random perturbations such as gene expression noise. These results provide a simplified and yet concrete demonstration of the recently proposed concept of “dynamic equilibrium” [76], in which individual cells can jump between states randomly, while the proportion of cells in a population at one specific state remains constant. Here we can see the proportion (probability) of cells in one state is determined by the stability of each state, which in turn is determined by network topology and system parameters. Therefore when one single fluorescence reporter tagged to one gene is experimentally used to track single cell dynamics, the cell

13 population will exhibit a broad distribution because this multi-dimensional system is projected onto a single measured dimension. Second, state transitions were only observed between two adjacent states and not between non-neighboring states. In other words, the transition from one state to another non-neighboring state required layovers at one or more intermediate states. This phenomenon resembles the experimental observation that progenitor cells pass through multiple intermediate states as they differentiate [5]. Moreover, these simulations also highlight that the transition of a triad from one state to another can take multiple possible routes. Similar results have been observed at the cellular level, where recent progress in de-differentiation [17], [19], [20] and trans-differentiation [77] has shown that natural developmental cellular state transitions may be only one of many possible routes between cell types [78]. Third, trajectories of the stochastic simulation showed a directional bias in gene expression state space, presumably due to mutual repression. For instance, when the system was in the X-ON/Y-ON/Z-OFF state, its trajectory was virtually limited to the X-Y plane, with little fluctuation in the Z direction (Figure 3C, middle and right panels). The result is that each state in these simulations was functionally “flat” and comprised of the expression from only two genes. Similarly, the one-ON states had a polar or single dimensional expression distribution. Next, we subjected the system to a range of noise levels to gain a deeper under- standing of the system’s response and pattern of state transitioning. Noise strengths ranging from 0.4 to 1.2 were selected for analysis. Plotted in Figure 3D (red line) is the number of states traveled by the system, from the all ON state, within the simulated time versus the noise strength. The result shows a step function-like increase in the number of states with increasing noise strength. A similar pattern was observed when

14 the simulation was initiated at the origin (blue line) rather than the all-ON state. One exception is that from this alternative starting point the system could not travel to the all-ON state and therefore had only six accessible states. In both simulations, the sharp increase in the number of states traveled occurred between noise strengths of 0.8 and 1 (Figure 3D). This appears to be the critical threshold for overcoming the stability barrier between states and for complete noise- induced state transitioning. The existence of such a critical noise threshold suggests that natural systems may have the ability to operate in different regimes that either utilize or repress gene expression stochasticity to achieve distinct physiological goals.

1.1.5 SSS Analysis

To further explore Network 84’s response to two or more simultaneously changing parameters, we conducted more thorough parameter scanning (see MATERIALS AND METHODS for details). First, two parameter combination regimes shown in Figure 4A and B suggest that, with strong auto-activation, Network 84 is ensured to have at least four states (bottom two rows). Under these conditions, weak mutual inhibition led to seven or eight states (orange and green boxes) while strong mutual inhibition led to only four: all two-ON-one-OFF combinations and all-ON (purple box). However, when both mutual inhibition and auto-activation are weak, Network 84 assumes only one state, that of all genes OFF (Figure 4A, blue box). When auto-activation is weak and mutual inhibition is strong the system is capable of the three two-ON-one-OFF states (Figure 4A, red box, locations of SSS not shown). Taken together, this again reinforces the importance of strong auto-activation in generating multistability and hence the potential to canalize cell differentiation. As a third parameter combination regime, we tested the effect of asymmetrical regulation on multistability (Figure 4C, see

15 Figure 4. Stable Steady State Analysis of Network 84. (A) A matrix presenting the number of stable steady states generated by combinations of different auto-regulatory strengths (rows) and mutual-regulatory strengths (columns). The green and orange regions are visualized as FCT diagrams independently in B. (B) The green-boxed and orange-boxed diagrams corresponds to the green-boxed and orange-boxed parameter combination regime from (A). (C) Many of the parameter combinations yielding multi-stable systems are not represented by the matrix, and are instead abstracted here. As an example, here we present a parameter combination regime that can support 6 SSS and auto-activation strengths are similar to those in the orange box (thick red arrows) but have one unrestrained pair of mutual inhibition that can exist at a very high strength (blue T-head repression lines).

MATERIALS AND METHODS for details). Here we increased the mutual repression of a single pair of genes for the conditions represented by the green and orange boxes (Figure 4A, auto-regulatory strength between 1- 5) and saw the number of SSS fall to six. These findings provide possible insight into why certain network topologies have been adopted by evolution for particular regulatory tasks. For example, Network 84 demonstrates a broad, multistable parameter space that can be tuned, through regulatory strengths, for the step-wise control of the maximal number of states. Such

16 a mechanism could be envisioned to transit cellular identity from a pluripotent state to multi- or uni-potent state during differentiation.

1.1.6 A Regulatory Network for Pluripotency

Experiments probing the structure and function of the regulatory network governing pluripotency suggest that the core pluripotency regulators Oct4, Sox2, and Nanog are organized into a regulatory topology of mutual activation that is shared with Network 1 selected by our screen, [21], [79]. This fully interconnected autoregulatory loop is believed to stabilize self-renewal, govern the bistable switch between ES cell self-renewal and differentiation, and underlie the ability of selected pluripotency regulators to ‘boot up’ the pluripotency transcriptional programming upon forced expression in somatic cells, thereby reprogramming them to a stem cell-like state [52]. Additional layers of regulatory complexity have been described in this network, including autorepression by Nanog [80], [81] and differential cell phenotypes in response to varying levels of these factors [82]–[84]. Moreover, Nanog acts as a homodimer [85], while Sox2 and Oct4 heterodimerize to regulate transcription [86], and attempts have been made to model the regulatory dynamics of these transcriptional complexes and their interaction with signaling factors [87]. Further adding to the complexity, Oct4 may also partner with other Sox factors to form alternative regulatory complexes that bind a distinct set of enhancers [88]. Notably, protein levels of the core pluripotency regulator Nanog have been observed to fluctuate in cultured pluripotent stem cell populations [23], [24], [89]–[92], suggesting that this regulatory network is capable of adopting multiple states. To further explore possible roles of multistability in stem cell transcriptional regulation, we carried out

17 Figure 5. Bifurcation and Stochastic Analysis of Network 1. (A) Network 1 shares the topological architecture of the Oct4-Sox2-Nanog triad, and is believed to underlie stem cell differentiation and reprogramming. (B) The bifurcation diagram of Network 84 plots transcription factor concentrations of X and Y at each SSS against α, the self-activation strength. The mutual inhibition parameters are also set equal and referred to as β, here set to 0.1 and α is used as the bifurcation parameter ranging between .01 and 100. SSS are color coded and listed in the legend to distinguish different SSS, where ‘+’ indicates the gene is ‘ON’ and ‘-’ means ‘OFF’. (C) Simulations of noise-induced state transitioning in Network 1 under different levels of noise. Simulations were performed with auto-regulation equal to 1 and mutual inhibition equal to 0.1, with noise levels of 0.5, 0.85, and 1 from left to right. The locations of the deterministically calculated states are indicated with red spheres, with their size correlated to their spectral radius. The blue ribbons indicate the temporal trajectories of the simulations. The black arrows indicate initial direction of state transitions. (D) The number of states traveled in the stochastic simulations plotted versus noise strength. The red line represents simulations that were initiated from the all-ON state, and the blue line represents simulations that were initiated from the origin. 18 Figure 6. Stable Steady State Analysis of Network 1. (A) A matrix presenting the number of stable steady states generated by combinations of different auto-regulatory strengths (rows) and mutual-regulatory strengths (columns). The green and orange regions are visualized as FCT diagrams independently in B. (B) The green-boxed and orange-boxed diagrams corresponds to the green-boxed and orange-boxed parameter combination regime from (A). (C) Many of the parameter combinations yielding multi-stable systems are not represented by the matrix, and are instead abstracted here. As an example, here we present a parameter combination regime that can support 6 SSS and auto-activation strengths are similar to those in the orange box (thick red arrows) but have one unrestrained pair of mutual inhibition that can exist at a very high strength (blue T-head repression lines). detailed analysis on this network topology (Figure 5A). Our goal in doing so is not to capture every regulatory complexity present in the endogenous biological network, but rather to provide a theoretical framework for how this simplified network topology and fluctuations in network components might affect multistability. As with Network 84, we first carried out the bifurcation analysis for Network 1 (Figure 5B). Considering the topology of Network 1 consists solely of positive regulation, one might expect that the network would primarily assume the all OFF “resting” or all ON “activated” states. This intuition is largely correct but not complete. While network 1 does tend toward

19 the all-OFF state it is also capable of generating eight states when the strengths of auto-regulation and mutual regulation are 1 and 0.1, respectively (Figure 5B and C; Figure 6A). This diagram (Figure 5B) shares some similarities with that of Network 84 (Figure 3B). For instance, both networks exhibit eight SSS when auto-activation strength is equal to 1, and fewer as auto-activation strength increases. Likewise, when auto-activation is reduced to be smaller than 1, both networks collapse to only support one SSS (the black line). Thus Network 1 also displays discrete emergence and collapse, or hysteresis, of SSS, allowing it to potentially serve as a foundation for irreversible cell fate decision-making. It is interesting that while the two-ON states (blue, cyan, and purple lines) collapse readily as α increases, the one-ON states (green, brown, and yellow lines) and the all-ON and all-OFF states (orange, black), are stable over a wide range of parameter values. It is also important to note that despite similarities in the state space locations of each of the eight SSS in Networks 1 and 84, there are considerable differences in their relative stabilities. As with Network 84, we next incorporated different strengths of gene expression stochasticity, or noise, to simulate state transitioning (Figure 5C). The simulations were started from the all-ON state to mimic the all-ON pluripotent state of the Oct4-Sox2-Nanog triad. In the presence of small levels of noise, the system remained in the initial state throughout the duration of the simulation (Figure 5C, left panel). When the noise level was increased, the system was able to spontaneously transition to other states before settling in the all-OFF state, which represents the all-OFF terminal, or differentiated, state (Figure 5C, middle panel). When the noise level was increased further, the system transitioned out of the initial state much earlier toward the all-OFF state via an intermediate state (Figure 5C, right panel), confirming that the all-OFF state is indeed the terminal state and robust to different levels of noise.

20 It should be noted that simulation times were the same for each panel (Figure 5C). In the high noise level simulation (Figure 5C, right panel), the system reverted to the all-OFF state relatively quickly and exhibited small stochastic deviations that rarely exceeded the boundary of the sphere representing the state. This could perhaps be likened to the terminal state of the Oct4-Sox2-Nanog triad after cell differentiation, where expression of these genes is markedly depressed in gene expression profiles. Furthermore, for the modeling of cell fate decisions during development, the binary nature of these simulations appears to mimic the observed stable states of the Oct4-Sox2-Nanog triad, typically in either the all-ON or all-OFF state, with partial activation visible transiently during iPSC induction [17], [18], [93]. Much like the trajectories observed for Network 84, those for Network 1 exhibited the ability to transition between many states stochastically, with the potential of accessing five of the eight possible states. Although, as depicted, the residence time of states for Network 1 was largely restricted to either all-ON or all-OFF (Figures 3C, 5C). Network 1 simulations could not access the three two-ON-one-OFF states unless initiated there, presumably because of the states’ low stabilities. Moreover, unlike Network 84, all the simulation trajectories for Network 1 ultimately reached and remained at the all-OFF state, a result of Network 1’s topology and the superior stability of the all-OFF state. This also led to a lower average number of states traveled by Network 1 at higher noise levels (Figure 5D). A higher noise threshold for state switching of this sort could serve as added protection against aberrant regulation of pluripotency. From a modeling perspective, these results offer a useful means of rapidly predicting the dynamical stochastic trajectory of a network based on state stabilities, without having to run entire stochastic simulations. As with Network 84, we tested the sensitivity of Network 1 to different combinations

21 of regulation strengths (Figure 6). Again, weak mutual activation and strong auto- activation gave rise to the maximal number of states (Figure 6A, green and orange boxes). Interestingly, however, the remainder of the regulation parameter space was much simpler than that of Network 84, yielding only one or two states (Figure 5A, blue and red boxes). Again, independent manipulation of mutual regulatory interactions can also lead to parameter combinations that enable multistability. A thorough parameter scan for such multistability found a third interesting configuration, one where a gene (Z) propagates signal to downstream genes (X and Y), which also have strong auto-activation (Figure 6C). This pattern has one defining characteristic in that it maintains multi-stability as long as there is only one master while the promoted genes have weak interactions with each other. Such multi-stability is lost when the system adopts a two-driver gene topology. Therefore Network 1 may be considered a largely bimodal topology that retains the potential for greater state diversity, but only within a narrow parameter window (Figure 5B and Figure 6A). This characterization of Network 1 topology is supported by other studies. Specifically, while claims of monoallelic expression of Nanog [89] have been contested [24], [29], a common finding across the literature is heterogenous expression of endogenous Nanog protein in cultured ES cells by immunostaining [23], [24], [75], [76], [89], [90]. These findings suggest bistable expression of Nanog, and support the idea that the Oct4-Sox2-Nanog network may exist in at least four stable states (all high; Oct4 and Sox2 high but Nanog low, as occurs in low Nanog ESCs; high Sox2 but low Nanog and Oct4 as occurs in neural progenitor cells [94]; or all low/off in the differentiated state). Analysis of the regulatory parameter space for networks 1 and 84 raises an important consideration that may be broadly applicable. Could progressive lineage determination,

22 or restriction of pluripotency, be correlated with the strength of mutual regulation? For both of the networks analyzed, the highest potential for multistability (eight states) was found in a parameter space with weak mutual regulation (Figure 4A and Figure 6A). As the strength of mutual regulation increases from this point, more restricted potential fates predominate, much as would be expected as cell lines differentiate. The influence of greater mutual regulation on the potential for multistability would be interesting to explore, and perhaps is well-suited to modeling in the hematopoietic system where hematopoietic stem cells (HSCs) or multi-potent progenitors could be compared with bi-potent progenitors.

1.1.7 In Silico Induced Pluripotency

Reprogramming mature cell types into iPSCs is achieved by the overexpression of a number of defined transcription factors (typically including Oct4, Sox2, c-Myc, and Klf-4) that regulate the necessary pluripotency phenotypes as well as each other’s expression [18]–[20]. Recently, it has been shown that overexpression of only Oct4, Sox2, and Nanog results in similar rates of iPSC generation from human amnion cells [95], and, importantly, the Oct4-Sox2-Nanog triad is highly expressed and appears to be the core regulatory circuit for pluripotency in stem cells [96]. To construct a simplified in silico version of the iPSC experiment involving the Oct4-Sox2-Nanog triad, we added a constitutive expression term to each gene of Network 1 (Figure 7A, see MATERIALS AND METHODS). To begin, we performed bifurcation analysis on the network 1 topology with an added term for constitutive expression (Figure 7B). The addition of a basal level of expression to Network 1 resulted in at most five possible states, two of which (the orange all-ON and black all-OFF lines) possessed considerably higher stabilities than

23 Figure 7. Bifurcation and Stochastic Analysis of Network 1 with constitutive basal expression. (A) The topology of Network 1 is used to mimic iPSC experiments with added constitutive expression of all three genes. (B) The X and Y protein concentrations of stable and unstable steady states are plotted against self-activation strengths. With the addition of the constitutive expression we can see a further decrease in stabilities, specifically two-ON states are never stable. Additionally one-ON states lose stability very rapidly, approximately at α=1, and the all-OFF state destabilizes after α=9 leaving only the all-ON state. This can also be seen in the sizes of the spectral radii, deterministically these steady states exist but in practice they have very little influence. (C) Simulations of Network 1 modeled in the presence of an additional constitutive overexpression term consistently show a transition from an all-OFF state to an all-ON state after a period of latency. the others (Figure 7B). All one-ON-two-OFF SSS (green, brown, and yellow lines) are only stable for a narrow range of α. Even when they are stable, their stabilities are insignificant. Looking closely we can see that between the red, green, and yellow lines there are 3 unstable paths without a stable component. These lines would be the locations of two-ON-one-OFF states if they achieved stability. More importantly, as auto-activation strengthens, even the all-OFF state’s stability becomes insignificant when compared with the all-ON state (Figure 7B gray circles). A comparison of bifurcation diagrams of Network 1 with and without the constitu- tive expression term (Figure 5B, Figure 7B) highlights some striking and important

24 differences. Firstly, as observed experimentally, by adding the constitutive basal expression term to all three genes, there is increase in the stability of the all-ON state. Moreover, our simulation predicts a reduction in the number of possible SSS from eight to five, as well as a significant reduction in the stability of the non all-ON or all-OFF states. While these findings must be considered within the limits of our study, they suggest constitutive expression could help to canalize the transition from all-OFF to all-ON by simplifying the triad’s regulatory space, and thus provide a foundation for induced pluripotency. To further examine modified Network 1, the effects of gene expression noise were investigated through stochastic simulation from the all-OFF state with a moderate level of noise (see MATERIALS AND METHODS). Consistent with experimental evidence [34], at the start of the simulation, the system hovered near the origin for a period of time and then began to transition toward the all-ON state. Once again, by adding a basal overexpression term to the genes, we found that the stability of the all-ON state was increased to be greater than that of the all-OFF state (Figure 7C). As a result, the system was attracted to the all-ON “pluripotent” state in a way that was not observed with the unmodified Network 1 (Figure 5C). These findings reflect observations from the experimental practice of induced pluripotency in which the reprogramming factors are strongly overexpressed, and may shed light on the regulatory mechanisms behind induced pluripotency. For instance, the overexpression of defined transcription factors in differentiated cell types may alter the regulatory parameters of the FCT into ones that prefer the all-ON “pluripotent” state. Affected cells would then gradually transition from their original state to the preferred all-ON state. Further, as is often done experimentally, we removed the constitutive expression term from our simulation and found the system to return

25 to an unmodified Network 1 behavior. Much like iPSCs, the network was able to re-differentiate to the all-OFF state or to remain in the all-ON state, under conditions of low noise or high auto-activation respectively.

Figure 8. Stochastic simulations capture latency distributions in induced pluripotency. (A) Distribution of simulated latency for the modified Network 1 model illustrates a skewed bell shape. The histogram was generated from 2000 simulations. (B) Similar results generated from 2000 simulations with Network 84.

To further test the utility of our model as an abstraction for the transcriptional regulation of stem cells and stem cell reprogramming, we next used our model to predict the temporal dynamics of stochastic reprogramming. We first investigated the time required for the system to transition from the all-OFF to the all-ON state, which is termed the first passage time (FPT) [97]. An FPT distribution of induced pluripotency was generated by running 2000 simulations with our modified Network 1, resulting in a characteristic asymmetrical bell shape distribution with a long tail and a peak at approximately 50 normalized time units (Figure 8A). To ensure the generality of this observation, an FPT distribution for the transition from one state to another

26 in Network 84 was similarly generated and found to yield the same pattern (Figure 8B). Such a skewed bell shape distribution is consistent with published experimental data [34].

1.2 Discussion

Over fifty years have passed since the connection between development and multista- bility was first proposed [3]. However, despite efforts in the last decade to reverse engineer transcription regulatory networks, the molecular basis for multistability is still not well understood. Likewise, multistable systems with more than two states have not yet been constructed de novo. We address this challenge with a high-throughput, dynamical systems search algorithm, powered by parallel computing, that identifies gene network topologies enriched in their capacity to realize multistability. Focusing our screen on FCTs, our algorithm identified complete auto-activation as an important topological feature for high-dimensional multistability, along with a variety of combinations of mutual regulation. Our analyses shed light on existing ideas and generate new testable hypotheses for some of the regulatory mechanisms underlying cell fate determination. Many of the seemingly disparate experimental observations including random reprogramming latency, the existence of intermediate states during differentiation and reprogramming, population heterogeneity, and alternate routes of cellular state transitioning [5], [27], [34], [96], [98], can potentially be connected and addressed by our dynamical analyses of FCTs. The central importance of complete auto-activation in multistable triads suggests that nodes for such regulatory networks could be identified in exogenous overexpression screens for known key regulators. In such a screen, the components of multistable FCTs would self-identify, by maintaining their own expression after the withdrawal

27 of candidate transcription factor overexpression. Transcription factors identified this way could represent “pioneers”, those capable of activating regulatory partners or overcoming transcriptional inaccessibility, i.e., silenced loci, such as Oct4. The identification of regulatory partners (mutual regulation) could also be char- acterized by the analysis of promoter/enhancer regions upstream of identified auto- activation transcription factors. Such a “guilt by association” method may also help to infer the function of the triad as a whole, or lesser-known regulatory partners. Thus, using established developmental transcription factors as seeds, the Waddington landscapes of whole cell lineages could perhaps be mapped by using auto-activation as a cipher. The notion that specific topologies may serve particular regulatory roles better than others is supported by the identification of an additional Network 1 regulatory topology in a stem cell population. The transcription factors Gata2, Fli1 and Scl/Tal1 form a regulatory triad that operates through mutual and auto-activation to promote hematopoiesis in the mouse embryo [99]. As discussed in the context of the Oct4- Sox2-Nanog triad, the Network 1 topology has a tendency to operate in a bimodal and terminal manner, meaning that once switched to the all-ON state the triad has a tendency to be locked into that state. Such a regulatory system would impart a sort of cell “memory”, perhaps allowing newly formed HSCs to travel from the aorta-gonad-mesonephros (AGM) or primordial niche to the fetal liver and bone marrow niche without losing their multipotent state [99], [100]. Importantly, regulatory triads do not exist in isolation and are not static entities. Rather, inputs from the environment and other signaling pathways can influence the state or composition of endogenous triads through the perturbation of regulatory relationships. The consequence of such influence was evident in our bifurcation

28 analysis of Networks 84, 1, and modified 1, where small changes to the strengths of the network regulation created or eliminated SSS. Thus, under the influence of external signaling (e.g., post-translational modification), subcellular localization of co-factors or regulatory ligands [101]–[103], the same network topology may alternate between different regulatory state configurations. Such state switching could be an effective regulatory strategy for responding to changing environmental and developmental cues. While gene expression stochasticity is an accepted factor of cell fate determination [61], its role in complex processes such as differentiation remains elusive. We subjected our chosen multistable networks (e.g., Network 84) to stochastic simulations and found that moderate levels of gene expression noise were sufficient to disrupt the balance of protein abundances required for maintaining all-ON and all-OFF states, even with high protein levels. In other words, our simulations suggest that noise is critical to a biological system’s ability to transition from meta-stable states to differentiated states. Moreover, noise-induced transitioning showed multiple possible routes from one state to another, with the trajectories traversing step-wise through adjacent states (Figures 3C, 5C). This is possibly the in silico manifestation of experimental observations showing progenitor cells passing through alternate intermediate states on the way to a differentiated state [5]. Additionally, by testing a range of noise levels, we found that the number of states traveled by these multistable systems increased sharply at specific noise levels (Figures 3C, 5C). Perhaps cellular systems straddle such a threshold, whereby gene expression noise can either be suppressed or amplified to realize very different physiological outcomes. The advent of single-cell gene expression profiling methods [104] affords new windows on cellular heterogeneity, and theoretical frameworks such as we present here may provide valuable insights when informed by the experimental data these technologies will provide.

29 Like signaling cascades in their simplest form, the regulatory relationships between the genes of FCTs define their meta-properties. In a manner reminiscent of the low- pass filtering capacity of cascades, our stochastic simulations detected network-specific variation in tolerance for noise. Specifically, in the analyses of Networks 1 and 84, significantly different noise thresholds for multistability were measured—0.8 and 1.4, respectively (Figures 3 and 5). This observation suggests that, along with other measures, networks may be tuned for unique noise-switching thresholds. For example, the Gata2-Fli1-Sci/Tal1 triad (Network 1) would be predicted to remain in the all-ON state as long as noise levels are maintained low, but to transition quickly to an all-OFF state under high noise. This could be tested experimentally using tunable synthetic noise controllers [94] or generators [105] upstream of these HSC regulators to apply large fluctuations to the system. As such, differentiation or pluripotency may provide readily accessible experimental models. One possible method to induce noise could be the expression of splice isoforms [106], [107]. Focusing on the endogenous triads discussed, state change dynamics could be evaluated under the perturbations of either single or multiple splice isoforms. Thus, while controlling for total protein expression within a particular triad, one could evaluate whether “noise”, derived from the concurrent expression of multiple isoforms, influences the product or rate of the induced state transition. If so, an interesting outcome may be that individual differentiation pathways could be primed for state change by expressing tailored suites of splice-isoforms or in the case of iPSC reprogramming, perhaps accelerate the rate of state switching. Finally, our work offers new directions for synthetic biology [11], [12], [108]–[112]. As a tool, the application of synthetic biology to disease-related or developmental triads may allow for true reverse engineering of in vivo processes. For example,

30 the regulatory dynamics of Network 1 (Gata2-Fli1-Scl/Tal1 and Oct4-Sox2-Nanog) or Network 84 (RUNX2, SOX9 and PPAR-γ) topologies could be systematically examined for response to perturbation or noise by constructing such FCTs in an orthogonal environment (e.g., bacteria). Similarly, the effects of stochasticity on the regulation of pluripotency could be studied directly in mammalian systems using tunable synthetic noise generators. Such a system could vary the gene expression of triad components, while keeping their mean expression level constant, to evaluate the influence of noise on network state change [57], [113]. Our work here provides a starting point for the design of next-generation synthetic circuits that could exploit multistability and its physiological consequences.

1.3 Materials and methods

1.3.1 Enumerating and Eliminating Redundant Networks

Three-node networks can have 39=19683 configurations, each with nine edges that consist of positive, negative or null regulations between the nodes. Our focus on FCTs resulted in a total of 29=512 possible configurations because each edge can only represent positive or negative regulation. Through permutation analysis many of these 512 networks were found to be identical to each other. Identical networks were eliminated, leaving us with 104 unique networks (see SI).

1.3.2 ODE Modeling and High-Throughput Screening

A three-node gene regulatory network can be described by a three-dimensional ODE:

dx = a1f1 (x) + b1g1 (y) + c1h1 (z) − δx (1.1) dt

31 dy = a2f2 (x) + b2g2 (y) + c2h2 (z) − δy (1.2) dt dz = a3f3 (x) + b3g3 (y) + c3h3 (z) − δz (1.3) dt In this formulation each variable (x,y,z) represents the protein abundance of one gene product (Equations (1.1)–(1.3)). Here only one generic equation is used to describe the gene expression regulation including transcription, translation, and other post-transcriptional and post-translational modifications. Each equation captures the core nonlinear dynamics of gene expression regulation and omits details of the

fine-tuning of it. The parameters (a1−3,b1−3,c1−3) represent the strength of mutual or auto-regulation, be it activation or inhibition. Each function (f, g or h) has the form Fn / (kn + Fn) when representing activation, and the form kn/(kn+Fn) when representing inhibition, where F can be either x, y, or z. The activation and repression of each gene by other factors are modeled additively to account for reported mechanistic independence of transcription regulations [16], [40], [46]. For example, Network 84 shown in Figure 3A can be described by the ODE:

dx xn kn kn = a1 + b1 + c1 − δx (1.4) dt kn + xn kn + yn kn + zn dy kn yn kn = a2 + b2 + c2 − δy (1.5) dt kn + xn kn + yn kn + zn dz kn kn zn = a3 + b3 + c3 − δz (1.6) dt kn + xn kn + yn kn + zn In the models values of n=4 and k=0.5 were assumed for all genes (Equations (1.4)– (1.6)). While it is known that transcription factors critical in stem cell programming often form homodimers [85], [114] or heterodimers [115], n is set to be equal 4 because the nonlinearity quantified by n is often affected by many factors in addition to protein multimerization [10]. A hill coefficient of 4 has been used previously to demonstrate generic behavior of stem cell differentiation [16]. Recently, hill coefficients

32 up to 9 have also been used [40] to construct models with experimentally verified predictions. It is also found in our study that hill coefficient values affect the region of multistability but not general conclusions. While it did not affect the results of steady-state calculations, δ was set to be 1; this assumption can be adjusted in light of future available experimental results. For mutual regulation strength, the parameters

(b1, c1, a2, c2, a3, b3) were set to [0.1, 0.2, 0.5, 1 or 5]; for auto-regulation strength, the parameters (a1, b2, c3) were set to 0.1, 0.5 or 1. These parameter possibilities lead to a total of 56*33=421,875 parameter sets to test. We numerically solved for the root of the right-hand side of each ODE set with over 1000 different initial guesses uniformly distributed over the entire state space. Measuring the probability of multistability for each FCT was accomplished by analyz- ing all 421,875 parameter sets. A parallel algorithm was developed to distribute this task as an ensemble of independent computations (See the SI).

1.3.3 Analysis of the Parameter Space

In addition to the tabular data (Figure 4A and 4A) that is useful for quick visualizations and analysis of stabilities, it is desirable to recognize key patterns of parameter combinations (PCs) yielding multistability. Therefore networks chosen (Network 84 and 1) for in-depth analysis had their results recomputed with a wider sample of parameter spaces. Specifically the PCs were expanded to five levels for each edge, [0.1 0.5 1 2 6], yielding 5^9 = 1,953,125 possible PCs. The found clusters can then be generalized to create the FCT diagrams in Figures 4C and 4C.

33 1.3.4 Bifurcation Analysis

Bifurcation diagrams were generated using MatCont [116], a continuation toolbox for MATLAB. Bifurcation analysis was performed on the auto-activation strength; mutual-regulation strengths were set at .1. Spectral radii were gathered at specific intervals to compare the stability of the stable steady states. Stable and unstable steady state information was graphed to generate Figures 3B, 5B, and 7B.

1.3.5 Stochastic Simulations

The ODE model can be converted to chemical Langevin equations to introduce stochasticity while approximating the Gillespie algorithm [68], [117]. All parameters were scaled up so that copy numbers of gene products are in biologically reasonable quantities and can reach over one hundred. This was done by multiplying the coefficients in a, b, c and k by 100 with other parameters unchanged. This linear shift does not alter the bifurcation dynamics or the relative locations of SSS. At each iteration, all chemical species are updated using the equation:

p N (t + ∆t) = N (t) + ∆tA (N (t)) + S ∆t (F (N (t)) + B (N (t)))z (t) (1.7)

In this equation, N (t) represents the abundance of chemical species, A(t) is the right-hand side the ODE, and F and B are the forward and backward reaction terms in each equation, respectively, which were extracted from corresponding ODEs. z(t) are the standard Normal variables, S is the stoichiometric matrix of the biochemical reactions (refer to [68] for additional details). In our modified algorithm, we added one parameter, α, to Equation 1.7 to represent the tunable noise strength:

34 p N (t + ∆t) = N (t) + ∆tA (N (t)) + α ∗ S ∆t (F (N (t)) + B (N (t)))z (t) (1.8)

Here α was the noise strength used in Figures 3C and 5C. Therefore, the results were comparable to pure discrete simulations when α equals 1. Modulation of the noise strength(α) was used to facilitate investigations of the system’s dynamics in response to different noise levels. The multiplicative noise term is chosen for simplicity without loss of generality.

1.3.6 Statistical Tests

All statistical tests were performed using Matlab. The parameters for each distribution were estimated using the Matlab command mle. Chi-square (X 2) statistics for each distribution were computed with the chi2gof command.

35 Chapter 2

GENE NETWORK MODELING

Embryonic stem cells have the capacity to differentiate into more than 200 isogenic progenitor and terminal cell types [1], each of which represents a stable cellular state. The potential for a system to realize multiple states is termed multistability, which has been proposed as the mechanism underlying cell differentiation for over half a century, with stable states represented as either valleys in the developmental landscape[2], [3] or as dynamic attractors in high-dimensional gene expression space [4]–[6]. These states are realized through a combination of inputs resulting from both the cytoplasm, and the host of the cell. The interplay of genes with these inputs within the cell drives cells towards large responses such as embryonic or carcinogenic differentiation, but also smaller responses such as metabolic activities [118], stress responses . Understanding these interactions is critical to the construction of novel gene circuits with a behavior that is predictable in silico, and consistent in vivo [119]. The ability to infer gene networks has made tremendous strides over the past decades thanks to advances along 3 main fronts. Decreased sequencing costs have lead to an explosion of available sequencing data, much of which is available to researchers from systems like the NCBI GEO DataSets [120]. Due to the nature of the problem, whereby tens of thousands of genes are predicted using hundreds of individual sample points, the problem is massively underdetermined. This ability to access and pool these samples puts better constraints on the system as a whole, and leads to much stronger conclusions during network inference. Second, significant advances have been made in the algorithms used for gene network inference. These advances have been

36 made through both algorithmic advancements, and novel approaches made possible through generally increasing computational power. Third, the quality of sequencing results has improved significantly. In specific the length of sequence that can be read successfully has improved, and these longer sequences can more easily be resolved to unique positions, increasing the quality of gene quantifications, and consequently network inference. Here we discuss network inference methodologies capable of inferring these network interactions from sequencing data and some of their limitations. We discuss limitations of tools downstream from the sequencers, and propose a solution to integrating data from multiple sources. We propose a technique for automatically collecting and pre-processing a large corpus of sequencing runs from public sources, allowing for increased power during the network inference phase. We demonstrate the viability of our pipeline as a means of both increasing input data, and decreasing gene expression noise, a significant problem in network inference [104], [121]–[123].

2.1 Network Models

Gene regulatory network (GRN) modeling is a technique dedicated making functional understanding of gene regulation at a network level, a crucial step in the understanding of cell fate determination. Specifically it seeks to understand the processes behind GRNs and employ them to engineer novel biological devices, metabolic processes, and therapeutics. Since the development of using Boolean networks [4] to model cell fate determination, there have been massive advances in model accuracy. These advances have sourced both from growth in computational abilities, but also in greater understanding of cells and gene networks. As computational power has grown [124], the desire for more powerful and predictive models has evolved the modeling

37 landscape from simple and lightweight into more complex and flexible models. Initially starting with Boolean networks [4], [125] and moving through deterministic and linear models [126], [127], Bayesian network modeling [128]–[130], and more recently to differential equation based models. Differential equations have the added benefit of being extensible to incorporate stochastic simulations in a more natural way than previous models.

Figure 9. Two fully connected(A/B), and two partially connected(C/D) triads are depicted. (A/C) illustrate only positive network regulation (activation), indicated by red arrows, i.e. increasing quantities of the transcription factor at the tail of the arrow also increases quantities of the transcription factor at the head. (B/D) utilize both positive and negative network regulation (repression), indicated by the blue bars, i.e. increasing the quantities of the transcription factor at the tail of the error decreases quantities of the transcription factor at the head

2.1.1 Boolean Networks

Since their introduction, Boolean networks have remained a first tool choice for many gene network modelers, even as more powerful tools become available [125], [131], [132]. The primary motivations for this are their low cost of simulation, and a strong theoretical background from computer science theory. Additionally, because the model

38 is very simple Boolean networks tend to be easy to fit with regard to experimental data. As a technique for network approximation this remains a go-to tool, with more recent and advanced Boolean network models [132], [133] capable of modeling cell state dynamics, and even hysteresis. While these models are efficient for an overarching view, they lose traction as one attempts to model cell state with more fidelity, both due to their discretization of the input domain, and particularly because of difficulties in modeling the influence of noise. Boolean networks are modeled as a set of boolean expressions defining transitions between state n and state n+1. As an example Equations (Equations (2.1)–(2.4)) show one set of boolean rules for the networks from Figure 9. It is relatively clear that these rules are overly restrictive by the network’s inability to model oscillations, some of the steady states, or many of the empirical behaviors observed by [119]. Larger networks, however, are still capable of modeling some stochastic behaviors, including oscillations and dynamic attractors [4]. These models have performed remarkably well [122] despite their simplicity, likely due to the significantly reduced number of features required to properly fit the model. Boolean network models have fairly limited usage in simulation [133], but these are still quite successful at inferring rules for the connectivity of genes, suggesting positive and negative regulations [121], [134], [135]. Boolean Equation for Figure 9A:

fx(x, y, z) = x ∧ y ∧ z

fy(x, y, z) = x ∧ y ∧ z (2.1)

fz(x, y, z) = x ∧ y ∧ z

39 Boolean Equation for Figure 9B:

fx(x, y, z) = x ∧ ¬y ∧ ¬z

fy(x, y, z) = ¬x ∧ y ∧ ¬z (2.2)

fz(x, y, z) = ¬x ∧ ¬y ∧ z Boolean Equation for Figure 9C:

fx(x, y, z) = y ∧ z

fy(x, y, z) = y ∧ z (2.3)

fz(x, y, z) = y ∧ z Boolean Equation for Figure 9D:

fx(x, y, z) = ¬y ∧ ¬z

fy(x, y, z) = y ∧ ¬z (2.4)

fz(x, y, z) = ¬y ∧ z In this formulation each variable (x,y,z) represents the protein abundance of one gene product. Much of the success of Boolean networks can be attributed to the relatively small number of features required to fit the boolean networks, and the simplicity of the resultant models. In the examples above the model only needs to determine whether gene X is activated, repressed, or indifferent to the behavior of other nodes. Boolean models can be extended to allow for some increased flexibility, for example instead of requiring all conditions to match only a subset are required to match, but as the number of subsets increases the difficulty in fitting the model increases as well.

2.1.2 Bayesian and Neural Networks

Bayesian and Neural networks offer a more powerful alternative to the less compli- cated Boolean networks. Bayesian networks allow a generated model and parameter

40 combination to be inferred almost in their entirety from the input data [129], removing some of the need for domain knowledge. Neural networks in particular have shown great success in fields such as gene expression profiling [136] and credit card fraud detection [137], [138]. Much of this success comes at two costs: first, the input data set must be quite large relative to the dimensionality of the problem [139], and second, much of the power behind these tools must be sacrificed in order to maintain human interpretability of the resulting model. Neural networks are typically highly connected [140], meaning that the parameters required are often on the order of O(N^2). Additionally, where most other networks have only an input or current step, and an output, or next step, neural networks often contain “hidden nodes” nested between the input and output space. These nodes have an effect on the result and are necessary to model high dimensional nonlinearity [134], but they dramatically increase the number of features rendering network inference problems exceedingly underdetermined. Additionally, the meaning of these hidden nodes is often difficult, if possible, to interpret. Bayesian Equation for Figure 9A:

Px(x, y, z) = a1P (x)b1P (y)c1P (z)

Py(x, y, z) = a2P (x)b2P (y)c2P (z) (2.5)

Pz(x, y, z) = a3P (x)b3P (y)c3P (z)

Bayesian Equation for Figure 9B:

Px(x, y, z) = a1P (x)b1(1 − P (y))c1(1 − P (z))

Py(x, y, z) = a2(1 − P (x))b2P (y)c2(1 − P (z)) (2.6)

Pz(x, y, z) = a3(1 − P (x))b3(1 − P (y))c3P (z)

41 Bayesian Equation for Figure 9C:

Px(x, y, z) = b1P (y)c1P (z)

Py(x, y, z) = b2P (y)c2P (z) (2.7)

Pz(x, y, z) = b3P (y)c3P (z)

Bayesian Equation for Figure 9D:

Px(x, y, z) = b1(1 − P (y))c1(1 − P (z))

Py(x, y, z) = b2P (y)c2(1 − P (z)) (2.8)

Pz(x, y, z) = b3(1 − P (y))c3P (z)

In Equations (Equations (2.5)–(2.8)) each variable (x,y,z) represents the protein abundance of one gene product. Bayesian networks are not significantly more difficult to learn than boolean networks, adding only a small number of new parameters. Thanks to this simplicity Bayesian networks have been used to replace Boolean networks when slightly more predictive power is needed, or more data is available [141], [142]. In the examples above the model needs to determine whether x is activated, repressed, or indifferent to the behavior of other nodes, and also determine the strength of each link probabalistically. Boolean models can be extended to allow for some increased flexibility, for example instead of requiring all conditions to match only a subset are required to match, but as the number of subsets increases the difficulty in fitting the model increases as well.

2.1.3 Differential Equation Networks

Linear and nonlinear differential equation network models provide an important advance in the understanding and modeling of GRNs [119]. They present fewer variables than Bayesian networks while allowing a potentially better behavioral fit.

42 Additionally they utilize a continuous input and output domain while maintaining the interpretability and some of the simplicity of Boolean networks. While linear and nonlinear differential equation models are superficially similar, recent research has shown that the effect of transcription regulations on genes is nonlinear in nature [12]. The nonlinearity of transcription regulator binding has been demonstrated and previous models have been capable of approximating nonlinearity through additional parameters, but we recently showed that the effects are nonlinear even with a single gene activator or repressor [12]. This nonlinearity explains the existence of multiple stable steady states and encourages yet more accurate modeling of cell state, as a small change in initial conditions in an undifferentiated cell may lead to a major change between differentiation pathways. Differential Equation for Figure 9A dx xn yn zn = a + b + c − δx dt 1 kn + xn 1 kn + yn 1 kn + zn dy xn yn zn = a2 + b2 + c2 − δy (2.9) dt kn + xn kn + yn kn + zn dz xn yn zn = a + b + c − δz dt 3 kn + xn 3 kn + yn 3 kn + zn Differential Equation for Figure 9B dx xn kn kn = a + b + c − δx dt 1 kn + xn 1 kn + yn 1 kn + zn dy kn yn kn = a2 + b2 + c2 − δy (2.10) dt kn + xn kn + yn kn + zn dz kn kn zn = a + b + c − δz dt 3 kn + xn 3 kn + yn 3 kn + zn Differential Equation for Figure 9C dx yn zn = b + c − δx dt 1 kn + yn 1 kn + zn dy yn zn = b2 + c2 − δy (2.11) dt kn + yn kn + zn dz yn zn = b + c − δz dt 3 kn + yn 3 kn + zn

43 Differential Equation for Figure 9D dx kn kn = b + c − δx dt 1 kn + yn 1 kn + zn dy yn kn = b2 + c2 − δy (2.12) dt kn + yn kn + zn dz kn zn = b + c − δz dt 3 kn + yn 3 kn + zn

In this formulation each variable (x,y,z) represents the protein abundance of one gene product (Equations (2.9)–(2.12)). Here only one generic equation is used to describe the gene expression regulation including transcription, translation, and other post-transcriptional and post-translational modifications. Each equation captures the core nonlinear dynamics of gene expression regulation and omits details of the

fine-tuning of it. The parameters (a1−3,b1−3,c1−3) represent the strength of mutual or auto-regulation, be it activation or inhibition. Each function (f, g or h) has the form Fn / (kn + Fn) when representing activation, and the form kn/(kn+Fn) when representing inhibition, where F can be either x, y, or z. The activation and repression of each gene by other factors are modeled additively to account for reported mechanistic independence of transcription regulations [16], [40], [46].

2.2 RNA Sequencing

2.3 Data Aggregation

The past decade has seen numerous revolutions in “next-gen” sequencing, and with them new tools to aide in their use. This has come largely as an increase in sequencing accuracy, increased sequence length, and the introduction of single-cell sequencing. Analysis (Table 1) shows that these breakneck advances in sequencing technology have resulted in a fractured landscape of techniques that are difficult to rectify, and that

44 Experiment Cell Types Reference Library Prep Aligner and ID Genome Quantifier 200021180 ES, Neuro2A, NCBI 37.1 custom Bowtie v0.9.6 MEF 200029087 ES, MEF mm9 custom Bowtie v0.9.6 200045284 NPC, MEF mm9 Illumina TopHat v2.03 TruSeq RNA and Bowtie Sample Prep 2.0.0.6 cuffdiff Kit v2 2.02 200045719 Liver and mm9 SMARTer Ul- bowtie 0.12.7 Development tra Low Input Rpkmforgenes cells RNA for Illu- mina Sequenc- ing 200052583 Lung mm10 Nextera bowtie2-2.1.0 tophat-2.0.8 cufflinks-2.0.2 200059127 Kidney mm9 Nextera TopHat v1.4.1 GeneSpring 12.6.1-GX- NGS-RV

200059129 Kidney mm9 Nextera TopHat v1.4.1 GeneSpring 12.6.1-GX- NGS-RV

Table 1. Overview of RNA-seq experiments analyzed with a listing of tool and asset versions. result in different outputs. In the interest of meta-genomics, authors should provide a set of pre-processed data, which could be easily utilized by researchers for downstream applications. In fact many grants require authors to submit their data to the NCBI GEO repository so that replication of results and downstream analysis is more readily available. Unfortunately this data is not always usable for researchers as the tools used to quantify read counts introduce a significant bias into the reads Figure 11,

45 when this bias is varied between experiments it introduces another source of technical noise. Additionally, while NCBI’s GEO requires authors to submit processed data there is little restriction on how the data is formatted and which calculations are used, with some authors providing transcripts per million (TPM) [143], others fragments per kilobase per million mapped reads (FPKM) [144], others provide a direct read count [145], and some provide alignments for the individual reads without quantification [146]. The next possibility is that researchers could have an easy way to take arbitrary unprocessed reads and align them on their own. To this end there has been a great deal of effort; from standardization of reference genomes, to online resources such as Galaxy [147], [148], even including tools such as mygene.info [149] capable of converting between different gene nomenclatures. Despite this progress many holes remain in the analysis pipelines; quantified gene readings are not sufficiently available (or are inconsistent), and quantifying genes is too complicated, requiring significant computing power and in-depth knowledge of both the sequence library preparation technique, and of how to perform alignment and quantification in general. Unfortunately this kind of broad knowledge is not common, and while critical to the alignment phases, having this kind of information that is not of interest to most downstream computational studies. To this end we created a tool for automatically downloading experimental data, recognizing experimental parameters, and performing alignment and quantification steps such that output data has a minimal bias, and that the bias provided is uniform among experiments. We further demonstrate that this data contains a smaller amount of noise relative to the disparate processing pipelines provided by the original authors.

46 2.3.1 Acquiring the Data

Through our analysis of datasets publicly available in NCBI’s GEO repository we found a few key problems:

1. Varied adapter sequences 2. Multiple barcode handling techniques 3. Differing sequence lengths

Depending on the sequencing chemistry used there are different adapters present to attach the recerse-transcribed mRNA transcript to the sequencer. In theory these sequences should not be included in the reads provided, and as a result should have no influence on downstream processing. In our results we find that they frequently remain within the sequenced records, likely as a result of multiple adapter attachment. While analyzing one data set we found over 30% of reads containing a copy of the adapter sequence (NCBI GEO GSE52583) [143]. Unfortunately GEO does not provide a standardized field for storing the adapter sequences that were used, consequently one must analyze the sequenced data and search for overrepresented sequences occurring on the forward or reverse strand near either end of the read. The results of a failure to remove the adapter sequence are ambiguous and further depend on the alignment pipeline used. Read aligners will have varying degrees of success in aligning the reads to the genome, with some spurious alignments of the sequence against somewhat random locations in the genome that happen to match the adapter sequence. In order to remedy this our pipeline performs an automated search for likely adapter sequences in the sequenced data, and trims them to reduce this noise. In the case that the reads are too short after the adapter has been trimmed, the reads are discarded.

47 One major problem in sequencing is the random dropout of individual molecules during sequencing. To overcome this issue most experiments overcome this issue by using a PCR amplification step, resulting in multiple copies of the same read, slightly biased by the PCR procedure [150]. Brunskill et al [144] demonstrate that this PCR bias can be overcome by attaching a barcode to each unique molecule. This provides a significant reduction in noise, with a minimal loss of sequencing length. This too provides a challenge to individual researchers who may not be aware of the presence of a barcode, or the constant flanking sequences. Failure to account for the barcodes can reduce the accuracy in the same way that an ignored adapter can, while simultaneously throwing away valuable information. To account for barcodes our pipeline has a special search within the adapter finder to look for varying bases, which are automatically recognized as barcodes and utilized to remove duplicated reads. Read lengths within second generation sequencing continue to grow, starting with Illumina’s 25 base pair single-ended read, through their current 300 base pair double-ended read. This increase in read length is critical to the proper resolution of RNA isoforms, and unique mapping of identified reads. Nonetheless, it too presents challenges. Increased read lengths drastically increase the probability of reading across an RNA splice junction meaning that the genome annotation used should contain the most extensive and accurate list of splice junctions for your organism of choice. Most researchers use the University of California Santa Clara (UCSC) annotations, but even among those annotations there are multiple versions in use. This results in yet another source of noise as different annotation versions will refer to identical genes with different names that an also be ambiguous, mapping to more than one gene in annotations from a different group. Even differing versions of annotations from a

48 single group may contain different genes, or may rename genes that were previously suspected and were later confirmed.

2.3.2 Noise Minimization

Figure 10. In this figure we present the data processing pipeline from 5 other ScRNA-seq experiments available on NCBI GEO. GSE59127 and GSE59129 (left, blue), GSE 52583 (left center, red), GSE21180(center, olive), GSE45284 (right center, orange) and our pipeline (right).

As shown in Figure 11A-C the coefficient of variation(CV) of our pipeline is approximately in line with those provided by the providers of the respective data sets. In Figure 11D we demonstrate that despite having similar intra-experimental CV’s our pipeline produces a reduced inter-experimental CV compared with the mixed

49 Figure 11. Scatterplots showing ratios of Coefficient of Variation(CV), where points represent CV of a single gene within the original author’s pipeline, and ours. The x=y line drawn, representing the point where CV is equal between both pipelines. Points on the x>y side are drawn in red, and point on the y>x side are drawn in black. For the experiments in (A-C), raw scRNA-seq reads are provided by the NCBI GEO service and analyzed using our customized pipeline. In addition to raw reads each author provides estimates of gene activity levels through FPKM or TPM. In (A-C) we plot the coefficient of variation (CV) of our pipeline results against those of the original authors. In (D) we group the experiments from (A-C) to visualize inter-experimental CV. pipelines of the original authors. Furthermore we expect that the effect of this CV reduction will continue to become more pronounced as the number of aggregated works, and distinct sequencing pipelines, increases.

50 2.3.3 Future Work

We propose to add two components to the pipeline: automated parameter optimization and post-processing of the data. Due to differences in the data it is currently important to determine key features such as maximum and minimum fragment lengths, adapter and barcode presence, and in some cases even the format of the read type(color space or base space) and quality score. This is currently done manually, with some parameters determined through examination of the sequencing procedures and sequencing files with other parameters collected by aligning a subset of the data and then resequencing the remaining data. Automation of these tasks can be performed well through a rule-based system utilizing lists of sequencers, and a distribution-based analysis of read lengths. In addition, we propose to post-process data to remove previously included contextual noise, such as cell-cycle noise. Not all cells undergo cell cycle, but much of the available scRNA-seq data is from differentiating cells. This differentiation noise can obfuscate subpopulations of cells, which may have expression levels that are masked by the boom and bust of genes required for cell division [151]. In differentiating T cells upwards of 50% of non-technical noise can be accounted for by cell cycle variations [151]. The techniques to remove this noise can also be expanded to account for other sources of noise, allowing us to extract and classify noise sources, validate them biologically, then remove them from the data source. A large pool of homogenized data will also allow us to replicate algorithmic experiments from other groups to examine whether they function generally, or only in special cases, and similarly to develop algorithms that generalize well across data sets from different labs.

51 Chapter 3

NANOPORE READ SIMULATION

Nanopores represent the first commercial technology in decades to present a significantly different technique for DNA sequencing, and one of the first technologies to propose direct RNA sequencing. Despite significant differences with previous sequencing technologies, read simulators to date make similar assumptions with respect to error profiles and their analysis, resulting in incorrect characterization of nanopore error. This is a great disservice to both nanopore sequencing and to computer scientists who seek to optimize their tools for the platform. Previous works have discussed the occurrence of some bias in the identifiability of certain k-mers, but this discussion has been focused on homopolymers, leaving unanswered the question of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs, and what can be done to reduce the effects. In this work, we demonstrate that current read simulators fail to accurately represent k-mer error distributions. We explore the sources of k-mer bias in nanopore basecalls, and we present a model for predicting k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art simulator, and demonstrate that it provides higher accuracy with respect to 6-mer accuracy biases.

3.1 Introduction

DNA sequencing has become an integral component of biological research, with ap- plications ranging from gene network or organism identification through biological engineering. This is accomplished by reading analog nucleotides and representing

52 them as digital sequences of bases that can be further analyzed downstream. Histor- ically, sequencing has been dominated by synthesis-based approaches involving the replication of an existing DNA or cDNA molecule, with the fluorescent nucleotides, providing a stepwise 4-dimensional color space readout of the DNA sequence. This approach provides high accuracy through amplification of the input sequence, but these amplification steps obfuscate modifications on the DNA, and lead to uneven representation of input sequences, largely as a function of GC content [152]. A novel approach known as nanopore sequencing allows for the identification of an input sequence by measuring electrical current signals as the DNA strand passes through a nanopore. This eschews the pre-amplification step, reducing sequence-selection bias, and allowing for the detection of modifications on the raw DNA strands. Nanopore data introduces novel challenges when compared with previous genera- tions of sequencers. First, nanopore reads provide significantly larger mean lengths and higher error rates than synthesis-based approaches (10,000 bases at 80% accuracy, compared with 500 bases at 99.9% accuracy) [153]. Additionally, synthesis-based approaches tend to have an error profile that is not heavily biased by the underlying DNA sequence, while nanopore errors are strongly connected to the DNA sequence be- ing read [152], [154]. This bias is unaccounted for in most simulators and downstream analysis tools, and lacks a published model, but it has significant implications. For most downstream analysis, sequences that are read in must be aligned to a reference genome, or to another sequence. Because sequence alignment algorithms are O(N 2), and most genomes are immense, almost all current generation sequence aligners rely on an indexed set of perfect or nearly perfect “seed“ sequences to propose candidates for an alignment, where a seed is 6-30 nucleotides that must match on each sequence. Because of observations on the DNA sequencing data when these aligners were writ-

53 ten, seeds are typically treated as either completely equal, or as a function of “seed” uniqueness within a genome [155]–[158]. Similarly, after matching a “seed“ between two sequences the alignment is extended utilizing a cost function for matches and mismatches, but these cost functions also do not incorporate the varied probability of error within across a DNA sequence. Furthermore, the ability of the community to address these issues has be hampered by a lack of tools capable of simulating these results, and a model to explain possible sources of this bias. Biological data is often difficult to obtain, requiring significant time and reagents to culture, prepare, and process experimental samples. Because of this it is beneficial to run in silico trials of their experiments before attempting a “real” run. Because these in silico results are being used as a proxy for real data it is similarly desireable to minimize the differences between in silico results and experimental results. To this end simulators have been made for most sequencing technologies, incorporating various facets and behaviors intrinsic to technology. Because nanopore sequencers are novel in their approach to reading DNA there many of the assumptions inherited from previous simulators are no longer valid, and while some invalid assumptions have been identified there are more that remain [159]–[161]. We demonstrate that the specific sequence under investigation strongly contributes to the local probability of observing an error. While previous studies have observed that homopolymers are not identified correctly we extend this to show that it is more extensive than the previously observed homopolymer bias, and that current generation simulators are unable to properly model this variation in error probability. We propose a read simulator based on a Markov chain, with observations modified by the underlying seuqence and demonstrate that it provides a higher fidelity to the error distribution, which fit from real data. To accomplish this task, we identify the

54 accuracy of individual k-mers and their behavior in different contexts. We also predict key features of the error bias, and calculate their individual contributions to the final error. Our simulator is fully automated, allowing for both in silico amplification of read data, or simulation of data on completely novel genomes. Our contributions are summarized below:

• We demonstrate that while some base calling error is random there is a significant component that is systematic and unaccounted for (section 3.4.1). • We develop a set of features for predicting k-mer accuracy and show that our model generally correlates well with data found empirically (section 3.4.2). • We propose an algorithm for simulating reads, a variant of the Hidden Markov Model (HMM) employed by Nanosim [159], with modifications to apply observed k-mer bias. We demonstrate that it models the k-mer error distribution better than other popular read simulators [159]–[161] (section 3.4.3).

3.2 Related Work

Cost, throughput, and accuracy have been major hindrances in DNA sequencing. With the development of NGS technologies cost has reduced while throughput and accuracy continue to climb. Still, simulators offer a significant benefit due to their low cost and exceptionally high throughput, allowing testing while developing new algorithms. Simulators aim to produce sequences with the most fidelity possible for their given platform. As such, simulated reads should account for biological and technical bias [162]. Simulators typically generate synthetic reads by extracting a sequence from a reference genome and then introducing errors into that sequence. Parameters required by simulators to introduce these features into the sequence are either provided at run

55 time or are stored inside a metadata called a model or error profile. By analyzing the alignment of empirical data to a reference genome error profiles are created. Errors have been generated by first predicting a quality score [163], by base position within a simulated read [164], or by predicting an error sequence and then applying it to a read [159], [160], [165]. Modeling of third generation single-molecule reads has many advantages when compared with second generation sequencers. Polymerase chain reaction (PCR), required by second generation sequencers during the pre-amplification step, introduces significant bias, but is not required for third generation sequencers [166]. GC bias, a secondary effect of the PCR amplification step, resulting in low base accuracy and high coverage variability, is also removed [167]. Third generation sequencing on the other hand has new challenges that must be modeled. Simulators for third generation sequencers must deal with longer reads containing significantly larger stretches of errors, and a biased error that varies with nearby nucleotides. There are two main platforms for third generation sequencing, PacBio’s SMRT sequencing, and Oxford Nanopore’s nanopore sequencing. Each platform has their own read simulators, but none are unable to properly model the k-mer bias of nanopore sequencing data. NanoSim [159] is a read simulator for ONT data, modeling reads as the result of a HMM. The model is fit by aligning empirical reads to a reference genome, then collecting a list of error subtypes, lengths, and transition probabilities. Reads can be simulated by sampling a length, then generating a sequence of errors. One major limitation of this approach is that it is unable to properly model k-mer bias; this is somewhat overcome by a post-processing step where all homopolymers of length greater than 5 (e.g. “AAAAAAA“) are compressed to a length of 5. LongISLND [161] is a read simulator developed for PacBio data. It models k-mer

56 bias explicitly and directly by storing observed mutations for each k-mer it sees, stored as a key-value pair. It also uses an Extended K-mer (EKmer) model to aid in homopolymer error generation. An EKmer is a regular k-mer followed by an integer representing a length of homopolymer covering the middle term in the k-mer facilitating the simulation of arbitrary stretches of homopolymer without explicitly observing them. One major limitation of LongISLND is that while is is able to approximate the k-mer bias it lacks the accuracy level observed in true ONT data. This is likely due to ONT data having grouped errors (i.e. the probability of error given an error exists nearby is higher than normal) that are not properly modeled using their simulator.

3.3 Problem Description

DNA has a label space of Σ ∈ {A, C, T, G} representing the different nucleotide bases. Let s ∈ Σn be a DNA strand of length n. Using Nanopore technology, a series of discrete measurements δ is generated, representing the current that passes through the nanopore at each time step from time t0 to tT , where tT reflects the total time required for s to completely transit the Nanopore. This creates a corresponding vector

d ∆ ∈ R , where d >> n is the number of discrete measurements obtained from time t0 to tT .

The vector ∆ is subsequently binned into q different bins, B = b1, b2...bq, using

d different time intervals that capture ≈ n measurements per bin. The mean µi and standard deviation σi are subsequently derived for each bin bi. An approximation strand, sˆ, of the original strand s is reconstructed or “base called” using the sequence of (µi, sigmai) pairs using either a Recurrent Neural Network [168], or a HMM [153],

57 (A) (B)

(C) (D)

Figure 12. The accuracies of all 6-mers with accuracy on the vertical axis, and a kmer index on the horizontal axis. (A) All experiments sorted with respect to their accuracy. (B) Experiments sorted with respect to the Escherichia coli R9 2D experiment. (C) The R9 E. coli template experiment divided randomly into 5 groups, sorted with respect to the first group. (D) Only the R9 E. coli template experiment, aligned with both Graphmap and BWA-MEM, sorted with respect to BWA-MEM.

[169] in conjunction with a lookup table from the Nanopore manufacturer that specifies the most probable k-mer base pair for the given µi value. Using the R9 pore from

Oxford Nanopore µi is best explained by a sequence of 6 DNA bases, thus the k-mer length k = 6 is used in later analysis.

58 3.3.1 Error Source Identification

Naively, one may assume that error is distributed evenly over sˆ, however it is trivial to see that it is not the case. The expected mean current µ exists on a linear range, thus k-mers near the minimum current are not as influenced by δ readings lower than their expected mean, with the opposite true for readings near the top of the range. Second, if the density range of the k-mer currents is uneven, more error is to be expected in high density regions [154]. Third, some sequences of k-mers are easily identifiable due to large changes in current, while others are difficult to identify due to small changes [154]. To elucidate the drivers of k-mer bias we first calculate a list of k-mer accuracies K, and propose a set of features F , where each feature fi confers a nonnegative accuracy to each k-mer. Error cannot be negative, thus we can approximate the influence of each feature by solving Equation 3.1.

arg min kF x − Kk2 x (3.1) subject to x ≥ 0 Where x is a vector indicating the contribution of each error feature F. Fx is then the best approximation of the original k-mer bias with respect to euclidean distance.

3.3.2 Error Features

Error during sequencing can occur in multiple locations, and from multiple sources. Some of these sources of error are recoverable because of constraints enforced by the sequential relationship of the bins B, i.e. that each bin predicts a k-mer for itself, but also implies limitations on adjacent k-mers. Other error sources are significantly

59 more difficult to recover from, such is the case. To this end we generated features to capture specific subtypes of errors. Sequence Identifiability refers to the ability to easily differentiate one sequence of k-mers from another given a sequence of bins. These kinds of errors can arise if there is a shift in the current that persists for multiple bins, or a drift between recalibration phases. This can be approximated by calculating the number of sequences of length L with a manhattan distance lower than a provided threshold [154]. We also analyze the local density of the pore model, as highly dense regions reduce the probability that a bin’s mean value implicates a specific k-mer. Transition Identifiability Some transitions between k-mers result in an expected current change smaller than the standard deviation of either nucleotide. As a result, some of these changes are not captured successfully by the event detection algorithm, or in an attempt to correct this behavior there may be multiple bins that should only be a single bin. In nanopore base callers these gaps or duplications are known as “skips“ or “stays”. To predict this, for each k-mer, we calculate the difference in mean current value between the k-mer under inspection and all immediately previous and subsequent k-mers. We also calculate the number of steps required to return to the same k-mer, and the difference in current that should have been observed across those steps.

3.3.3 Read Simulation

Read simulation occurs in 2 phases: model fitting, then simulating reads from an input genome. Fitting the model requires existing reads to be aligned to a reference genome, alignment is performed using BWA-MEM [156] with standard options. From each read, the top alignment is selected, and unaligned reads are discarded. From the

60 remaining reads, parameters are extracted with respect to average error length for the error types {Insertion, Deletion, Mismatch}, transition probabilities, and the accuracy of all k-mers of length 6. The inverse of the k-mer accuracy is also used as a “cost” of correctly predicting a k-mer. These parameters are then stored for downstream simulation. Reads are simulated according to a HMM which generates transitions between correct and erroneous stretches, and the lengths of those stretches. This approach has the benefit of easy explainability, but has limitations in that it is unable to correctly model k-mer accuracy distribution. Moreover, this shortfall is difficult to overcome, as modeling k-mer accuracies as states would require an intractable number of parameters. To remedy this situation we allow errors and error lengths to be generated according to the HMM, but we adjust the lengths by using a cost function described above in a process detailed in Algorithm 1. As a second approach we introduce read simulation as an optimization problem where we attempt to minimize the differences between X ∈ Rn approximated by Xˆ ∈ Rn, where each x is a parameter, including accuracies of each of the 6-mers, transition probabilities between each error subtype, and average correct read length. This is formally specified in Equation (3.2).

ˆ minimize{kX − Xk2} (3.2)

Minimization is achieved by monte carlo simulation, iteratively proposing errors to the simulated sequence until the error difference reaches a threshold, or an empirically determined number of mutations has been rejected. The overall algorithm is detailed in Algorithm 2.

61 Data: read extracted from sample genome Result: mutated read while readP osition < readlength do

scaleFactor = mean k-mer cost around readPosition; isError = random ∗ scaleF actor < .5; if isError then

errorType = transition probability from HMM; end

budget = length from model based on errorType; while cost < budget do

cost += cost of k-mer at readPosition + i; i += 1; end if isError then save the mutation to a mutation list end

readPosition += i; end Algorithm 1: Generation of mutation list for sampled read using the k-mer biased HMM model.

62 Data: read extracted from sample genome Result: mutated read ˆ while kX − Xk2 > minDifference AND numReject < maxNumReject do

errorPosition = uniformRandom; errorType = random based on parameters; errorLength = length from model based on errorType; tempXˆ = Xˆ with error applied; ˆ ˆ if kX − tempXk2 < kX − Xk2 then

Xˆ = tempXˆ; else

numReject += 1; end

end Algorithm 2: Random introduction of errors to achieve minimum difference.

3.4 Results

3.4.1 Modeling K-mer Bias

Before attempting to model k-mer bias in real data is it important to understand the granularity at which it occurs, and the consistency. To this end we examined 2 public datasets 1 2 to determine whether bias was present, and to what extent. We show that while there is a consistent k-mer bias within experiments, there are significant differences between the R7 and R9 pores in Figure 12. We attribute the majority

1http://lab.loman.net/2016/07/30/nanopore-r9-data-release/

2http://lab.loman.net/2014/10/01/where-can-i-get-oxford-nanopore-miniontm-data-from/

63 (A) (B)

Figure 13. A plot of the non-negative least squares (nnls) fit of the predicted features compared to the actual distribution from the R9 E. coli 2D experiment, sorted by the true experiment. (A) All model feautures and the true accuracy of neighboring k-mers as a feature. (B) Only the features that are fully predictable from the model, as described in bias souces table. Bias Source Example Low Example High Contribution Transition Identifiability AAAAAA ATATAT 64.6% Sequence Identifiability CTGTCA TATATT 3.32% Standard Deviation CTAGAG TTGAAA 1.79% Fraction A CGTCGT AAAAAA 9.73% Fraction T CGACGA TTTTTT 2.90% Fraction C AGTAGT CCCCCC 9.33% Fraction G CATCAT GGGGGG 7.03% Table 2. Identified error sources and their contribution to k-mer overall accuracy with respect to the bias source of the differences to the pores themselves, but some influence also likely comes from different versions of ONT basecaller used on each dataset.

64 3.4.2 Identifying Bias Sources

After identifying the presence of k-mer bias we attempted to elucidate the source of the bias. To this end we generated a set of features with each providing a score for each k-mer, and we attempted to find a linear combination of each feature capable of explaining the bias, and providing a fractional contribution of each step. As shown in Table 2 multiple error sources influence the accuracy fraction of each k-mer. We find that the strongest feature for predicting the accuracy is the median accuracy of neighbors at one step away. While this is not directly helpful, it does suggest that direct modeling of error sources must incorporate neighbor accuracy aspects. Overall we find that our predictive model provides a good first attempt, providing some insight, but leaving much of the error signal uncaptured, as illustrated in Figure 13. We expect that much of the remaining error signal is a result of missing features that may be identified through a more thorough analysis. We also expect that some of the error signal cannot be captured through linear combinations. For example, the original basecallers were incapable of capturing homopolymers with a length greater than 5, using linear combinations the k-mer “AAAAAA” would need to have 0 accuracy across all features, an unrealistic expectation.

3.4.3 Simulation Results

To validate our simulation results we generated read profiles for Nanosim [159], PBSim [160], LONGIslnd [161], and the two simulators we have proposed. Nanopore sequencing experiments typically yield between 10,000 and 20,000 reads [159], with measured statistics being approximately identical at even 20% of this size as shown in Figure 12 (C). To measure this effect in simulations, we generated 2 data sets with

65 (A) (B)

(C) (D)

Figure 14. Comparison of 3 previously published sequencing simulators with the new model we proposed. Fit of R9 pore model k-mer accuracies sorted internally (A), and by 2D E. coli, the training model (B). Fit of R7 pore model k-mer accuracies sorted internally (C), and by 2D E. coli, the training model (D). each simulator: one with 4,000 reads, and one with 20,000 reads. These reads were then aligned using BWA-MEM [156] with standard parameters. The sum of squares error is the sum of the difference between the model 6-mer accuracy and the simulated k-mer accuracy, for all 6-mers. Results of both large and small simulations are shown in Table 3, while the 20,000 read simulations are shown in Figure 14. Ultimately we

66 Source Sample Size SSE Split R9 6,481 0.121 LongISLND R9 20,000 167.04 LongISLND R9 4,000 167.80 Nanosim R9 20,000 44.26 Nanosim R9 4,000 46.76 SNRG R9 20,000 9.52 SNRG R9 4,000 9.66 LongISLND R7 20,000 88.78 LongISLND R7 4,000 88.96 Nanosim R7 20,000 30.76 Nanosim R7 4,000 36.16 SNRG R7 20,000 5.24 SNRG R7 4,000 5.48

Table 3. Sum of squares error (SSE) between k-mer simulators and their respective training sets, and between the fragmented R9 data set and the complete R9 data set

find that sample size has very little influence on existing simulators, as the k-mer bias is either completely un-modeled, or has systematic difficulties capturing the k-mer error rate (LongISLND). In our simulators we find some improvement through increased sample size, though most of the remaining difference does appear to be from systematic errors.

3.4.4 Simulator Complexity

In addition to validity, the performance of a simulator is also of great importance. If reads can be simulated properly but with an excessive requirement on the processing power or memory the simulation is of little value. To this end we benchmarked

67 Genome (MB) Sample Size Cores max RSS (MB) Runtime (h:m:s) E. Coli (4.64) 2000 4 659.51 0:48:20 E. Coli (4.64) 2000 1 532.87 2:18:41 E. Coli (4.64) 10000 4 1,105.60 3:33:01 E. Coli (4.64) 10000 1 636.41 12:36:30 E. Coli (4.64) 20000 4 1,666.17 6:58:18 E. Coli (4.64) 20000 1 810.46 22:53:12 Saccharomyces (12.1) 2000 4 670.12 0:51:37 Saccharomyces (12.1) 2000 1 558.66 2:44:32 Saccharomyces (12.1) 10000 4 2,040.84 3:57:08 Saccharomyces (12.1) 10000 1 560.92 17:11:49 Saccharomyces (12.1) 20000 4 1023.23 12:29:48 Saccharomyces (12.1) 20000 1 556.47 31:38:11 Table 4. memory and wall-clock time consumption of SNaReSim under various conditions the simulator on an 8-core Ubuntu system with 32GB of RAM under 3 different genome sizes and experiment sizes. The 2 genomes are chosen to be E. Coli, and Saccharomyces Cerevisae, as each are model organisms, with at an order of magnitude of size difference one magnitude of difference in size. The sizes chosen are 2,000, 10,000, and 20,000, to represent simulation of small, medium, and very large experimental runs. data was collected by running the simulator through ’time -v’, where each result was tracked for speed and memory consumption. In order to improve results, each test was conducted 5 times with only the median reported. The results are shown in table 4.

The results illustrate a few strengths, and a few weaknesses, with many weaknesses arriving from inefficient Python optimizations. Standard Python implementations for

68 example (CPython) contain a Global Interpreter Lock (GIL), meaning that threading where global variables are shared between threads is not capable of utilizing more than one CPU core completely. To this end many projects are implemented using processes, where each global variable is replicated and communication must occur using pipes, or a block of data that will be returned by the process. SNaReSim implements data passing in the second way, largely because of implementation simplicities, but this results in a significant memory waste observed as the difference between the 4-core and 1-core versions. When running in a single-thread mode simulated reads are immediately dumped into an open file descriptor, meaning that the resident set size is able to be quite small. The Multi-core implementation on the other hand must pass the data blocks back to the head process before writing to the file. This difference in memory consumption could be largely resolved by having child processes write to a queue, and having the parent process constantly write the queue contents into a file.

3.5 Conclusion

We have demonstrated the presence of biased k-mer accuracy within Oxford Nanopore Technologies sequencing platform. We show that this bias is consistent within ex- periments, and between sequence aligners, but varies between pore models. This information is of significant value to sequence aligners, which typically use in index of seed sequences to begin alignment, in order to reduce the search space and increase speed. Due to the bias in k-mer accuracy we show that some seed sequences are virtually impossible to correctly identify in sequenced samples, while other seeds have an accuracy significantly greater than the mean. We demonstrate that this k-mer bias has some observable and predictable founda- tion in the mean current readings of each k-mer(i.e. the manufacturers pore model),

69 although significant progress remains to be made. This provides a way to estimate the read accuracy for a provided pore model without performing large sequencing runs; it also suggests that a model-guided approach for nanopore design could improve overall accuracy. Alternatively it could allow for the development of specific pores, where accurate discrimination of some k-mer subtypes is more important than others. Finally we propose a novel nanopore read simulator capable of modeling the k-mer accuracy bias observed in this experiment. We demonstrate that our simulator has a significantly reduced sum of squares error with respect to 6-mer accuracy when compared with other third generation sequencing simulators. This provides a more realistic benchmark for sequence aligners and genome assemblers to compare against. The availability of high accuracy reads allows for the exploration of new applications, including; sequencing of larger organisms, organism disambiguation when sequencing a population, exact sequence detection in diploid and polyploid organisms, and the ability to scaffold genomes across exceptionally long repeat regions. Providing an understanding of accuracy in nanopore design, and development of tools to aid the alignment of the produced reads is thus critical to continued progress.

70 Chapter 4

READ CORRECTION

Nanopore sequencing has introduced the ability to sequence long stretches of DNA, enabling the resolution of repeating segments, or paired SNPs across long stretches of DNA. Unfortunately significant error rates >15%, introduced through systematic and random noise inhibit downstream analysis. We propose a novel method, using unsupervised learning, to correct biologically amplified reads before downstream analysis proceeds. We also demonstrate that our method has performance comparable to existing techniques without limiting the detection of repeats, or the length of the input sequence.

4.1 Introduction

DNA sequencing has become a critical part of most biological research, in tasks ranging from gene network identification, to biological engineering, to organism identification and the generation of phylogenies. Most of these applications have 2 primary foci: the length of DNA sequence reads, and the per-base error in those reads. Third generation sequencing technologies presented by Oxford Nanopore Technologies(ONT) and PacBio offer read lengths 10-100x what was possible with previous sequencing technologies, but at a per-base read accuracy near 85%, down from 99.99% using second generation technologies. While the increased read length enables many new biological applications [170], [171] the low read accuracy hinders others. Extensive research has gone into developing techniques that can robustly improve Nanopore read accuracy and analyzing the trade-offs [153], [172], [173].

71 We propose an approach to correcting errors that does not make assumptions about the presence of a reference genome and is robust to mixed biological populations where reads with significant overlap must be identified as different. To accomplish this task multiple copies of the same original DNA sequence must be read, resulting in a large number reads, of which a small number are high-error copies of one another. We then use the K-means algorithm to cluster reads based on an engineered feature space. This technique does not require replicated reads to be biologically attached [174], meaning that the individual sequence lengths are not constrained. Additionally our approach groups reads with their own replicates, removing the need for a reference genome. Our contributions are summarized below:

• We demonstrate that while some base calling error is random there is a significant component that is systematic and can be modeled. in section 4.4.1 we present a simple model for predicting easily identifiable k-mers and show that our model correlates well with data found empirically. • We propose a feature space based on the easily identifiable k-mers that accu- rately groups copies of unique DNA strands. In section 4.5.1, we demonstrate comparable clustering performance to MinHash [175] with a number of simulated reads similar to biological experiments. • We demonstrate that a simple K-means approach can replace a more complex biological process to group unique DNA strands together. • The proposed method allows for the analysis of larger DNA strands compared to INC-Seq [174].

72 (a) replicated reads (b) nanopore sequencing (c) base call

GGCACGTTAC GGTCAGTTAC

GGCACGTTAC GATCAGTTAC

GGCACGTTAC GGGTCACGTTAC . . .

Figure 15. (A) Strands of DNA bases are added to the nanopore sequencer as an analog data input (B) As strands pass through the nanopores electrical current readings are taken. These readings are consequently grouped into “events”, with each event approximately representing a k-mer (sequence of bases). These events may be improperly split, events may have a mean value higher or lower than expected for a DNA k-mer, and event lengths may be different. (C) A base caller uses the sequence of events to predict the original sequence of bases.

4.2 Related Works

Given the importance of sequence read accuracy, a variety of methods have been proposed for increasing read accuracy. These approaches centered around either detecting overlapping regions in genomic alignments to provide read corrections [153], [157], [176], or correcting reads before alignment [172], [174], [177], [178]. Read overlapping [157], and correction from genomic alignments [153], [176] has shown significant potential for read correction. These techniques have shown the ability to increase sequence accuracy into the high 90%’s, with relatively minimal requirements [153]. One major shortcoming is that they are limited by error rate and genomic similarity, with large genomes or high error rates resulting in excessive correction times or erroneous results [176]. Some of these shortcomings can be softened

73 by the presence of a reasonable reference genome, as the error rate for read-genome alignments is approximately half of that seen in read-read alignments. Read pre-correction has gained significant steam recently, and particular advantages have been seen with small genomes which contain more than 10x even within a single sequencing run. Early attempts at pre-correction focused on aligning high accuracy second generation reads to long nanopore reads [172], [178]. This provides the possibil- ity to resolve many features but introduces systematic error within repeating regions of the reads. More recent attempts have focused on nanopore reads exclusively [174], trading a significantly reduced read length for increased read accuracy. To a large extent both of these correction efforts are unacceptable. Introducing a systematic bias into sequencing results prevents the detection of mutations in sequencing experiments, or leads to incorrect references during de novo genome constructions [172], [177]. Decreasing the relative read length is similarly detrimental in that much of the benefit of 3rd generation sequencing is consequently lost. After correction reads typically must be mapped to a reference to be used. To aid in this mapping high-information sequences have been explored. Although because previous technologies yielded a very high level of accuracy ’high-information’ is taken to mean rare k-mers, ones that appear less frequently than random chance would suggest. Focusing on these highly informative subsequences has allowed for significantly faster sequence matching [175], sequence pseudoalignment [179], or full alignment [158], [180]. Unlike second generation sequencing, nanopore has significantly higher error, and it is introduced as a biased error in readings. Because of this, the definition of high-information subsequences can be stretched to incorporate the probability of being able to correctly identify a k-mer.

74 4.3 Problem Description

DNA has a label space of Σ ∈ {A, C, T, G} representing the different nucleotide bases. Let s ∈ Σn be a DNA strand of length n. Using Nanopore technology, a series of discrete measurements δ is generated that represents the change in electrical current as each base passes through the Nanopore from time t0 to tT , where tT reflects the total time required for s to completely transit the Nanopore. This creates a corresponding vector ∆ ∈ Rd, where d >> n (d ≈ n ∗ 12 ) is the number of discrete measurements obtained from time t0 to tT .

The vector ∆ is subsequently binned into q(≈ n) different bins, B = b1, b2...bq,

d using different time intervals that capture ≈ q measurements per bin. The mean µi, and standard deviation σi are subsequently derived for each bin bi. An approximation strand, sˆ, of the original strand s is approximated from or “base called“ using µi in conjunction with a lookup table from the Nanopore manufacturer that specifies the most probable k-mer base pair for the given µi value. Using the R9 pore from Oxford

Nanopore µi is best explained by a sequence of 6 DNA bases, thus the k-mer length k = 6 is used. Thus, a function that uses manufacturer provided information will

6 yield a 6-mer f(µi, σi) = [A|T |C|G] . Unfortunately, the approximation sˆ obtained from this function has > 15% [174] median error rate.

Let each approximated DNA strand be treated as a single data point {sˆ1, sˆ2, ...sˆv}, where v is the total number of strands being analyzed. Let there also be G =

G1,G2, ...Gp different groups associated with p unique strands. If copies are made of each strand such that v >> p, we would like to group each strand with its respective copies. The problem is formally defined as:

p X X 2 arg min ||φ(ˆs) − ck|| , (4.1) G k=1 φ(ˆs)∈Gk

75 where φ(·) represents the transformation of a DNA strand to a descriptive feature space. ck is the centroid of group Gk defined as

1 X ck = φ(ˆs), (4.2) |Gk| φ(ˆs)∈Gk which is the mean of group k in the transformed feature space. Therefore, selecting an appropriate feature space transformation is vital to clustering the copies together. The objective becomes finding a feature space transformation that minimizes the distance of the centroid and its members, such that the centroid is accurately representative of each unique strand.

4.4 Proposed Method

4.4.1 High-Accuracy K-mers

One key step in clustering DNA sequences is identifying reliable features to generate a “thumb print” for each DNA strand. This thumb print allows copies of each respective DNA strand to be grouped together with high accuracy so that subsequent sequencing corrections can be performed. In order to find the most reliable thumb print, we analyze the transition behavior of the DNA bases passing through a Nanopore. Each 6-mer DNA sequence will have six different transitions reflecting the transition of the sequence, one base at a time, through the pore. These transitions are reflected by the change in the previously defined µ from time ti to ti+1. Some sequences and transitions have been previously identified as being difficult to base call correctly [181]. The reference measurements provided by the nanopore manufacturer provide the mean and standard deviation associated with what one should expect to observe given a particular 6-mer DNA sequence. A state transition table is subsequently derived for

76 (A) (B)

mean visualized k-mers currents currents AAAAAAA (83.5, 83.5)

AAAAAAG (83.5, 81.1)

TTGTATG (79.5, 97.1)

AAACCTT (80.7, 98.9)

TTGTCACG (76.6, 109.1,92.4)

Figure 16. (A) DNA sequences can be interpreted as a sequence of k-mers, each k-mer can be replaced by its expected mean current to provide a mapping from DNA sequences into a high-dimensional current space. (B) Clusters of 2-dimensional points (sequence length 7), clusters are found using DBSCAN with neighborhood size 1 and epsilon 2 each K-mer DNA base. Each transition has an associated transition value τ that reflects the change in value from one K-mer to the next. We used DBSCAN [182], a density-based clustering approach, to identify the most unique τ patterns. The DBSCAN algorithm clusters these transitions based on density and their distance to one another. Clusters that are generated can be interpreted as sequences of k-mers which are easy to confuse with other members of the cluster, but easy to differentiate from other sequences. As a result, those clusters that tend to be smaller and more isolated reflect transition sequences that are more unique and less likely to be confused with other sequences; the more unique a sequence is, the more likely it is to be base called correctly. Because of these requirements we set the DBSCAN required density to 1 (i.e. a K-mer should be treated as a cluster even if

77 it has no similar k-mers), and we varied the DBSCAN epsilon(neighbor euclidean distance requirement) between .5 and 5; empirically we found that an epsilon of 2 (approximately the median standard deviations provided by the nanopore the model) with k-mers of length 10 (4 transitions) gave us a sufficient number of high-quality clusters. To determine the features used we selected the 500 smallest clusters, and selected the longest common subsequence within the cluster. These k-mers were then treated as high-accuracy k-mers. A virtual DNA thumb print that is capable of clustering each unique strand with its respective copies was then generated using the high-accuracy k-mers. The results-guided search allowed us to verify and refine our model predictions through the analysis of real-world data. To this end we gathered E. Coli K12 data previously made public by Nick Loman’s lab {http://lab.loman.net/2016/07/30/ nanopore-r9-data-release/}. These reads were then aligned to the genome using Graphmap [155], while discarding unmapped reads. From the resulting alignment files the segments of exactly matching sequence were extracted, and all k-mers (for k=3-10) within them were extracted and counted. These counts were then divided by the true counts, defined by the regions on the reference genome matched by the individual reads. This ratio provides recall for all k-mers, precision is calculated by selecting all k-mers in the same range present in the reads that were not eliminated. We find a high concordance in the k-mers identified by the model-guided and results-guided approaches, with a significant portion of the disagreement arising from subsequences with low change between dimensions as demonstrated in the first example of 16(A), where low current change between k-mers results in a higher probability of event caller error that is not accounted for in the model.

78 4.4.2 Utilization of K-means

After DNA sequences have been reduced to their thumb print, the K-means algorithm was employed over the feature space to cluster DNA copies. K-means is a robust clustering algorithm that minimizes the objective function previously defined in Equation 4.1 while allowing us to define the number of clusters. The number of unique strands define how many clusters the K-means algorithm should find. With simulated results the exact number of unique strands is known, and future biological experiments will provide the same information (through a colony count). We find though that we are able to recover the error-corrected original sequences even in cases where the number of clusters is modestly overpredicted or underpredicted, which is of great value when the counts may not be exact due to possible sample contamination. K-means was employed in our system using the L2 norm within the reduced subspace created by our DNA thumb print. One limitation of K-means is that it is only guaranteed to find a local minimum; to remedy this we perform 10 runs and select the run with the lowest sum of squared error (SSE). Aside from being a generic choice in K-means clustering results we find that SSE has a strong negative correlation with cluster purity on our simulated data. Finally, our implementation was employed using individual points as initial clusters; We find that this significantly removes the probability that a cluster will be empty. Additionally any cluster with less than 5 reads is treated as empty and ignored, as there is no chance of reads within being corrected to an acceptable level.

4.4.3 Synthetic Data Generation

Current read simulators [159], [160] are incapable of properly modeling k-mer accuracy distributions identified here. nonetheless due to the low cost, high availability, and

79 Organism Simulator number of MHAP K-Means reads purity purity E. Coli NanoSim 2,000 98.2% 96.6% E. Coli NanoSim 20,000 77.2% 75.9% Table 5. Comparison of MHAP and K-means clusters over the engineered feature space, the number of K-means clusters was set to the number of components identified by MHAP relatively high fidelity it is still valuable to validate computational techniques on simulated data. To this end we generated synthetic reads using Nanosim [159]. Each unique read was replicated according to a Poisson random value with an expected value of 50, this was empirically chosen to ensure (>99%) that at least 30 replicates were generated as we determined was required for read correction. Each replicate originates with the same sequence, but errors are calculated and applied independently. In total we generated two data sets; one with 2,000 (40 unique) synthetic reads 20,000 synthetic reads (414 unique) and performed downstream experiments on this set.

4.5 Results

4.5.1 Clustering on Synthetic Data

There are not currently any standard tools for identifying replicated reads without a reference genome. Most existing tools were generated for 2nd generation sequencing technologies where reads are already high accuracy, and duplicated reads provide no additional information, and as such they are removed after alignment [183]. In this regard the closest tool we could find is the popular read overlapper MinHash Alignment Process (MHAP) [175] used for de novo genome construction. As a metric of comparison we use cluster purity as shown in figure 5. To generate the following

80 data we ran MHAP using default parameters, from the output we removed any links with less than 70% overlap, and counted connected components as clusters. In small data sets our performance is good, but less than MHAP due to MHAP performing an alignment post-processing step; in this way the full reads can be used allowing for the identification of many false positives. In larger data sets there are larger numbers of more significant overlaps that come from distinct reads. Despite the utilization of cluster purity as a metric, the ultimate goal is to generate accurate reads to be used for downstream processing. To this end we utilized PBDAGCON [177] to create consensus sequences from aligning reads within each cluster. We find that in the smaller E. Coli data set we are able to accurately recover each of the 40 unique reads with a mean read accuracy of 97%, up from an individual mean read accuracy of 81%. In the larger data set we are able to recover 408 of 414 individual sequences, again with a mean accuracy of 97%. On inspection most of the failing reads are caused by a division of read copies between multiple clusters. Interestingly, while some sequences can be base called with only 15 copies, depending on the error distribution the consensus caller can fail with as many as 30 copies, indicating that biological sequencing experiments should aim to have more than 30 copies for read correction.

4.6 Conclusion

We have proposed a novel unsupervised learning technique capable of significantly increasing accuracy of nanopore base calls. We demonstrate that our technique provides significant advantages over current techniques by providing similar levels of accuracy while removing read length limitations imposed by previous techniques. We

81 tested our method on data simulated from a state of the art simulator, and hypothesize that on actual results the performance should be even better. The availability of high accuracy reads allows for the exploration of new applications, including; sequencing of larger organisms, organism disambiguation when sequencing a population, exact sequence detection in diploid and polyploid organisms, and the ability to scaffold genomes across exceptionally long repeat regions. Providing a path to these high accuracy-reads without compromising read length is therefore crucial to continued progress.

82 Chapter 5

CONCLUSIONS

This chapter concludes the dissertation by summarizing the contributions of the work and highlighting future directions.

5.1 Methodological Contributions

1. Modeling Gene Network Motifs In Chapter 1 this thesis presents a thorough analysis of a differential equation-based fully-connected triadic network construct. This construct is analyzed in both stable and dynamic environments to identify stable steady states, and to validate their differential stability, and transition dynamics. Implementing these discoveries requires a way to map real genes within a network into this triadic construct though, so this thesis delves deeper into network inference techniques. 2. Techniques for Data Aggregation In Chapter 2 this thesis reviews tech- niques for network inference, and provides techniques of data aggregation and normalization to improve inference quality. This is performed by collecting publicly available data and normalizing the process for interpreting raw reads. We find that while this approach shows some reduction to the noise incurred by combining data sets there is still asignificant amount of noise remaining, and for significant progress to be made high quality data is required. 3. Modeling and Simulation of Nanopore Sequencing Data in Chapter 4 this thesis reviews error patters in third generation nanopore sequencing. We demonstrate that existing read simulators fail to adequately represent the

83 accuracy bias that is present in nanopore data. To explain the sources of error a set of features are generated from the model and fit to the error profile, providing an explanation for the sources of error. Finally an HMM-based simulator capable of generating the accuracy-biased nanopore reads is demonstrated and validated against current read simulators. 4. Error Correction in Nanopore Sequencing In the final chapters this thesis introduces computational and biological techniques for improving the quality of nanopore reads using basecalled reads, using a directed acyclic graph, and by correcting the raw signal provided by reads, which can later be basecalled with higher fidelity.

5.2 Future Work

1. Further Exploration of Nanopore Error Sources. while this thesis presents some analysis on sources of errors in nanopore sequencing there is a significant fraction of error that remains unexplained. A more thorough understanding of this error, and the ability to predict k-mer accuracy errors completely in silico would be of significant value in the future of pore design. 2. Integration of Error Bias Findings Into Existing Software Tools. Our findings indicate that many existing software tools that assume parity among all k-mers are incorrect in their assumptions. Collaborations with other researchers to integrate our findings both into sequence aligners and basecallers would be of significant value to the scientific community as a whole.

84 NOTES

85 REFERENCES

[1] LC Junqueira, J Carneiro, and RO Kelley. “Basic Histology. Appleton and Lange”. In: Stamford, Connecticut. USA pp (1995), pp. 283–300.

[2] B. D. Macarthur, A. Ma’ayan, and I. R. Lemischka. “Systems biology of stem cell fate and cellular reprogramming”. In: Nat Rev Mol Cell Biol 10 (Oct. 2009). 10, pp. 672–81. doi: 10.1038/nrm2766.

[3] C. H. Waddington. The strategy of the genes; a discussion of some aspects of theoretical biology. London, Allen & Unwin, 1957. ix, 262 p.

[4] S.A. Kauffman. “Metabolic stability and epigenesis in randomly constructed genetic nets”. In: Journal of Theoretical Biology 22.3 (Mar. 1969), pp. 437–467. doi: 10.1016/0022-5193(69)90015-0. (Visited on 09/26/2013).

[5] S. Huang, G. Eichler, Y. Bar-Yam, et al. “Cell fates as high-dimensional attractor states of a complex gene regulatory network”. In: Phys Rev Lett 94 (Apr. 1, 2005). 12, p. 128701.

[6] Manu, S. Surkova, A. V. Spirov, et al. “Canalization of gene expression and domain shifts in the Drosophila blastoderm by dynamical attractors”. In: PLoS Comput Biol 5 (Mar. 2009). 3, e1000303. doi: 10.1371/journal.pcbi.1000303.

[7] M. Acar, A. Becskei, and A. van Oudenaarden. “Enhancement of cellular memory by reducing stochastic transitions”. In: Nature 435 (May 12, 2005). 7039, pp. 228–32.

[8] T. S. Gardner, C. R. Cantor, and J. J. Collins. “Construction of a genetic toggle switch in Escherichia coli”. In: Nature 403 (Jan. 20, 2000). 6767, pp. 339–42.

[9] Attila Becskei, Bertrand Séraphin, and Luis Serrano. “Positive feedback in eukaryotic gene networks: cell differentiation by graded to binary response conversion”. In: The EMBO Journal 20.10 (May 15, 2001), pp. 2528–2535. doi: 10.1093/emboj/20.10.2528. (Visited on 09/25/2013).

[10] F. J. Isaacs, J. Hasty, C. R. Cantor, et al. “Prediction and measurement of an autoregulatory genetic module”. In: Proc Natl Acad Sci U S A 100 (June 24, 2003). 13, pp. 7714–9. doi: 10.1073/pnas.1332628100.

[11] Tom Ellis, Xiao Wang, and James J. Collins. “Diversity-based, model-guided construction of synthetic gene networks with predicted functions”. In: Nature

86 27.5 (May 2009), pp. 465–471. doi: 10.1038/nbt.1536. (Visited on 07/03/2013).

[12] Min Wu, Ri-Qi Su, Xiaohui Li, et al. “Engineering of regulated stochastic cell fate determination”. In: Proceedings of the National Academy of Sciences 110.26 (June 25, 2013), pp. 10610–10615. doi: 10.1073/pnas.1305423110. (Visited on 07/03/2013).

[13] Daniel Huang, William J. Holtz, and Michel M. Maharbiz. “A genetic bistable switch utilizing nonlinear protein degradation”. In: Journal of Biological En- gineering 6.1 (July 9, 2012), p. 9. doi: 10.1186/1754-1611-6-9. (Visited on 07/27/2013).

[14] W. Xiong and J. E. Ferrell. “A positive-feedback-based bistable ’memory module’ that governs a cell fate decision”. In: Nature 426 (Nov. 27, 2003). 6965, pp. 460–5.

[15] Tetsuya Shiraishi, Shinako Matsuyama, and Hiroaki Kitano. “Large-Scale Analysis of Network Bistability for Human Cancers”. In: PLoS Comput Biol 6.7 (July 8, 2010), e1000851. doi: 10.1371/journal.pcbi.1000851. (Visited on 07/27/2013).

[16] S. Huang, Y. P. Guo, G. May, et al. “Bifurcation dynamics in lineage- commitment in bipotent progenitor cells”. In: Dev Biol 305 (May 15, 2007). 2, pp. 695–713.

[17] I. H. Park, R. Zhao, J. A. West, et al. “Reprogramming of human somatic cells to pluripotency with defined factors”. In: Nature 451 (Jan. 10, 2008). 7175, pp. 141–6. doi: 10.1038/nature06534.

[18] Kazutoshi Takahashi and Shinya Yamanaka. “Induction of Pluripotent Stem Cells from Mouse Embryonic and Adult Fibroblast Cultures by Defined Factors”. In: Cell 126.4 (Aug. 25, 2006), pp. 663–676. doi: 10.1016/j.cell.2006.07.024. (Visited on 10/08/2013).

[19] K. Takahashi, K. Tanabe, M. Ohnuki, et al. “Induction of pluripotent stem cells from adult human fibroblasts by defined factors”. In: Cell 131 (Nov. 30, 2007). 5, pp. 861–72. doi: 10.1016/j.cell.2007.11.019.

[20] J. Yu, M. A. Vodyanik, K. Smuga-Otto, et al. “Induced pluripotent stem cell lines derived from human somatic cells”. In: Science 318 (Dec. 21, 2007). 5858, pp. 1917–20. doi: 10.1126/science.1151526.

87 [21] L. A. Boyer, T. I. Lee, M. F. Cole, et al. “Core transcriptional regulatory circuitry in human embryonic stem cells”. In: Cell 122 (Sept. 23, 2005). 6, pp. 947–56. doi: 10.1016/j.cell.2005.08.020.

[22] Jonghwan Kim, Jianlin Chu, Xiaohua Shen, et al. “An Extended Transcriptional Network for Pluripotency of Embryonic Stem Cells”. In: Cell 132.6 (Mar. 21, 2008), pp. 1049–1061. doi: 10.1016/j.cell.2008.02.039. (Visited on 07/03/2013).

[23] T. Kalmar, C. Lim, P. Hayward, et al. “Regulated fluctuations in nanog expression mediate cell fate decisions in embryonic stem cells”. In: PLoS Biol 7 (July 2009). 7, e1000149.

[24] Ian Chambers, Jose Silva, Douglas Colby, et al. “Nanog safeguards pluripotency and mediates germline development”. In: Nature 450.7173 (Dec. 20, 2007), pp. 1230–1234. doi: 10.1038/nature06403.

[25] T. Graf and M. Stadtfeld. “Heterogeneity of embryonic and adult stem cells”. In: Cell Stem Cell 3 (Nov. 6, 2008). 5, pp. 480–3. doi: 10.1016/j.stem.2008.10.007.

[26] Maurice A Canham, Alexei A Sharov, Minoru S H Ko, et al. “Functional het- erogeneity of embryonic stem cells revealed through translational amplification of an early endodermal transcript”. In: PLoS biology 8.5 (May 2010), e1000379. doi: 10.1371/journal.pbio.1000379.

[27] E. M. Chan, S. Ratanasirintrawoot, I. H. Park, et al. “Live cell imaging distinguishes bona fide human iPS cells from partially reprogrammed cells”. In: Nat Biotechnol 27 (Nov. 2009). 11, pp. 1033–7. doi: 10.1038/nbt.1580.

[28] Katsuhiko Hayashi, Susana M Chuva de Sousa Lopes, Fuchou Tang, et al. “Dynamic equilibrium and heterogeneity of mouse pluripotent stem cells with distinct functional and epigenetic states”. In: Cell stem cell 3.4 (Oct. 9, 2008), pp. 391–401. doi: 10.1016/j.stem.2008.07.027.

[29] Ben D MacArthur, Ana Sevilla, Michel Lenz, et al. “Nanog-dependent feedback loops regulate murine embryonic stem cell heterogeneity”. In: Nature cell biology 14.11 (Nov. 2012), pp. 1139–1147. doi: 10.1038/ncb2603.

[30] Todd S Macfarlan, Wesley D Gifford, Shawn Driscoll, et al. “Embryonic stem cell potency fluctuates with endogenous retrovirus activity”. In: Nature 487.7405 (July 5, 2012), pp. 57–63. doi: 10.1038/nature11244.

[31] J. Silva and A. Smith. “Capturing pluripotency”. In: Cell 132 (Feb. 22, 2008). 4, pp. 532–6. doi: 10.1016/j.cell.2008.02.006.

88 [32] Yayoi Toyooka, Daisuke Shimosato, Kazuhiro Murakami, et al. “Identification and characterization of subpopulations in undifferentiated ES cell culture”. In: Development (Cambridge, England) 135.5 (Mar. 2008), pp. 909–918. doi: 10.1242/dev.017400.

[33] Jamie Trott, Katsuhiko Hayashi, Azim Surani, et al. “Dissecting ensemble networks in ES cell populations reveals micro-heterogeneity underlying pluripo- tency”. In: Molecular bioSystems 8.3 (Mar. 2012), pp. 744–752. doi: 10.1039/ c1mb05398a.

[34] J. Hanna, K. Saha, B. Pando, et al. “Direct cell reprogramming is a stochastic process amenable to acceleration”. In: Nature 462 (Dec. 3, 2009). 7273, pp. 595– 601. doi: 10.1038/nature08592.

[35] S. Yamanaka. “Elite and stochastic models for induced pluripotent stem cell generation”. In: Nature 460 (July 2, 2009). 7251, pp. 49–52. doi: 10.1038/ nature08180.

[36] A. Brock, H. Chang, and S. Huang. “Non-genetic heterogeneity–a mutation- independent driving force for the somatic evolution of tumours”. In: Nat Rev Genet 10 (May 2009). 5, pp. 336–42. doi: 10.1038/nrg2556.

[37] Sui Huang. “Tumor progression: chance and necessity in Darwinian and Lamar- ckian somatic (mutationless) evolution”. In: Progress in biophysics and molecular biology 110.1 (Sept. 2012), pp. 69–86. doi: 10.1016/j.pbiomolbio.2012.05.001.

[38] Wenzhe Ma, Ala Trusina, Hana El-Samad, et al. “Defining Network Topologies that Can Achieve Biochemical Adaptation”. In: Cell 138.4 (Aug. 21, 2009), pp. 760–773. doi: 10.1016/j.cell.2009.06.013. (Visited on 07/03/2013).

[39] R. Milo, S. Shen-Orr, S. Itzkovitz, et al. “Network motifs: simple building blocks of complex networks”. In: Science 298 (Oct. 25, 2002). 5594, pp. 824–7. doi: 10.1126/science.298.5594.824.

[40] Jian Shu, Chen Wu, Yetao Wu, et al. “Induction of pluripotency in mouse somatic cells with lineage specifiers”. In: Cell 153.5 (May 23, 2013), pp. 963–975. doi: 10.1016/j.cell.2013.05.001.

[41] I. Chambers and A. Smith. “Self-renewal of teratocarcinoma and embryonic stem cells”. In: Oncogene 23 (Sept. 20, 2004). 43, pp. 7150–60. doi: 10.1038/sj. onc.1207930.

89 [42] T. Graf and T. Enver. “Forcing cells to change lineages”. In: Nature 462 (Dec. 3, 2009). 7273, pp. 587–94. doi: 10.1038/nature08533.

[43] H. Niwa. “How is pluripotency determined and maintained?” In: Development 134 (Feb. 2007). 4, pp. 635–46. doi: 10.1242/dev.02787.

[44] Piyush B. Gupta, Christine M. Fillmore, Guozhi Jiang, et al. “Stochastic State Transitions Give Rise to Phenotypic Equilibrium in Populations of Cancer Cells”. In: Cell 146.4 (Aug. 19, 2011), pp. 633–644. doi: 10.1016/j.cell.2011.07.026. (Visited on 07/27/2013).

[45] Avigdor Eldar and Michael B. Elowitz. “Functional roles for noise in genetic circuits”. In: Nature 467.7312 (Sept. 9, 2010), pp. 167–173. doi: 10.1038/ nature09326. (Visited on 07/27/2013).

[46] Peter Laslo, Chauncey J. Spooner, Aryeh Warmflash, et al. “Multilineage Transcriptional Priming and Determination of Alternate Hematopoietic Cell Fates”. In: Cell 126.4 (Aug. 25, 2006), pp. 755–766. doi: 10.1016/j.cell.2006.06. 052. (Visited on 09/04/2013).

[47] M. Kaufman, C. Soule, and R. Thomas. “A new necessary condition on inter- action graphs for multistationarity”. In: J Theor Biol 248 (Oct. 21, 2007). 4, pp. 675–85. doi: 10.1016/j.jtbi.2007.06.016.

[48] J. Hasty, J. Pradines, M. Dolnik, et al. “Noise-based switches and amplifiers for gene expression”. In: Proc Natl Acad Sci U S A 97 (Feb. 29, 2000). 5, pp. 2075–80.

[49] Santhosh Palani and Casim A Sarkar. “Transient noise amplification and gene expression synchronization in a bistable mammalian cell-fate switch”. In: Cell reports 1.3 (Mar. 29, 2012), pp. 215–224. doi: 10.1016/j.celrep.2012.01.007.

[50] Harukazu Suzuki, Alistair R. R. Forrest, Erik van Nimwegen, et al. “The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line”. In: Nature Genetics 41.5 (May 2009), pp. 553–562. doi: 10.1038/ng.375. (Visited on 08/08/2013).

[51] Hao Yuan Kueh, Ameya Champhekhar, Stephen L. Nutt, et al. “Positive Feedback Between PU.1 and the Cell Cycle Controls Myeloid Differentiation”. In: Science 341.6146 (Aug. 9, 2013), pp. 670–673. doi: 10.1126/science.1240831. (Visited on 08/08/2013).

90 [52] Richard A. Young. “Control of the Embryonic Stem Cell State”. In: Cell 144.6 (Mar. 18, 2011), pp. 940–954. doi: 10.1016/j.cell.2011.01.032. (Visited on 06/04/2014).

[53] Bard Ermentrout. Simulating, analyzing, and animating dynamical systems : a guide to XPPAUT for researchers and students. Software, environments, tools. Philadelphia: Society for Industrial and Applied Mathematics, 2002. xiv, 290 p.

[54] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Reading, Mass.: Addison-Wesley Pub., 1994. xi, 498.

[55] K. C. Chen, L. Calzone, A. Csikasz-Nagy, et al. “Integrative analysis of cell cycle control in budding yeast”. In: Mol Biol Cell 15 (Aug. 2004). 8, pp. 3841– 62.

[56] W. J. Blake, K. AErn M, C. R. Cantor, et al. “Noise in eukaryotic gene expression”. In: Nature 422 (Apr. 10, 2003). 6932, pp. 633–7.

[57] W. J. Blake, G. Balazsi, M. A. Kohanski, et al. “Phenotypic consequences of promoter-mediated transcriptional noise”. In: Mol Cell 24 (Dec. 28, 2006). 6, pp. 853–65.

[58] M. B. Elowitz, A. J. Levine, E. D. Siggia, et al. “Stochastic gene expression in a single cell”. In: Science 297 (Aug. 16, 2002). 5584, pp. 1183–6.

[59] Nicholas J Guido, Xiao Wang, David Adalsteinsson, et al. “A bottom-up approach to gene regulation”. In: Nature 439.7078 (Feb. 16, 2006), pp. 856–860. doi: 10.1038/nature04473.

[60] J. M. Raser and E. K. O’Shea. “Control of stochasticity in eukaryotic gene expression”. In: Science 304 (June 18, 2004). 5678, pp. 1811–4.

[61] G. M. Suel, J. Garcia-Ojalvo, L. M. Liberman, et al. “An excitable gene regulatory circuit induces transient cellular differentiation”. In: Nature 440 (Mar. 23, 2006). 7083, pp. 545–50.

[62] M. Thattai and A. van Oudenaarden. “Intrinsic noise in gene regulatory net- works”. In: Proc Natl Acad Sci U S A 98 (July 17, 2001). 15, pp. 8614–9.

[63] M. L. Bell, J. B. Earl, and S. G. Britt. “Two types of Drosophila R7 photorecep- tor cells are arranged randomly: a model for stochastic cell-fate determination”. In: J Comp Neurol 502 (May 1, 2007). 1, pp. 75–85.

91 [64] V. Chickarmane, T. Enver, and C. Peterson. “Computational modeling of the hematopoietic erythroid-myeloid switch reveals insights into cooperativity, priming, and irreversibility”. In: PLoS Comput Biol 5 (Jan. 2009). 1, e1000268.

[65] C. F. Chuang, M. K. Vanhoven, R. D. Fetter, et al. “An innexin-dependent cell network establishes left-right neuronal asymmetry in C. elegans”. In: Cell 129 (May 18, 2007). 4, pp. 787–99. doi: 10.1016/j.cell.2007.02.052.

[66] A. Raj, S. A. Rifkin, E. Andersen, et al. “Variability in gene expression underlies incomplete penetrance”. In: Nature 463 (Feb. 18, 2010). 7283, pp. 913–8. doi: 10.1038/nature08781.

[67] G. Yao, T. J. Lee, S. Mori, et al. “A bistable Rb-E2F switch underlies the restriction point”. In: Nat Cell Biol 10 (Apr. 2008). 4, pp. 476–82.

[68] D. Adalsteinsson, D. McMillen, and T. C. Elston. “Biochemical Network Stochastic Simulator (BioNetS): software for stochastic modeling of biochemical networks”. In: BMC Bioinformatics 5 (Mar. 8, 2004), p. 24.

[69] Koichi Akashi, Xi He, Jie Chen, et al. “Transcriptional accessibility for genes of multiple tissues and hematopoietic lineages is hierarchically controlled during early hematopoiesis”. In: Blood 101.2 (Jan. 15, 2003), pp. 383–389. doi: 10. 1182/blood-2002-06-1780.

[70] M. A. Cross and T. Enver. “The lineage commitment of haemopoietic progenitor cells”. In: Curr Opin Genet Dev 7 (Oct. 1997). 5, pp. 609–13.

[71] M. Hu, D. Krause, M. Greaves, et al. “Multilineage gene expression precedes commitment in the hemopoietic system”. In: Genes Dev 11 (Mar. 15, 1997). 6, pp. 774–85.

[72] Carla F Bender Kim, Erica L Jackson, Amber E Woolfenden, et al. “Identifica- tion of bronchioalveolar stem cells in normal lung and lung cancer”. In: Cell 121.6 (June 17, 2005), pp. 823–835. doi: 10.1016/j.cell.2005.03.032.

[73] T. Miyamoto, H. Iwasaki, B. Reizis, et al. “Myeloid or lymphoid promiscuity as a critical step in hematopoietic lineage commitment”. In: Dev Cell 3 (July 2002). 1, pp. 137–47.

[74] I. S. Gradshteyn, I. M. Ryzhik, and Alan Jeffrey. Table of integrals, series, and products. 6th. San Diego: Academic Press, 2000. xlvii, 1163 p. url: http: //www.loc.gov/catdir/description/els031/00104373.html%20http://www.loc. gov/catdir/toc/els031/00104373.html.

92 [75] Joseph Xu Zhou, M. D. S. Aliyu, Erik Aurell, et al. “Quasi-potential landscape in complex multi-stable systems”. In: Journal of The Royal Society Interface (Aug. 29, 2012), rsif20120434. doi: 10 . 1098 / rsif . 2012 . 0434. (Visited on 03/19/2014).

[76] Ben D. MacArthur and Ihor R. Lemischka. “Statistical Mechanics of Pluripo- tency”. In: Cell 154.3 (Aug. 1, 2013), pp. 484–489. doi: 10.1016/j.cell.2013.07. 024. (Visited on 08/02/2013).

[77] T. Vierbuchen, A. Ostermeier, Z. P. Pang, et al. “Direct conversion of fibroblasts to functional neurons by defined factors”. In: Nature 463 (Feb. 25, 2010). 7284, pp. 1035–41. doi: 10.1038/nature08797.

[78] T. Nakagawa, M. Sharma, Y. Nabeshima, et al. “Functional hierarchy and reversibility within the murine spermatogenic stem cell compartment”. In: Science 328 (Apr. 2, 2010). 5974, pp. 62–7. doi: 10.1126/science.1182868.

[79] Y. H. Loh, Q. Wu, J. L. Chew, et al. “The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells”. In: Nat Genet 38 (Apr. 2006). 4, pp. 431–40. doi: 10.1038/ng1760.

[80] Miguel Fidalgo, Francesco Faiola, Carlos-Filipe Pereira, et al. “Zfp281 mediates Nanog autorepression through recruitment of the NuRD complex and inhibits somatic cell reprogramming”. In: Proceedings of the National Academy of Sciences of the United States of America 109.40 (Oct. 2, 2012), pp. 16202– 16207. doi: 10.1073/pnas.1208533109.

[81] Pablo Navarro, Nicola Festuccia, Douglas Colby, et al. “OCT4/SOX2- independent Nanog autorepression modulates heterogeneous Nanog gene expression in mouse ES cells”. In: The EMBO journal 31.24 (Dec. 12, 2012), pp. 4547–4562. doi: 10.1038/emboj.2012.321.

[82] H Niwa, J Miyazaki, and A G Smith. “Quantitative expression of Oct-3/4 defines differentiation, dedifferentiation or self-renewal of ES cells”. In: Nature genetics 24.4 (Apr. 2000), pp. 372–376. doi: 10.1038/74199.

[83] Aliaksandra Radzisheuskaya, Gloryn Le Bin Chia, Rodrigo L dos Santos, et al. “A defined Oct4 level governs cell state transitions of pluripotency entry and differentiation into all embryonic lineages”. In: Nature cell biology 15.6 (June 2013), pp. 579–590. doi: 10.1038/ncb2742.

[84] Violetta Karwacki-Neisius, Jonathan Göke, Rodrigo Osorno, et al. “Reduced Oct4 Expression Directs a Robust Pluripotent State with Distinct Signaling

93 Activity and Increased Enhancer Occupancy by Oct4 and Nanog”. In: Cell Stem Cell 12.5 (Feb. 5, 2013), pp. 531–545. doi: 10.1016/j.stem.2013.04.023. (Visited on 06/04/2014).

[85] J. Wang, D. N. Levasseur, and S. H. Orkin. “Requirement of Nanog dimerization for stem cell self-renewal and pluripotency”. In: Proc Natl Acad Sci U S A 105 (Apr. 29, 2008). 17, pp. 6326–31. doi: 10.1073/pnas.0802288105.

[86] Joon-Lin Chew, Yuin-Han Loh, Wensheng Zhang, et al. “Reciprocal transcrip- tional regulation of Pou5f1 and Sox2 via the Oct4/Sox2 complex in embryonic stem cells”. In: Molecular and cellular biology 25.14 (July 2005), pp. 6031–6046. doi: 10.1128/MCB.25.14.6031-6046.2005.

[87] Vijay Chickarmane, Carl Troein, Ulrike A Nuber, et al. “Transcriptional Dynam- ics of the Embryonic Stem Cell Switch”. In: PLoS Comput Biol 2.9 (Sept. 15, 2006), e123. doi: 10.1371/journal.pcbi.0020123. (Visited on 06/04/2014).

[88] Irene Aksoy, Ralf Jauch, Jiaxuan Chen, et al. “Oct4 switches partnering from Sox2 to Sox17 to reinterpret the enhancer code and specify endoderm”. In: The EMBO journal 32.7 (Apr. 3, 2013), pp. 938–953. doi: 10.1038/emboj.2013.31.

[89] Dina A. Faddah, Haoyi Wang, Albert Wu Cheng, et al. “Single-Cell Analysis Reveals that Expression of Nanog Is Biallelic and Equally Variable as that of Other Pluripotency Factors in Mouse ESCs”. In: Cell Stem Cell 13.1 (July 3, 2013), pp. 23–29. doi: 10.1016/j.stem.2013.04.019. (Visited on 08/08/2013).

[90] Adam Filipczyk, Konstantinos Gkatzis, Jun Fu, et al. “Biallelic Expression of Nanog Protein in Mouse Embryonic Stem Cells”. In: Cell Stem Cell 13.1 (July 3, 2013), pp. 12–13. doi: 10.1016/j.stem.2013.04.025. (Visited on 08/08/2013).

[91] Yusuke Miyanari and Maria-Elena Torres-Padilla. “Control of ground-state pluripotency by allelic regulation of Nanog”. In: Nature 483.7390 (Feb. 12, 2012), pp. 470–473. doi: 10.1038/nature10807. (Visited on 03/22/2012).

[92] Austin Smith. “Nanog Heterogeneity: Tilting at Windmills?” In: Cell Stem Cell 13.1 (July 3, 2013), pp. 6–7. doi: 10.1016/j.stem.2013.06.016. (Visited on 08/08/2013).

[93] O. Adewumi, B. Aflatoonian, L. Ahrlund-Richter, et al. “Characterization of human embryonic stem cell lines by the International Stem Cell Initiative”. In: Nat Biotechnol 25 (July 2007). 7, pp. 803–16. doi: 10.1038/nbt1318.

94 [94] Kevin F Murphy, Rhys M Adams, Xiao Wang, et al. “Tuning and controlling gene expression noise in synthetic gene networks”. In: Nucleic acids research 38.8 (May 2010), pp. 2712–2726. doi: 10.1093/nar/gkq091.

[95] Hong-xi Zhao, Yang Li, Hai-feng Jin, et al. “Rapid and efficient repro- gramming of human amnion-derived cells into pluripotency by three factors OCT4/SOX2/NANOG”. In: Differentiation; research in biological diversity 80.2 (Oct. 2010), pp. 123–129. doi: 10.1016/j.diff.2010.03.002.

[96] R. Jaenisch and R. Young. “Stem cells, the molecular circuitry of pluripotency and nuclear reprogramming”. In: Cell 132 (Feb. 22, 2008). 4, pp. 567–82. doi: 10.1016/j.cell.2008.01.015.

[97] Vidyadhar G. Kulkarni. Modeling and analysis of stochastic systems. New York: Chapman & Hall, 1995. url: http://www.loc.gov/catdir/enhancements/ fy0744/95015182-d.html.

[98] Filipe Grácio, Joaquim Cabral, and Bruce Tidor. “Modeling Stem Cell Induction Processes”. In: PLoS ONE 8.5 (May 8, 2013), e60240. doi: 10.1371/journal. pone.0060240. (Visited on 08/08/2013).

[99] John E Pimanda, Katrin Ottersbach, Kathy Knezevic, et al. “Gata2, Fli1, and Scl form a recursively wired gene-regulatory circuit during early hematopoietic development”. In: Proceedings of the National Academy of Sciences of the United States of America 104.45 (Nov. 6, 2007), pp. 17692–17697. doi: 10.1073/ pnas.0707045104.

[100] Michael Kyba and George Q Daley. “Hematopoiesis from embryonic stem cells: lessons from and for ontogeny”. In: Experimental hematology 31.11 (Nov. 2003), pp. 994–1006.

[101] Sunhong Kim, Yongsung Kim, Jiwoon Lee, et al. “Regulation of FOXO1 by TAK1-Nemo-like kinase pathway”. In: The Journal of biological chemistry 285.11 (Mar. 12, 2010), pp. 8122–8129. doi: 10.1074/jbc.M110.101824.

[102] B. Pourcet, I. Pineda-Torra, B. Derudas, et al. “SUMOylation of human peroxisome proliferator-activated receptor alpha inhibits its trans-activity through the recruitment of the nuclear corepressor NCoR”. In: J Biol Chem 285 (Feb. 26, 2010). 9, pp. 5983–92. doi: 10.1074/jbc.M109.078311.

[103] F. Wei, H. R. Scholer, and M. L. Atchison. “Sumoylation of Oct4 enhances its stability, DNA binding, and transactivation”. In: J Biol Chem 282 (July 20, 2007). 29, pp. 21551–60. doi: 10.1074/jbc.M611041200.

95 [104] Alex K. Shalek, Rahul Satija, Xian Adiconis, et al. “Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells”. In: Nature 498.7453 (June 13, 2013), pp. 236–240. doi: 10.1038/nature12172. url: http: //www.nature.com/nature/journal/v498/n7453/full/nature12172.html (visited on 08/28/2014).

[105] Ting Lu, Michael Ferry, Ron Weiss, et al. “A molecular noise generator”. In: Physical biology 5.3 (2008), p. 036006. doi: 10.1088/1478-3975/5/3/036006.

[106] Mathieu Gabut, Payman Samavarchi-Tehrani, Xinchen Wang, et al. “An Al- ternative Splicing Switch Regulates Embryonic Stem Cell Pluripotency and Reprogramming”. In: Cell 147 (Sept. 2011), pp. 132–146. doi: 10.1016/j.cell. 2011.08.023. (Visited on 10/18/2011).

[107] Yu Lu, Yuin-Han Loh, Hu Li, et al. “Alternative Splicing of MBD2 Supports Self-Renewal in Human Pluripotent Stem Cells”. In: Cell stem cell (May 7, 2014). doi: 10.1016/j.stem.2014.04.002.

[108] T. L. Deans, C. R. Cantor, and J. J. Collins. “A tunable genetic switch based on RNAi and repressor proteins for regulating gene expression in mammalian cells”. In: Cell 130 (July 27, 2007). 2, pp. 363–72.

[109] A. S. Khalil and J. J. Collins. “Synthetic biology: applications come of age”. In: Nat Rev Genet 11 (May 2010). 5, pp. 367–79. doi: 10.1038/nrg2775.

[110] T. K. Lu, A. S. Khalil, and J. J. Collins. “Next-generation synthetic gene networks”. In: Nat Biotechnol 27 (Dec. 2009). 12, pp. 1139–50. doi: 10.1038/ nbt.1591.

[111] J. Stricker, S. Cookson, M. R. Bennett, et al. “A fast, robust and tunable synthetic gene oscillator”. In: Nature 456 (Nov. 27, 2008). 7221, pp. 516–9. doi: 10.1038/nature07389.

[112] M. Tigges, T. T. Marquez-Lago, J. Stelling, et al. “A tunable synthetic mam- malian oscillator”. In: Nature 457 (Jan. 15, 2009). 7227, pp. 309–12.

[113] R. Lu, F. Markowetz, R. D. Unwin, et al. “Systems-level dynamic analyses of fate change in murine embryonic stem cells”. In: Nature 462 (Nov. 19, 2009). 7271, pp. 358–62. doi: 10.1038/nature08575.

[114] Ingmar Glauche, Maria Herberg, and Ingo Roeder. “Nanog Variability and Pluripotency Regulation of Embryonic Stem Cells - Insights from a Mathe-

96 matical Model Analysis”. In: PLoS ONE 5.6 (June 21, 2010), e11238. doi: 10.1371/journal.pone.0011238. (Visited on 07/03/2013).

[115] A. Remenyi, K. Lins, L. J. Nissen, et al. “Crystal structure of a POU/HMG/DNA ternary complex suggests differential assembly of Oct4 and Sox2 on two enhancers”. In: Genes Dev 17 (Aug. 15, 2003). 16, pp. 2048–59. doi: 10.1101/gad.269303.

[116] A. Dhooge, W. Govaerts, and Yu. A. Kuznetsov. “MATCONT: A MATLAB Package for Numerical Bifurcation Analysis of ODEs”. In: ACM Trans. Math. Softw. 29.2 (June 2003), pp. 141–164. doi: 10.1145/779359.779362. (Visited on 11/15/2013).

[117] Daniel Gillespie. “Exact stochastic simulation of coupled chemical reactions”. In: J. Phys. Chem. 81 (1977), pp. 2340–2361.

[118] Tong Ihn Lee, Nicola J. Rinaldi, Francois Robert, et al. “Transcriptional Regu- latory Networks in Saccharomyces cerevisiae”. In: Science 298.5594 (Oct. 25, 2002), pp. 799–804. doi: 10 . 1126 / science . 1075090. url: http : / / science . sciencemag.org.ezproxy1.lib.asu.edu/content/298/5594/799 (visited on 07/13/2017).

[119] Philippe C. Faucon, Keith Pardee, Roshan M. Kumar, et al. “Gene Networks of Fully Connected Triads with Complete Auto-Activation Enable Multistability and Stepwise Stochastic Transitions”. In: PLOS ONE 9.7 (July 24, 2014), e102873. doi: 10.1371/journal.pone.0102873. url: http://journals.plos.org/ plosone/article?id=10.1371/journal.pone.0102873 (visited on 09/14/2016).

[120] Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, et al. “NCBI GEO: archive for functional genomics data sets”. In: Nucleic Acids Research 41 (D1 Jan. 1, 2013), pp. D991–D995. doi: 10.1093/nar/gks1193. url: https://academic.oup. com/nar/article/41/D1/D991/1067995/NCBI-GEO-archive-for-functional- genomics-data-sets (visited on 07/13/2017).

[121] Markus Maucher, Barbara Kracher, Michael Kühl, et al. “Inferring Boolean network structure via correlation”. In: Bioinformatics 27.11 (June 1, 2011), pp. 1529–1536. doi: 10.1093/bioinformatics/btr166. url: http://bioinformatics. oxfordjournals.org/content/27/11/1529 (visited on 03/21/2016).

[122] Daniel Marbach, James C. Costello, Robert Küffner, et al. “Wisdom of crowds for robust gene network inference”. In: Nature Methods 9.8 (Aug. 2012), pp. 796– 804. doi: 10.1038/nmeth.2016. url: http://www.nature.com.ezproxy1.lib.asu. edu/nmeth/journal/v9/n8/full/nmeth.2016.html (visited on 07/13/2017).

97 [123] Jing Zhao, Lili Miao, Jian Yang, et al. “Prediction of Links and Weights in Networks by Reliable Routes”. In: Scientific Reports 5 (July 22, 2015), p. 12261. doi: 10.1038/srep12261. url: http://www.nature.com/articles/srep12261 (visited on 04/04/2016).

[124] Gordon E Moore et al. Cramming more components onto integrated circuits. McGraw-Hill, 1965.

[125] Ilya Shmulevich, Edward R. Dougherty, Seungchan Kim, et al. “Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks”. In: Bioinformatics 18.2 (Feb. 1, 2002), pp. 261–274. doi: 10.1093/bioinformati cs/18.2.261. (Visited on 09/24/2013).

[126] E P van Someren, L F Wessels, and M J Reinders. “Linear modeling of genetic networks from experimental data”. In: Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology 8 (2000), pp. 355–366.

[127] P D’haeseleer, X Wen, S Fuhrman, et al. “Linear modeling of mRNA expres- sion levels during CNS development and injury”. In: Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing (1999), pp. 41–52.

[128] Min Zou and Suzanne D. Conzen. “A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data”. In: Bioinformatics 21.1 (Jan. 1, 2005), pp. 71–79. doi: 10.1093/bioinfor matics/bth463. (Visited on 10/10/2013).

[129] Nir Friedman, Michal Linial, Iftach Nachman, et al. “Using Bayesian Net- works to Analyze Expression Data”. In: Journal of Computational Biology 7.3 (Aug. 2000), pp. 601–620. doi: 10.1089/106652700750050961. (Visited on 09/25/2013).

[130] E. J. Moler, M. L. Chow, and I. S. Mian. “Analysis of molecular profile data using generative and discriminative methods”. In: Physiological Genomics 4.2 (Jan. 1, 2000), pp. 109–126. (Visited on 09/25/2013).

[131] Sui Huang, Ingemar Ernberg, and Stuart Kauffman. “Cancer attractors: A systems view of tumors from a gene network dynamics and developmental perspective”. In: Seminars in Cell & Developmental Biology 20.7 (Sept. 2009), pp. 869–876. doi: 10.1016/j.semcdb.2009.07.003. (Visited on 09/26/2013).

98 [132] Sui Huang and Donald E. Ingber. “The structural and mechanical complexity of cell-growth control”. In: Nature Cell Biology 1.5 (Sept. 1999), E131–E138. doi: 10.1038/13043. (Visited on 10/08/2013).

[133] Sui Huang. “Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery”. In: Journal of Molecular Medicine 77.6 (June 1, 1999), pp. 469–480. doi: 10.1007/s001099900 023. (Visited on 10/08/2013).

[134] Kevin Murphy and Saira Mian. “Modelling Gene Expression Data using Dy- namic Bayesian Networks”. In: (May 1999). url: http://www-devel.cs.ubc.ca/ ~murphyk/Papers/ismb99.pdf.

[135] Huilei Xu, Yen-Sin Ang, Ana Sevilla, et al. “Construction and validation of a regulatory network for pluripotency and self-renewal of mouse embryonic stem cells”. In: PLoS computational biology 10.8 (Aug. 2014), e1003777. doi: 10.1371/journal.pcbi.1003777.

[136] Javed Khan, Jun S. Wei, Markus Ringner, et al. “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural net- works”. In: Nature Medicine 7.6 (June 2001), pp. 673–679. doi: 10.1038/89044. (Visited on 09/27/2013).

[137] R. Brause, T. Langsdorf, and M. Hepp. “Neural data mining for credit card fraud detection”. In: 11th IEEE International Conference on Tools with Artificial Intelligence, 1999. Proceedings. 11th IEEE International Conference on Tools with Artificial Intelligence, 1999. Proceedings. 1999, pp. 103–106. doi: 10.1109/ TAI.1999.809773.

[138] M. Syeda, Yan-Qing Zhang, and Y. Pan. “Parallel granular neural networks for fast credit card fraud detection”. In: Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, 2002. FUZZ-IEEE’02. Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, 2002. FUZZ-IEEE’02. Vol. 1. 2002, pp. 572–577. doi: 10.1109/FUZZ.2002.1005055.

[139] Richard Bellman and Robert E Kalaba. Dynamic programming and modern control theory. Academic Press New York, 1965.

[140] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, et al. “Empirical Eval- uation of Gated Recurrent Neural Networks on Sequence Modeling”. In: arXiv:1412.3555 [cs] (Dec. 11, 2014). arXiv: 1412.3555. url: http://arxiv.org/ abs/1412.3555 (visited on 10/11/2016).

99 [141] Mukesh Bansal, Vincenzo Belcastro, Alberto Ambesi-Impiombato, et al. “How to infer gene networks from expression profiles”. In: Molecular Systems Biology 3.1 (Jan. 1, 2007), p. 78. doi: 10.1038/msb4100120. url: http://msb.embopress. org/content/3/1/78 (visited on 07/13/2017).

[142] Jing Yu, V. Anne Smith, Paul P. Wang, et al. “Advances to Bayesian network inference for generating causal networks from observational biological data”. In: Bioinformatics 20.18 (Dec. 12, 2004), pp. 3594–3603. doi: 10.1093/bioin formatics/bth448. url: https://academic.oup.com/bioinformatics/article/ 20/18/3594/202541/Advances-to-Bayesian-network-inference-for (visited on 07/13/2017).

[143] Barbara Treutlein, Doug G. Brownfield, Angela R. Wu, et al. “Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq”. In: Nature 509.7500 (May 15, 2014), pp. 371–375. doi: 10.1038/nature13173.

[144] Eric W. Brunskill, Joo-Seop Park, Eunah Chung, et al. “Single cell dissection of early kidney development: multilineage priming”. In: Development (Cambridge, England) 141.15 (Aug. 2014), pp. 3093–3101. doi: 10.1242/dev.110601.

[145] Saiful Islam, Una Kjallquist, Annalena Moliner, et al. “Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq”. In: Genome Research 21.7 (July 2011), pp. 1160–1167. doi: 10.1101/gr.110882.110.

[146] Robert N. Plasschaert, Sebastien Vigneau, Italo Tempera, et al. “CTCF binding site sequence differences are associated with unique regulatory and functional trends during embryonic stem cell differentiation”. In: Nucleic Acids Research 42.2 (Jan. 2014), pp. 774–789. doi: 10.1093/nar/gkt910.

[147] Jeremy Goecks, Anton Nekrutenko, James Taylor, et al. “Galaxy: a com- prehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences”. In: Genome Biology 11.8 (2010), R86. doi: 10.1186/gb-2010-11-8-r86.

[148] Daniel Blankenberg, Gregory Von Kuster, Nathaniel Coraor, et al. “Galaxy: a web-based genome analysis tool for experimentalists”. In: Current Protocols in Molecular Biology Chapter 19 (Jan. 2010), pages. doi: 10.1002/0471142727. mb1910s89.

[149] Chunlei Wu, Ian Macleod, and Andrew I. Su. “BioGPS and MyGene.info: organizing online, gene-centric information”. In: Nucleic Acids Research 41 (Database issue Jan. 2013), pp. D561–565. doi: 10.1093/nar/gks1114.

100 [150] Zhide Fang and Xiangqin Cui. “Design and validation issues in RNA-seq experiments”. In: Briefings in Bioinformatics 12.3 (May 1, 2011), pp. 280–287. doi: 10.1093/bib/bbr004. url: http://bib.oxfordjournals.org/content/12/3/ 280 (visited on 10/09/2014).

[151] Florian Buettner, Kedar N. Natarajan, F. Paolo Casale, et al. “Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells”. In: Nature Biotechnology 33.2 (Feb. 2015), pp. 155–160. doi: 10.1038/nbt.3102. url: http://www.nature.com.ezproxy1. lib.asu.edu/nbt/journal/v33/n2/full/nbt.3102.html (visited on 07/18/2017).

[152] Michael G. Ross, Carsten Russ, Maura Costello, et al. “Characterizing and measuring bias in sequence data”. In: Genome Biology 14.5 (May 29, 2013), R51. doi: 10.1186/gb-2013-14-5-r51. url: http://genomebiology.biomedcentral. com.ezproxy1.lib.asu.edu/articles/10.1186/gb-2013-14-5-r51 (visited on 04/11/2017).

[153] Nicholas J. Loman, Joshua Quick, and Jared T. Simpson. “A complete bacterial genome assembled de novo using only nanopore sequencing data”. In: Nature Methods 12.8 (Aug. 2015), pp. 733–735. doi: 10 . 1038 / nmeth . 3444. url: http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v12/n8/full/ nmeth.3444.html (visited on 09/12/2016).

[154] Philippe Christophe Faucon, Robert Trevino, Parithi Balachandran, et al. “High Accuracy Base Calls in Nanopore Sequencing”. In: bioRxiv (Apr. 11, 2017), p. 126680. doi: 10.1101/126680. url: http://biorxiv.org/content/early/2017/ 04/11/126680 (visited on 04/11/2017).

[155] Ivan Sović, Mile Šikić, Andreas Wilm, et al. “Fast and sensitive mapping of nanopore sequencing reads with GraphMap”. In: Nature Communications 7 (Apr. 15, 2016), p. 11307. doi: 10.1038/ncomms11307. url: http://www. nature.com.ezproxy1.lib.asu.edu/ncomms/2016/160415/ncomms11307/full/ ncomms11307.html (visited on 01/09/2017).

[156] Heng Li. “Toward better understanding of artifacts in variant calling from high-coverage samples”. In: Bioinformatics 30.20 (Oct. 15, 2014), pp. 2843– 2851. doi: 10.1093/bioinformatics/btu356. url: https://academic.oup.com/ bioinformatics/article/30/20/2843/2422145/Toward-better-understanding- of-artifacts-in (visited on 04/11/2017).

[157] Konstantin Berlin, Sergey Koren, Chen-Shan Chin, et al. “Assembling large genomes with single-molecule sequencing and locality-sensitive hashing”. In: Nature Biotechnology 33.6 (June 2015), pp. 623–630. doi: 10.1038/nbt.3238.

101 url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n6/abs/ nbt.3238.html (visited on 01/24/2017).

[158] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, et al. “Gapped BLAST and PSI-BLAST: a new generation of protein database search pro- grams”. In: Nucleic Acids Research 25.17 (Sept. 1, 1997), pp. 3389–3402. doi: 10.1093/nar/25.17.3389. url: http://nar.oxfordjournals.org/content/25/17/ 3389 (visited on 12/15/2016).

[159] Chen Yang, Justin Chu, Ren, et al. “NanoSim: nanopore sequence read simulator based on statistical characterization”. In: bioRxiv (Mar. 18, 2016), p. 044545. doi: 10.1101/044545. url: http://biorxiv.org/content/early/2016/03/18/ 044545.1 (visited on 12/16/2016).

[160] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. “PBSIM: PacBio reads simulator—toward accurate genome assembly”. In: Bioinformatics 29.1 (Jan. 1, 2013), pp. 119–121. doi: 10.1093/bioinformatics/bts649. url: http://bioinfor matics.oxfordjournals.org/content/29/1/119 (visited on 12/16/2016).

[161] Bayo Lau, Marghoob Mohiyuddin, John C. Mu, et al. “LongISLND: in silico sequencing of lengthy and noisy datatypes”. In: Bioinformatics 32.24 (Dec. 15, 2016), pp. 3829–3832. doi: 10.1093/bioinformatics/btw602. url: http://www. ncbi.nlm.nih.gov/pmc/articles/PMC5167071/ (visited on 02/15/2017).

[162] Merly Escalona, Sara Rocha, and David Posada. “A comparison of tools for the simulation of genomic next-generation sequencing data”. In: Nature Reviews Genetics 17.8 (Aug. 2016), pp. 459–469. doi: 10.1038/nrg.2016.57. url: http://www.nature.com.ezproxy1.lib.asu.edu/nrg/journal/v17/n8/abs/nrg. 2016.57.html (visited on 04/18/2017).

[163] Matthew Frampton and Richard Houlston. “Generation of Artificial FASTQ Files to Evaluate the Performance of Next-Generation Sequencing Pipelines”. In: PLOS ONE 7.11 (Nov. 12, 2012), e49110. doi: 10.1371/journal.pone.0049110. url: http://journals.plos.org/plosone/article?id=10.1371/journal.pone. 0049110 (visited on 04/18/2017).

[164] Dent A. Earl, Keith Bradnam, John St John, et al. “Assemblathon 1: A competitive assessment of de novo short read assembly methods”. In: Genome Research (Sept. 16, 2011), gr.126599.111. doi: 10.1101/gr.126599.111. url: http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111 (visited on 04/18/2017).

102 [165] Weichun Huang, Leping Li, Jason R. Myers, et al. “ART: a next-generation sequencing read simulator”. In: Bioinformatics 28.4 (Feb. 15, 2012), pp. 593– 594. doi: 10.1093/bioinformatics/btr708. url: https://academic.oup.com/ bioinformatics/article/28/4/593/213322/ART-a-next-generation-sequencing- read-simulator (visited on 04/18/2017).

[166] Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, et al. “Substantial biases in ultra-short read data sets from high-throughput DNA sequencing”. In: Nucleic Acids Research 36.16 (Sept. 2008), e105. doi: 10.1093/nar/gkn425. url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/ (visited on 04/18/2017).

[167] Yen-Chun Chen, Tsunglin Liu, Chun-Hui Yu, et al. “Effects of GC Bias in Next-Generation-Sequencing Data on De Novo Genome Assembly”. In: PLOS ONE 8.4 (Apr. 29, 2013), e62856. doi: 10.1371/journal.pone.0062856. url: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0062856 (visited on 04/18/2017).

[168] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. “DeepNano: Deep Recur- rent Neural Networks for Base Calling in MinION Nanopore Reads”. In: arXiv:1603.09195 [q-bio] (Mar. 30, 2016). arXiv: 1603 . 09195. url: http : //arxiv.org/abs/1603.09195 (visited on 10/11/2016).

[169] Matei David, Lewis Jonathan Dursi, Delia Yao, et al. “Nanocall: An Open Source Basecaller for Oxford Nanopore Sequencing Data”. In: bioRxiv (Mar. 28, 2016), p. 046086. doi: 10.1101/046086. url: http://biorxiv.org/content/early/ 2016/03/28/046086 (visited on 10/11/2016).

[170] Joshua Quick, Nicholas J. Loman, Sophie Duraffour, et al. “Real-time, portable genome sequencing for Ebola surveillance”. In: Nature 530.7589 (Feb. 11, 2016), pp. 228–232. doi: 10 .1038 /nature16996. url: http :/ /www .nature .com . ezproxy1.lib.asu.edu/nature/journal/v530/n7589/full/nature16996.html (visited on 09/12/2016).

[171] Miten Jain, Ian T. Fiddes, Karen H. Miga, et al. “Improved data analysis for the MinION nanopore sequencer”. In: Nature Methods 12.4 (Apr. 2015), pp. 351–356. doi: 10.1038/nmeth.3290. url: http://www.nature.com/nmeth/ journal/v12/n4/full/nmeth.3290.html (visited on 09/13/2016).

[172] Sara Goodwin, James Gurtowski, Scott Ethe-Sayers, et al. “Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome”. In: Genome Research 25.11 (Nov. 1, 2015), pp. 1750–1756. doi:

103 10.1101/gr.191395.115. url: http://genome.cshlp.org/content/25/11/1750 (visited on 09/12/2016).

[173] Tamas Szalay and Jene A. Golovchenko. “De novo sequencing and variant calling with nanopores using PoreSeq”. In: Nature Biotechnology 33.10 (Oct. 2015), pp. 1087–1091. doi: 10.1038/nbt.3360. url: http://www.nature.com/ nbt/journal/v33/n10/full/nbt.3360.html (visited on 09/13/2016).

[174] Chenhao Li, Kern Rei Chng, Esther Jia Hui Boey, et al. “INC-Seq: accurate single molecule reads using nanopore sequencing”. In: GigaScience 5 (Aug. 2, 2016), p. 34. doi: 10.1186/s13742-016-0140-7. url: http://dx.doi.org/10.1186/ s13742-016-0140-7 (visited on 12/15/2016).

[175] Brian D. Ondov, Todd J. Treangen, Adam B. Mallonee, et al. “Fast genome and metagenome distance estimation using MinHash”. In: bioRxiv (Oct. 26, 2015), p. 029827. doi: 10.1101/029827. url: http://biorxiv.org/content/early/ 2015/10/26/029827 (visited on 12/15/2016).

[176] Sergey Koren, Brian P. Walenz, Konstantin Berlin, et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation | bioRxiv. Aug. 24, 2016. url: http://biorxiv.org/content/early/2016/08/24/ 071282 (visited on 12/15/2016).

[177] Chen-Shan Chin, David H. Alexander, Patrick Marks, et al. “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”. In: Nature Methods 10.6 (June 2013), pp. 563–569. doi: 10.1038/nmeth.2474. url: http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v10/n6/ abs/nmeth.2474.html (visited on 12/15/2016).

[178] Sergey Koren, Michael C. Schatz, Brian P. Walenz, et al. “Hybrid error cor- rection and de novo assembly of single-molecule sequencing reads”. In: Nature Biotechnology 30.7 (July 2012), pp. 693–700. doi: 10.1038/nbt.2280. url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v30/n7/full/nbt. 2280.html (visited on 12/15/2016).

[179] Nicolas L. Bray, Harold Pimentel, Páll Melsted, et al. “Near-optimal proba- bilistic RNA-seq quantification”. In: Nature Biotechnology 34.5 (May 2016), pp. 525–527. doi: 10.1038/nbt.3519. url: http://www.nature.com.ezproxy1. lib.asu.edu/nbt/journal/v34/n5/full/nbt.3519.html (visited on 12/15/2016).

[180] Stephen F. Altschul, Warren Gish, Webb Miller, et al. “Basic local alignment search tool”. In: Journal of Molecular Biology 215.3 (Oct. 5, 1990), pp. 403–410.

104 doi: 10.1016/S0022-2836(05)80360-2. url: http://www.sciencedirect.com/ science/article/pii/S0022283605803602 (visited on 12/15/2016).

[181] T. Laver, J. Harrison, P. A. O’Neill, et al. “Assessing the performance of the Oxford Nanopore Technologies MinION”. In: Biomolecular Detection and Quantification 3 (Mar. 2015), pp. 1–8. doi: 10.1016/j.bdq.2015.02.001. url: http://www.sciencedirect.com/science/article/pii/S2214753515000224 (visited on 03/15/2016).

[182] Martin Ester, Hans-Peter Kriegel, Jörg Sander, et al. “A density-based algorithm for discovering clusters in large spatial databases with noise.” In: Kdd. Vol. 96. 1996, pp. 226–231.

[183] Philip Brennecke, Simon Anders, Jong Kyoung Kim, et al. “Accounting for technical noise in single-cell RNA-seq experiments”. In: Nature Methods 10.11 (Nov. 2013), pp. 1093–1095. doi: 10.1038/nmeth.2645. url: http://www. nature.com/nmeth/journal/v10/n11/full/nmeth.2645.html (visited on 03/27/2015).

105