A meta-analysis of models reveals design principles of gene regulatory networks

Claus Kadelkaa,∗, Taras-Michael Butrieb,1, Evan Hiltonc,1, Jack Kinsetha,1, Haris Serdarevica,1

aDepartment of Mathematics, Iowa State University, Ames, Iowa 50011 bDepartment of Aerospace Engineering, Iowa State University, Ames, Iowa 50011 cDepartment of Computer Science, Iowa State University, Ames, Iowa 50011

Abstract Gene regulatory networks (GRNs) describe how a collection of genes governs the processes within a cell. Understanding how GRNs manage to consistently perform a particular function constitutes a key question in cell biology. GRNs are frequently modeled as Boolean networks, which are intuitive, simple to describe, and can yield qualitative results even when data is sparse. We generate an expandable database of published, expert-curated Boolean GRN models, and extracted the rules governing these networks. A meta-analysis of this diverse set of models enables us to identify fundamental design principles of GRNs. The biological term canalization reflects a cell’s ability to maintain a stable phenotype despite ongoing environmental perturbations. Accordingly, Boolean canalizing functions are functions where the output is already determined if a specific variable takes on its canalizing input, regardless of all other inputs. We provide a detailed analysis of the prevalence of canalization and show that most rules describing the regulatory logic are highly canalizing. Independent from this, we also find that most rules exhibit a high level of redundancy. An analysis of the prevalence of small network motifs, e.g. feed-forward loops or loops, in the wiring diagram of the identified models reveals several highly abundant types of motifs, as well as a surprisingly high overabundance of negative regulations in complex feedback loops. Lastly, we provide the strongest evidence thus far in favor of the hypothesis that GRNs operate at the critical edge between order and chaos. Keywords: robustness, stability, canalization, gene regulatory networks, Boolean networks

Introduction

Gene regulatory networks (GRNs) describe how a collection of genes governs the processes within a cell. Un- derstanding how GRNs perform particular functions and do so consistently despite ubiquitous perturbations constitutes a fundamental biological question [1]. Over the last two decades, a variety of design principles of GRNs have been proposed and studied, with a focus on discovering causal links between network form and function.

arXiv:2009.01216v1 [q-bio.MN] 2 Sep 2020 GRNs have been shown to be enriched for certain sub-graphs with a specific structure, so-called network motifs, like feed-forward loops, feedback loops but also larger subcircuits [2, 3, 4, 5]. Theoretical studies of the dynamic properties of these motifs revealed specific functionalities [6, 7]. For example, coherent feed-forward loops can delay the activation or inhibition of a target gene, while incoherent ones can act as accelerators [8]. Other hypothesized design principles include e.g. redundancy in the regulatory logic [9, 10], and a high prevalence of canalization [11]. Canalization, a concept originating from the study of embryonal development [12], refers to the ability of a GRN to maintain a stable phenotype despite ample genotypic as well as environmental variation. Over the past decades, Boolean networks (reviewed in [13]) have become an increasingly popular modeling framework for the study of biological systems, as they are intuitive and simple to describe. When data is

∗Corresponding author, email: [email protected] 1These authors contributed equally to this project as part of a one-semester research experience for first-year Honor students. sparse, as is still often the case for less-studied organisms and processes, complicated models (e.g., continuous differential equation models), which harbor the potential for quantitative predictions, cannot be appropriately fitted to the data due to their high number of parameters [14]. In this case, Boolean network models can often still yield qualitative results. Static network models are comprised of (i) a set of considered nodes (genes, external parameters, etc.), and (ii) a wiring diagram, which describes which node regulates which and often also contains information about the respective type of regulation (positive due to e.g. transcriptional activation vs negative due to f.e. inhibition). A dynamic Boolean network model possesses these same features but obtains its dynamics from an additional set of update rules (i.e., Boolean functions) that describe the regulatory logic governing the expression of each gene. Each gene is either on (i.e., high concentration, expressed) or off (i.e., low concentration, unexpressed) and time is discretized as well. While large, genome-wide static transcriptional network models can be readily assembled from existing databases, the formulation of dynamic Boolean network models requires a careful calibration of the update rules by a subject expert. Therefore, all dynamic Boolean GRN models published thus far focus on specific biological processes of interest and contain only those genes involved in these processes. Here, we describe a comprehensive meta-analysis of all published, expert-curated Boolean GRN models, aiming to obtain a better understanding of the design principles of GRNs, which can explain how they operate smoothly and perform particular functions.

Results

Using the biomedical literature search engine Pubmed, we created a database of 151 Boolean GRN mod- els (see Methods for details). The identified networks describe the regulatory logic underlying a variety of processes in numerous species, ranged in size from 3 to 302 genes (mean = 43.17, median = 23.5), and encom- passed a total of 5698 genes as well as 835 external parameters, which remain constant over time(Fig. 1A). We only included expert-curated models where the nodes and the update rules were selected by hand and not by a prediction algorithm or where default choices like threshold rules were used throughout. We further included only one version of highly similar models, which led to the exclusion of 19 models, resulting in a total of 132 models used in the meta-analysis. A majority of the models (104, 78.8%) contained external parameters like e.g. oxygen or CO2 levels, and as expected, network models with more genes contained on average more external parameters (ρSpearman = 0.5, Fig. 1B). On the other hand, the size of a network was not significantly correlated with the average connectivity, which describes the average number of regulators per gene (ρSpearman = −0.04, Fig. 1C). The average connectivity differed widely across the 132 models; we observed a range of [1.18, 5.49] and a mean average connectivity of 2.37. The distribution of a , in which the edges are distributed randomly, is a Poisson distribution [15]. When considering all update rules separately, we identified that the in- resembled a Poisson distribution, while the out-degree distribution possessed a power-law tail (Fig. 1D), as has been observed for many diverse types of networks [16, 15] including the yeast transcriptional regulatory network [17]. The tails of the two degree distributions differed significantly; we found many more instances of high out-degree versus high in-degree, highlighting the presence of key transcription factors that act as network hubs [18]. Next, we investigated the prevalence of different types of regulations. If gene A regulates gene B, there are three possibilities: (i) gene A may activate gene B, meaning that an increased expression of gene A (i.e., a change from 0 to 1 in the Boolean world) leads to an increased expression of gene B for some states of the other regulators, and possibly to no change in B for other states of the other regulators (ii) gene A may inhibit gene B, meaning that an increase in A leads to a decrease in B for some states of the other regulators, and possibly to no change in B for other states of the other regulators, and (iii) gene A’s effect on gene B may be conditional (i.e., not monotonic), meaning that for some states of the other regulators, A activates B, while for other states of the other regulators, A inhibits B. Except for ten rules with more than 16 inputs, we investigated all update rules, resulting in a total of 15379 analyzed regulators. The majority of regulations were activations (11549, 75.1%), followed by inhibitions (3401, 22.1%) and conditional behavior (197, 1.3%). Regulatory networks in eukaryotes operate mainly by activation of otherwise inactive promoters [19]. On the contrary, many promoters in prokaryotes are by default expressed and require repressors to reduce gene activity [20]. Most of the considered GRN models are for eukaryotes, which may explain the increased

2 genes 132 networks 300 5698 genes external 835 external parameters parameters

200

100 Number of nodes

0 Gene regulatory networks

100 r = 0.5 r = 0.04 p = 8e 10 5 p = 0.63

4 10 3

2 1 Average connectivity 0 1 101 102 101 102 Number of external parameters Number of genes Number of genes

100 1.0 in-degree out-degree 0.8 10 1

0.6 10 2 0.4

proportion proportion activation 10 3 0.2 inhibition conditional 10 4 0.0 0 1 2 5 10 20 50 1 2 3 4 5 6 7 degree number of essential variables

Figure 1: Summary statistics of the analyzed GRN models. (A) Plot of the number of genes and constants (external parameters) for each model sorted by number of genes. (B-C) For each model, the number of genes is plotted against (B) the number of constants and (C) the average in-degree of the genes. The Spearman correlation coefficient and associated p-value are shown in red. (D) In-degree (red circles) and out-degree (black stars) distribution derived from all 5698 update rules. (E) Prevalence of type of regulation (activation: red, inhibition: blue, conditional: black) stratified by number of regulators (in-degree; x-axis).

3 15

10

5

0 0 5 10 15 Number of essential regulators Number of regulators

Figure 2: Discrepancies in some published update rules. For 5688 update rules, the number of regulators in the identified published rule (x-axis) is plotted against the number of essential inputs after simplification of the rule (y-axis).

1.0 canalizing depth canalizing depth Observed 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0.8 Random 0 2 0 0 0 0 0 0 0 0 0 0 0 1000 0 0 0 0 0 0 0 0 0 0 1 0 3089 0 0 0 0 0 0 0 0 0 1 0 1000 0 0 0 0 0 0 0 0 0 0.6 2 0 0 1370 0 0 0 0 0 0 0 0 2 190 0 810 0 0 0 0 0 0 0 0 3 31 2 0 770 0 0 0 0 0 0 0 3 595 109 0 296 0 0 0 0 0 0 0 0.4 4 62 6 0 0 448 0 0 0 0 0 0 4 957 31 3 0 9 0 0 0 0 0 0 5 48 18 5 0 0 211 0 0 0 0 0 5 1000 0 0 0 0 0 0 0 0 0 0 6 47 14 11 0 3 0 118 0 0 0 0 6 1000 0 0 0 0 0 0 0 0 0 0 0.2 Canalizing strength 7 28 10 9 4 1 0 0 44 0 0 0 7 1000 0 0 0 0 0 0 0 0 0 0 8 21 2 1 5 1 0 0 0 42 0 0 8 1000 0 0 0 0 0 0 0 0 0 0 0.0 9 11 7 9 1 2 6 0 0 0 13 0 9 1000 0 0 0 0 0 0 0 0 0 0 3 4 5 6 number of essential inputs essential of number inputs essential of number 10 5 4 2 0 1 0 0 0 0 0 10 10 1000 0 0 0 0 0 0 0 0 0 0 Number of essential inputs

Figure 3: Prevalence of canalization. (A) Stratification of all identified update rules based on the number of essential inputs (rows) and the canalizing depth (columns). Update rules with more than ten inputs were omitted. (B) Expected distribution of the canalizing depth for random Boolean functions for different numbers of essential inputs (0-10). For each row, 1000 random functions were generated. (C) The distribution of the canalizing strength of all observed 0-canalizing functions with 3-6 essential inputs is shown (blue), as well as the expected distribution for random Boolean functions (orange), derived from 1000 samples each. Horizontal lines depict the respective mean values.

prevalence of activation. We further found that activation seems particularly prevalent in situations where a gene’s state is determined by one or only a few regulators (Fig. 1E). Interestingly, we found that 232 of the 15379 regulators (1.5%) were non-essential. That is, these regulators were part of the published update rules but did not have any effect at all. These non-essential regulators were spread across 129 (2.3%) update rules and 26 (19.7%) models. Fig. 2 shows the discrepancy between the number of inputs in the published update rules and the number of inputs that have an actual effect on the update rules. In one extreme case, an update rule with twelve different inputs simplifies to the Boolean zero function. In the rest of this paper, only essential regulators were considered.

Canalization The concept of canalization, already introduced in the 1940s in the context of embryonal development [12], has been proposed as a possible explanation for the remarkable stability of GRNs in the face of ubiquitous perturbations, and accordingly, Boolean canalizing functions have been proposed as ideally suited update functions in Boolean GRN models [11]. Recently, the class of canalizing functions has been further stratified and studied [21, 22]. Some smaller studies support the general hypothesis by revealing an overabundance of canalizing functions in GRN models [23, 24] but a rigorous, comprehensive analysis that considers various types of canalization is still missing. A canalizing function possesses at least one input variable such that, if this variable takes on a certain “canalizing” value, then the output value is already determined, regardless of the values of the remaining input variables. If this variable takes on another value, and there is a second variable with this same property, the function is 2-canalizing. If k variables follow this pattern, the function is k-canalizing [21], and the number of variables that follow this pattern is the canalizing depth of the function [25]. If the canalizing

4 number of symmetry groups number of symmetry groups number of symmetry groups 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 1 3089 0 0 0 0 0 0 0 0 0 0 1 1000 0 0 0 0 0 0 0 0 0 0 1 1000 0 0 0 0 0 0 0 0 0 0 2 968 402 0 0 0 0 0 0 0 0 0 2 587 413 0 0 0 0 0 0 0 0 0 2 502 498 0 0 0 0 0 0 0 0 0 3 289 449 65 0 0 0 0 0 0 0 0 3 63 569 368 0 0 0 0 0 0 0 0 3 117 592 291 0 0 0 0 0 0 0 0 4 84 264 153 15 0 0 0 0 0 0 0 4 1 40 265 694 0 0 0 0 0 0 0 4 31 289 421 259 0 0 0 0 0 0 0 5 23 132 85 32 10 0 0 0 0 0 0 5 0 0 3 26 971 0 0 0 0 0 0 5 4 107 267 327 295 0 0 0 0 0 0 6 16 62 46 49 14 6 0 0 0 0 0 6 0 0 0 0 0 1000 0 0 0 0 0 6 2 54 121 218 206 399 0 0 0 0 0 7 2 22 21 23 19 6 3 0 0 0 0 7 0 0 0 0 0 0 1000 0 0 0 0 7 1 17 44 103 184 138 513 0 0 0 0 8 6 19 15 14 14 2 2 0 0 0 0 8 0 0 0 0 0 0 0 1000 0 0 0 8 0 8 32 98 157 172 126 407 0 0 0 9 1 5 12 12 8 6 4 1 0 0 0 9 0 0 0 0 0 0 0 0 1000 0 0 9 0 2 10 30 65 68 115 158 552 0 0 10 0 7 0 8 1 3 1 1 1 0 0 10 0 0 0 0 0 0 0 0 0 1000 0 10 0 2 8 22 54 85 130 95 96 508 0

number of essential inputs essential of number 11 1 3 0 0 0 3 5 0 0 0 0 inputs essential of number 11 0 0 0 0 0 0 0 0 0 0 1000 inputs essential of number 11 0 0 2 8 22 56 78 83 87 58 606

Figure 4: Prevalence of redundancy. (A) Stratification of all identified update rules based on the number of essential inputs (rows) and the redundancy, measured by the number of symmetry groups (columns). Update rules with more than eleven inputs were omitted. (B-C) Expected distribution of the number of symmetry groups for random Boolean functions for different numbers of essential inputs (1-11). For each row, 1000 random functions were generated. In (C), the distribution of the canalizing depth of the random Boolean functions was matched to the one observed for the GRN models, shown in Fig. 3A.

depth equals the number of inputs (i.e. if all variables follow the described pattern), the function is also called nested canalizing. To test the level of canalization in published GRN models, we stratified all 5698 update rules based on their number of essential inputs and their canalizing depth. The number of Boolean functions with a certain canalizing depth is known [21], and the fraction of random Boolean functions which are canalizing (i.e., those with canalizing depth ≥ 1) decreases quickly as the number of inputs increases (Fig. 3A). Most identified update rules, however, possessed a high canalizing depth, even rules with many inputs (Fig. 3B). Most update rules were even nested canalizing, meaning that all their variables become “eventually” canalizing. These findings agree with earlier, smaller studies [23, 24], which focused solely on the abundance of canalizing and nested canalizing functions but lacked the finer level of detail added by the canalizing depth. Our findings raised an important question: Are GRN models enriched for canalizing functions solely because of the strong overabundance of nested canalizing functions, or is there broader evidence for canalization in general. To answer this, we relied on a broader mathematical definition of the biology-inspired concept of canalization, called collective canalization [26]. Rather than focusing on single inputs that determine the output of a function regardless of the values of the remaining inputs, we studied the proportion of sets of inputs that have this canalizing ability. The recently introduced canalizing strength of a Boolean function summarizes this information in a single measure [27]. By comparing the canalizing strength of all identified non-canalizing update rules with 3 to 6 inputs (i.e., those with canalizing depth = 0) with random non-canalizing Boolean functions, we found that even those update rules, non-canalizing according to the stringent definition, exhibited a higher level of collective canalization than expected (Fig. 3C).

Redundancy Genetic redundancy constitutes another important feature of gene regulation, as the presence of duplicate genes provides robustness against null mutations [9, 10]. We tested the level of redundancy contained in the GRN models by quantifying the number of symmetry groups for each update rule. Two regulators are in the same symmetry group if they have exactly the same effect on the targeted gene, for all possible states of all other regulators. Redundant genes perform the same function and would thus be part of the same symmetry group. We found a much higher level of redundancy in the GRN models (i.e., much fewer symmetry groups; Fig. 4A) than expected by chance (Fig. 4B). This comparison is somewhat skewed by the fact that canalizing functions on average possess fewer symmetry groups. We corrected for this confounding effect and considered random functions with a distribution of the canalizing depth matched to the one observed for update rules in published GRN models (Fig. 4C). Although much smaller, this matched comparison still revealed an increased level of redundancy in the GRN models.

Network motifs Network motifs are sub-graphs with a specific structure that recur throughout a network and often carry out a certain function [3, 4]. Several network motifs are commonly found in large, static GRN models such as the transcriptional network of E. coli [2]. One such motif is the feed-forward (FFL), which consists of three genes: one gene that regulates both others, one target gene that is jointly regulated by both others, and one intermediate gene. In a coherent FFL, the direct effect of the master regulator on the target has the

5 n te F lse novn ee.O h te ad rncitoa ewrsof motif relate networks that transcriptional studies computational hand, or than theoretical other abundant novel requires the more likely much On discrepancies and these for 12, meta-analysis genes. explanation type this 4 in than is involving observed abundant abundance also cluster cerevisiae more its was slightly FFL and which was FFLs hubs, other both 11 factor any in type transcription we involved Interestingly, of regulator motifs, 1D). presence master expected, FFL known (Fig. a the As single features to 6B). cluster the due FFL (Fig. with likely of model As type 75 by This (64472, additionally models. genes abundant. and we 5 GRN involve 6A) that 1 132 to (Fig. clusters the clusters (1535, type such in FFL by most clusters of occurrences found differences we FFL wide of types [28]. revealed 85284 number 15 clusters networks the of FFL engineered all stratified total of and enumerates distribution types natural a 5 the different least of identified on the at set (Fig. We based of diverse share motif clusters distribution a that FFL the the of FFLs of in analysis in two types recent different edges of A distinguish consist inhibiting considered). can dynamic which we and in FFLs, FFLs, FFLs activating single of of of with types clusters different As of the theoretical node. occurrence of differences, one functions the observed the investigated in the on further differences explain focus We To genuine which to interest. needed, due of be models. subset or may GRN process small [6] publication, biological relatively to earlier certain a least similar the on the a studies be focus in in to which involved sizes 8 models, genes sample type GRN of FFL low dynamic found versus to we networks types contrary, of due transcriptional FFL the with be genome-wide models On incoherent FFL may GRN three [6]. incoherent This 5-7) the static the (types of abundant. In FFLs Second, two not incoherent frequent than other did regulation. inhibition. as prevalent all least one This inhibitory outnumbered more at only much one were was contain frequent. only inhibitions, 5) which two less with involving (type of 1E), most be regulations most (Fig. edges. types, inhibitory should type, regulations inhibiting FFL three regulations coherent FFL two inhibitory other all incoherent inhibitory and than two First, any activating activating more the case. as more one involving the including times contain be FFLs types, three types to that prove FFL FFL about expect by remaining total three would in all the occurrences we identified as outnumbered of of we surprising models far that number is GRN FFL Given static This the 3 the stratified type ones. for coherent the (3 reported and Interestingly, As FFLs models 142 type. lent. GRN 5B). exact the (Fig. their of model in and downregulated. determination by or FFLs the motif additionally up- prevented the 3847 and is which that target 5A) of means the sign-sensitive (Fig. total when Here, type either a the direction, [6]. of one identified delays expression in sign-sensitive the We only of as is accelerators function act FFL sign-sensitive a FFLs the as performs act Otherwise, coherent may while gene. FFLs intermediate gene, Incoherent the target types). through FFL effect eight indirect all net displays FFLs the of as types were sign different type, same the exact of their bar) omitted. stacked of are (color-coded determination FFLs proportion any the and without preventing line) Models regulations (black conditional model. Number each contained (B) for which (gray). FFLs, unknown as models. classified GRN 132 the 5: Figure

total count .cerevisiae S. 1000 500 0 . bnac ffe-owr op nteGNmodels GRN the in loops feed-forward of Abundance %.A ntetasrpinlntok of networks transcriptional the in As 8%). otie lotecuieytp 2adhrl n fteohr4gn F lses[8.An [28]. clusters FFL 4-gene other the of any hardly and 12 type exclusively almost contained coherent FFLs 1 2 3 6,teFLwt he ciaigegs(ye4i i.5)poe yfrtems preva- most the far by proved 5A) Fig. in 4 (type edges activating three with FFL the [6], 4 incoherent FFLs 5 6 7 8

unknown

Proportion of specific FFLs 0.0 0.2 0.4 0.6 0.8 1.0 6 .coli E. . %,floe y4(97,22 (19277, 4 by followed 6%), A oa ubro h ieettpso Fsin FFLs of types different the of number Total (A) . and .Coli E. Gene regulatorynetworks .cerevisiae S. . % otie odtoa regulations, conditional contained 7%) and .cerevisiae S. 2] ye6wstemost the was 6 type [28], . % n y3genes 3 by and 6%) incoherent h ye8FFL 8 type the , .coli E. .Coli E. and Fg 5 (Fig. 1 1 1 0 0 0 0 1 2

S. Number of FFLs 15000

10000

total count 5000

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1.0 104

0.8 103 0.6 102 0.4 101 0.2

100 Number of FFL clusters 0.0 Gene regulatory networks Proportion of specific FFL cluster

Figure 6: Abundance of clusters of feed-forward loops in the GRN models. (A) Total number of the different types of clusters of FFLs in the 132 GRN models. Nodes in the motif graphs are color-coded based on their role in the two clustered FFLs: blue nodes only occur as regulators, yellow nodes only occur as intermediate genes, red nodes only occur as target genes, and gray depicts nodes with mixed roles. (B) Number (black line) and proportion (color-coded stacked bar) of the different types of clusters of FFLs for each network. Models without any clusters of FFLs are omitted. structure to motif function. Feedback loops (FBLs) constitute another classical network motif. The parity of the number of negative regulations determines if a FBL is positive (even number) or negative (uneven number). Each gene in a positive (negative) FBL exerts a positive (negative) effect on its own downstream expression. In general, negative FBLs buffer a perturbation and ensure homeostasis, while positive FBLs amplify perturbations and are necessary for bi- or multistationarity [29, 30, 31]. We identified all FBLs involving up to six genes present in the GRN models. For each FBL, we counted the number of positive and negative regulations involved (Fig. 7). Just like FFLs, some FBLs contained conditional regulations, which prevented the determination of their exact type. As expected by chance, we found more complex loops than short 2-loops or even auto- regulatory loops (i.e., 1-loops). Also, loops with a balanced number of positive and negative regulations are combinatorially more likely and were accordingly found more frequently. A general comparison of the num- ber of positive versus negative FBLs is difficult as the expected distribution depends on the particular size of the loop considered. Nevertheless, we observed many more self-reinforcing than self-inhibitory regulations (1-loops). On the other hand, when comparing complex loops of the same type, with the same number of genes and the same number of combinatorially expected occurrences (e.g., 4-loops with 3 versus 1 positive regulations, or 6-loops with 5 versus 1 or 4 versus 2 positive regulations), we discovered a surprising over- abundance of negative regulations in FBLs. The fact that activating (i.e., positive) regulations outnumbered inhibitory (i.e., negative) regulations 3:1(Fig. 1E) makes this finding even more surprising.

Dynamical Robustness Gene regulation is a highly stochastic process due to e.g. low copy numbers of expressed molecules, random transitions between chromatin states, or extrinsic environmental perturbations [32, 31]. Despite these various sources of stochasticity, GRNs must maintain a stable phenotype in order to ensure consistent operation of

7 300 200 150

200 150 100 100 100 50 50 total number of 1-loops total number of 2-loops total number of 3-loops 0 0 0 1+ 0+ N.A. 2+ 1+ 0+ N.A. 3+ 2+ 1+ 0+ N.A. 0 1 0 1 2 0 1 2 3

500 300 1000 400 800 200 300 600 200 400 100 100 200 total number of 4-loops total number of 6-loops total number of 5-loops 0 0 0 4+ 3+ 2+ 1+ 0+ N.A. 5+ 4+ 3+ 2+ 1+ 0+ N.A. 6+ 5+ 4+ 3+ 2+ 1+ 0+ N.A. 0 1 2 3 4 0 1 2 3 4 5 0 1 2 3 4 5 6

Figure 7: Total number of different types of feedback loops in the GRN models. Feedback loops are stratified based on the number of involved genes (sub panels) and based on the number of positive (+) and negative (−) regulations (x-axis). Loops, which contained conditional regulations preventing the determination of their type, were classified as N.A.

r = 0.15 1.2 p = 0.1

1.0

0.8 Derrida value for a single perturbation

101 102 Number of genes

Figure 8: Dynamical robustness of the GRN models. For each GRN model, the average size of a single perturbation after a round of synchronous updates (Derrida value) is plotted against the size of the model. The Spearman correlation coefficient and associated p-value are shown in red.

8 the cellular processes. GRNs must also be able to adapt to lasting changes in the environment. Due to this obvious tradeoff, GRNs have been hypothesized to operate in the so-called critical dynamical regime, on the edge of order and chaos [33]. In Boolean networks, dynamical robustness can be measured using the concept of average sensitivity or more general Derrida plots [34, 35], which describe how a small perturbation affects the network over time. If, on average, the perturbation reduces in size, the system operates in the ordered regime; if it amplifies on average, the system is in the chaotic regime, and if it remains, on average, of the same size, the system exhibits criticality. Many biological systems, modeled using Boolean networks, have already been found to operate in the critical regime [36, 24]. For each GRN model, we computed the Derrida value for a single perturbation (i.e., the average size of a single perturbation after each gene was once synchronous updated). We found that most models exhibited an average sensitivity of close to 1, which constitutes the critical threshold between order and chaos (Fig. 8). Moreover, model size was not associated with the dynamical robustness (ρSpearman = −0.15). This finding provides the strongest evidence presented thus far in favor of the criticality hypothesis.

Conclusion

Gene expression constitutes the most fundamental process in which the genotype determines the phenotype. A detailed understanding of the design principles that regulate this process is therefore of great importance. We utilized combined knowledge from numerous experts in their respective fields to perform a meta-analysis of gene regulatory networks. Boolean networks constituted the perfect modeling framework for this kind of analysis due to their simplicity, easy comparability, and widespread use. A large literature search provided us with the most extensive database of expert-curated Boolean models thus far, which may be queried to generate and test various types of hypotheses. We highlighted the usefulness of this resource by focusing on several proposed design principles of GRNs. We confirmed that the regulatory logic is not random but highly canalized. Using a broader definition of canalization, we showed that even regulatory interactions that were not considered canalizing in previous analyses, exhibited a high level of canalization. Canalization and genetic redundancy are two correlated concepts; GRNs proved to be independently enriched for both. We further studied the presence of small network motifs and discovered various types of motifs that were vastly more or less abundant than expected by chance. Finally, we provided strong evidence for the hypothesis that most GRNs dynamically operate at the edge of order and chaos due to a tradeoff between stability and adaptability. The described analysis suffers from several obvious limitations. First, not all biological phenomena can be accurately described in simple Boolean logic. There are a variety of published models that allow for more than two states. A similar analysis of more general models might provide more detailed insights into gene regulation but will itself suffer from the increased complexity of describing the studied concepts in the non-Boolean case. Second, there exists no feasible way to test the representativeness or completeness of our generated database of Boolean models. Even if a complete database of all published Boolean network models existed, the results would still be biased as some processes and species receive more attention and are modeled more frequently than others. Third, design principles of GRNs will likely differ among kingdoms of life or even among lower taxonomic levels. A stratification of the results based on the organism under consideration would likely yield valuable information.

Methods

Database creation Aiming to identify all published Boolean network models of GRNs, we developed an algorithm that parses all of the more than 30 million abstracts indexed in the literature search engine Pubmed and used keywords to rank the abstracts based on how likely they were to contain a Boolean network model. To identify the keywords, we relied on the Cell Collective, a pre-existing repository of Boolean network models, which, at the time of access, contained 78 Boolean network models published in 65 distinct papers [37]. The abstracts of these 65 papers served as a training set for the identification of keywords indicative of the presence of a Boolean network model. We considered as possible indicators (i) any word that occurred in at least two Cell Collective abstracts and was not among the most common 3000 words found in an English dictionary,

9 (ii) all fixed combinations of two and three words like “logical modeling” or “Boolean network model”, and (iii) all co-occurrences of two or three single words in the same abstract, e.g. the co-occurrence of the words “logical”, “regulatory” and “modelling” in an abstract, not necessarily in the same fixed order. For any possible indicator, we calculated a quality score as the ratio of the number of Cell Collective abstracts in which it occurred over the total number of Pubmed abstracts containing this indicator. This procedure resulted in 1297 publications with at least one indicator with a quality score of 5% or greater. We then manually investigated these 1297 publications to decide if they indeed contained a GRN model. During the manual review, an additional 369 referenced publications were investigated, as they were manually deemed to be of potential interest despite lacking an indicator with quality score ≥ 5%, resulting in a total of 1666 reviewed publications.

Model exclusion To avoid the introduction of various of kinds of bias into the analysis, we used strict criteria for the inclusion of models. We excluded models where the update rules were solely generated using an inference method or where default updates like threshold rules were consistently used. Our goal was to include only models where the update rules were built based on biological expertise and knowledge gained from appropriate experiments. In addition, identical models that were presented in multiple publications were only included once, and we aimed to include the earliest publication that initially presented the model.

Model extraction and standardization Boolean network models are presented in various formats in the literature. Using customized Python scripts, we extracted all published Boolean network models that were not excluded (see Model exclusion) and trans- formed them into a standardized format. In this format, each line describes the regulation of one gene; the name of the regulated gene is on the left, followed by “=”, followed by the Boolean update rule with operators AND, OR, and NOT. External parameters do not have an update rule and only occur in the update rules of the genes they regulate. For example, A = B OR C B = A OR (C AND D) C = NOT A represents a model with three genes, A, B and C, and one external parameter D.

Meta-analysis All analyses were performed in Python 3.7 using the libraries numpy, scipy.stats, and itertools. In particular, we wrote a Python script that takes as input a Boolean model, described in standardized format, and returns, among other things, an of the wiring diagram of the model, as well as completely evaluated update rules. That is, each update rule of k inputs is represented as a vector of length 2k, which together with the wiring diagram enables all presented analyses. An additional quality check ensured that highly similar models were only analyzed once. For any combination of two models, we used the Jaccard index to determine the similarity of the set of model variables. We then manually reviewed all pairs with greater than 80% similarity. For each pair/multiple of highly similar models, we removed all but one model from the analysis. This additional quality control step led to the exclusion of 19 out of the 151 identified GRN models. For computational reasons, we restricted most analyses to update rules with 16 or fewer inputs. The four GRN models that contained a total of ten rules with more inputs were excluded from the analyses reported in Figs. 5, 6, 7, 8, 2, as the specific types of regulation (activation, inhibition, conditional) and number of essential inputs could not be determined for rules with so many inputs.

Acknowledgements

Haris Serdarevic, Jack Kinseth, Taras-Michael Butrie and Evan Hilton all contributed equally to this project as part of a one-semester research experience for first-year Honor students. Throughout the course of the semester, they each considered about 200 papers and extracted any Boolean network models they encoun- tered. The authors thank Audrey McCombs for helpful comments on the manuscript.

10 Data availability

The database of Boolean GRN models is available upon email request, and will be made publicly available once the paper is accepted for journal publication.

References

References

[1] J. Stelling, U. Sauer, Z. Szallasi, F. J. Doyle III, J. Doyle, Robustness of cellular functions, Cell 118 (6) (2004) 675–685. [2] S. S. Shen-Orr, R. Milo, S. Mangan, U. Alon, Network motifs in the transcriptional regulation network of escherichia coli, Nature genetics 31 (1) (2002) 64–68. [3] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, U. Alon, Network motifs: simple building blocks of complex networks, Science 298 (5594) (2002) 824–827.

[4] U. Alon, Network motifs: theory and experimental approaches, Nature Reviews Genetics 8 (6) (2007) 450–461. [5] M. B. Gerstein, A. Kundaje, M. Hariharan, S. G. Landt, K.-K. Yan, C. Cheng, X. J. Mu, E. Khurana, J. Rozowsky, R. Alexander, et al., Architecture of the human regulatory network derived from encode data, Nature 489 (7414) (2012) 91–100. [6] S. Mangan, U. Alon, Structure and function of the feed-forward loop network motif, Proceedings of the National Academy of Sciences 100 (21) (2003) 11980–11985. [7] J. Cotterell, J. Sharpe, An atlas of gene regulatory networks reveals multiple three-gene mechanisms for interpreting morphogen gradients, Molecular systems biology 6 (1) (2010) 425.

[8] S. Mangan, A. Zaslaver, U. Alon, The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks, Journal of molecular biology 334 (2) (2003) 197–204. [9] M. A. Nowak, M. C. Boerlijst, J. Cooke, J. M. Smith, Evolution of genetic redundancy, Nature 388 (6638) (1997) 167–171.

[10] Z. Gu, L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, W.-H. Li, Role of duplicate genes in genetic robustness against null mutations, Nature 421 (6918) (2003) 63–66. [11] S. A. Kauffman, Metabolic stability and epigenesis in randomly constructed genetic nets, Journal of theoretical biology 22 (3) (1969) 437–467.

[12] C. H. Waddington, Canalization of development and the inheritance of acquired characters, Nature 150 (1942) 563–565. [13] J. D. Schwab, S. D. K¨uhlwein, N. Ikonomi, M. K¨uhl,H. A. Kestler, Concepts in boolean network modeling: What do they all mean?, Computational and Structural Biotechnology Journal. [14] G. Karlebach, R. Shamir, Modelling and analysis of gene regulatory networks, Nature Reviews Molecular Cell Biology 9 (10) (2008) 770–780. [15] R. Albert, A.-L. Barab´asi,Statistical mechanics of complex networks, Reviews of modern physics 74 (1) (2002) 47. [16] A.-L. Barab´asi,R. Albert, of scaling in random networks, science 286 (5439) (1999) 509–512.

[17] N. Guelzim, S. Bottani, P. Bourgine, F. K´ep`es,Topological and causal structure of the yeast transcrip- tional regulatory network, Nature genetics 31 (1) (2002) 60–63.

11 [18] N. M. Luscombe, M. M. Babu, H. Yu, M. Snyder, S. A. Teichmann, M. Gerstein, Genomic analysis of regulatory network dynamics reveals large topological changes, Nature 431 (7006) (2004) 308–312. [19] L. Raeymaekers, Dynamics of boolean networks controlled by biologically meaningful functions, Journal of Theoretical Biology 218 (3) (2002) 331–341.

[20] K. Struhl, Fundamentally different logic of gene regulation in eukaryotes and prokaryotes, Cell 98 (1) (1999) 1–4. [21] Q. He, M. Macauley, Stratification and enumeration of boolean functions by canalizing depth, Physica D: Nonlinear Phenomena 314 (2016) 1–8. [22] C. Kadelka, J. Kuipers, R. Laubenbacher, The influence of canalization on the robustness of boolean networks, Physica D: Nonlinear Phenomena 353 (2017) 39–47. [23] S. E. Harris, B. K. Sawhill, A. Wuensche, S. Kauffman, A model of transcriptional regulatory networks based on biases in the observed regulation rules, Complexity 7 (4) (2002) 23–40. [24] B. C. Daniels, H. Kim, D. Moore, S. Zhou, H. B. Smith, B. Karas, S. A. Kauffman, S. I. Walker, Criticality distinguishes the ensemble of biological regulatory networks, Physical review letters 121 (13) (2018) 138102. [25] L. Layne, E. Dimitrova, M. Macauley, Nested canalyzing depth and network stability, Bulletin of math- ematical biology 74 (2) (2012) 422–433. [26] C. O. Reichhardt, K. E. Bassler, Canalization and symmetry in Boolean models for genetic regulatory networks, Journal of Physics A: Mathematical and Theoretical 40 (2007) 4339. [27] C. Kadelka, B. Keilty, R. Laubenbacher, Collectively canalizing Boolean functions, arXiv 2008.13741. [28] T. E. Gorochowski, C. S. Grierson, M. di Bernardo, Organization of feed-forward loop motifs reveals architectural principles in natural and engineered networks, Science advances 4 (3) (2018) eaap9751.

[29] R. Thomas, R. d’Ari, Biological feedback, CRC press, 1990. [30] R. Thomas, D. Thieffry, M. Kaufman, Dynamical behaviour of biological regulatory networks?i. biolog- ical role of feedback loops and practical use of the concept of the loop-characteristic state, Bulletin of mathematical biology 57 (2) (1995) 247–276.

[31] M. Kaern, T. C. Elston, W. J. Blake, J. J. Collins, Stochasticity in gene expression: from theories to phenotypes, Nature Reviews Genetics 6 (6) (2005) 451–464. [32] M. B. Elowitz, A. J. Levine, E. D. Siggia, P. S. Swain, Stochastic gene expression in a single cell, Science 297 (5584) (2002) 1183–1186. [33] M. Aldana, E. Balleza, S. Kauffman, O. Resendiz, Robustness and evolvability in genetic regulatory networks, Journal of theoretical biology 245 (3) (2007) 433–448. [34] B. Derrida, G. Weisbuch, Evolution of overlaps between configurations in random Boolean networks, Journal de Physique 47 (1986) 1297–1303. [35] B. Derrida, Y. Pomeau, Random networks of automata: a simple annealed approximation, Europhysics Letters 1 (1986) 45. [36] E. Balleza, E. R. Alvarez-Buylla, A. Chaos, S. Kauffman, I. Shmulevich, M. Aldana, Critical dynamics in genetic regulatory networks: examples from four kingdoms, PLoS One 3 (6) (2008) e2456. [37] T. Helikar, B. Kowal, S. McClenathan, M. Bruckner, T. Rowley, A. Madrahimov, B. Wicks, M. Shrestha, K. Limbu, J. A. Rogers, The cell collective: toward an open and collaborative approach to systems biology, BMC systems biology 6 (1) (2012) 96.

12