Z34Bio: An SMT-based Framework for Analyzing Biological Computation
Boyan Yordanov, Christoph M. Wintersteiger, Youssef Hamadi and Hillel Kugler
Microsoft Research, Cambridge, UK, http://research.microsoft.com/z3-4biology

Abstract

The basic principles governing the development and function of living organisms remain only partially understood, despite significant progress in molecular and cellular biology and tremendous breakthroughs in experimental methods. The development of system-level, mechanistic, computational models has the potential to become a foundation for improving our understanding of natural biological systems, and for designing engineered biological systems with wide-ranging applications in nanomedicine, nanomaterials and computing. We describe Z34Bio (Z3 for Biology), a unified SMT-based framework for the automated analysis of natural and engineered biological systems. Z34Bio enables addressing important biological questions, and studying models more complex than previously possible. The framework provides a formalization of the semantics of several model classes used widely for biological systems, which we illustrate through the treatment of chemical reaction networks and Boolean networks. We present case-studies which we make available as SMT-LIB benchmarks, to enable comparison of different analysis techniques, and towards making this new domain accessible to the formal verification community.

1 Introduction

Many mechanisms and properties of biological systems remain only partially understood, thus limiting our comprehension of natural living systems and processes. Recently, advanced experimental techniques have enabled the rational design and construction of biological systems, delineating a branch of biology as an engineering discipline, with potential applications in nanomedicine, nanomaterials and computing.
However, understanding the system-level behavior of organisms or designing ones with specific behavior remains a major challenge for the engineering and the reverse engineering of biological systems.

Computational modeling, together with methods enabling the automated analysis of realistic models for diverse biological queries, can help address these challenges and tackle important questions related to biological computation: the information processing within living organisms. Along this direction, we introduce Z34Bio (Z3 for Biology) as a framework that allows flexible and scalable analysis of biological models using Satisfiability Modulo Theories (SMT)-based procedures. The framework provides a formalization of the semantics of several widely used formalisms in biological modeling, which we illustrate through the treatment of chemical reaction networks (CRNs) and Boolean networks (BNs), as well as combinations thereof. These formalisms are useful for describing DNA computing circuits (as well as more general biochemical mechanisms within natural systems) and biological interaction networks such as gene regulation networks (GRNs).

We formalize the semantics of CRNs and BNs as transition systems, which we represent and analyze symbolically using SMT to allow flexible and convenient encoding of (possibly infinite-state) biological models. The richness of the various SMT logics also allows us to express a range of important biological properties that are not easily captured by other specification formalisms. For instance, we are able to formalize and study certain mass-conservation properties and the effect of gene knockouts on system dynamics.
The availability of efficient decision procedures for some SMT logics such as uninterpreted functions and bit vectors (UFBV) with quantifiers [25] provides a foundation for the analysis of such questions, even for large and complex systems. While the Z3 theorem prover [9] is used in Z34Bio, arbitrary SMT solvers can be substituted in the framework through the SMT-LIB input language.

In [27] we showed how SMT-based methods can be applied to engineered biological systems, specifically in DNA computing and synthetic biology. Here we present a framework supporting this approach accompanied by an online tool, extend it to allow modeling and reasoning about biological computation within living systems via Boolean networks, and provide support for hybrid models composed of CRNs and BNs. We outline a number of case-studies illustrating the analysis of engineered DNA circuits and genetic regulatory networks (GRNs), which we curate and make available as SMT-LIB benchmarks, with the goal of improving the evaluation of existing SMT algorithms, helping in the development of new methods, and making this auspicious application domain more accessible to the SMT community.

2 Chemical Reaction Networks and Boolean Networks

In the field of DNA computing, which aims at engineering and understanding forms of computation performed by biological material (e.g., reacting DNA strands), chemical reaction networks (CRNs) serve as models of circuits [24, 18]. More generally, CRNs are often used to describe a number of natural and engineered biochemical mechanisms. Here, we study such systems with single-molecule resolution, abstracting from the exact reaction kinetics (rates), thereby approximating probabilistic systems by non-deterministic ones. While certain information is not captured in this representation of the behavior of a CRN, it is a useful level of detail for various studies of DNA circuits, including cases where functional correctness is under investigation.
Where studies of natural biological systems are concerned, this is often also a useful abstraction, for example when the rates of certain reactions are unknown and a precise measurement in a wet lab is challenging.

We treat a CRN as a pair $(S, R)$ of species (different DNA strands) and reactions, where a reaction $r \in R$ is a pair of multisets $r = (R_r, P_r)$ describing the reactants (inputs) and products (outputs) of $r$ with their stoichiometries (the numbers of participating strands). We formalize the behavior of a CRN as the transition system $\mathcal{T} = (Q, T)$, where a state $q \in Q$ is a multiset of species, $q(s)$ indicates how many strands of $s$ are available in state $q$, and $T$ is the transition relation defined as

$$T(q, q') \leftrightarrow \bigvee_{r \in R} \Big[ \mathit{on}(r, q) \wedge \bigwedge_{s \in S} q'(s) = q(s) - R_r(s) + P_r(s) \Big],$$

where $\mathit{on}(r, q)$ is true if in state $q$ there are enough molecules of each reactant of $r$ for it to fire, that is, $q(s) \geq R_r(s)$ for every $s \in S$.

The complementarity of DNA sequences, dictated by the binding of Watson-Crick DNA base pairs (A-T and G-C), provides a mechanism for engineering chemical reaction networks using DNA. In this approach, various single and double-stranded DNA molecules are designated as chemical species. The binding, unbinding and displacement reactions possible between the complementary DNA domains (subsequences) of these species form the desired CRN structure. When specific computational operations are implemented using such a strategy, the resulting system is called a DNA circuit (see [18] and the references therein for additional details on the formalization and design of DNA circuits).

Figure 1 (left panel) shows a simple DNA circuit implementing a logical AND gate. The system is represented as a CRN with seven different species (A, B, C, Gate, GateA, GateB, GateAB) and four reactions, two of which are reversible as indicated by the bi-directional arrows.
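To make the transition relation $T(q, q')$ defined earlier in this section concrete, the following is a minimal explicit-state sketch in Python. It is not the symbolic SMT encoding used by the framework. It only enumerates the successors of a single state. The species names `X`, `Y`, `Z` and the helper names `on`, `fire` and `successors` are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# A reaction is a pair (reactants, products) of multisets, mirroring r = (R_r, P_r).
# ASSUMPTION: the species X, Y, Z and these two reactions are made up for
# illustration; they do not come from the paper.
REACTIONS = [
    (Counter({"X": 1, "Y": 1}), Counter({"Z": 1})),  # X + Y -> Z
    (Counter({"Z": 1}), Counter({"X": 1, "Y": 1})),  # Z -> X + Y
]


def on(reaction, state):
    """True if the state holds enough molecules of every reactant."""
    reactants, _ = reaction
    return all(state[s] >= n for s, n in reactants.items())


def fire(reaction, state):
    """Return q' with q'(s) = q(s) - R_r(s) + P_r(s) for every species s."""
    reactants, products = reaction
    nxt = Counter(state)
    nxt.subtract(reactants)
    nxt.update(products)
    return +nxt  # unary plus drops species whose count is zero


def successors(state, reactions=REACTIONS):
    """All q' with T(q, q'): exactly one enabled reaction fires."""
    return [fire(r, state) for r in reactions if on(r, state)]


q0 = Counter({"X": 2, "Y": 1})
print(successors(q0))
```

From the state with two `X` and one `Y`, only the forward reaction is enabled, so the single successor holds one `X` and one `Z`. The disjunction over reactions in $T$ corresponds to the list comprehension in `successors`, and the conjunction over species corresponds to the multiset update in `fire`.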
Species A and B represent the two system inputs, species Gate is the actual AND gate, and species C is the output (all other species are intermediates). A state of the system captures the number of available molecules from each DNA species, which change as reactions take place, leading to the transition system representation in Figure 1 (right panel).

Figure 1: A simple DNA circuit implementing a logical AND gate. For each species (Gate, A, B, GateA, GateB, GateAB, C), domains labeled by $1, \ldots, 4$ represent different DNA sequences, while complementary sequences are denoted by $*$ (e.g., domains $1$ and $1^*$ are complementary). The binding of complementary domains and the subsequent displacement of adjacent complementary sequences determines the possible chemical reactions ($r_0, \ldots, r_5$) between the DNA species (left panel). The DNA circuit is represented as a transition system (right panel) where a state captures the number of molecules from each species and the initial state is highlighted using a thick black border. For this system, a state can be reached where no additional reactions are possible (shown with a red border), where computation terminates. For a single molecule of species Gate, the output C is produced at the end of the computation only if both input species A and B are present, which captures the required logical AND behavior.

In some applications, it is sufficient to describe species more coarsely, using a small number of discrete levels of activity. This has proven to be a most useful abstraction, especially for analyzing the dynamics of species within gene regulatory networks (GRNs) [17], e.g., during the life-cycle of a cell or an organism. Unlike the biological engineering applications described above, the focus here is on understanding natural systems and, often, only the species' presence or absence or the activity or inactivity of genes is tracked.
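Returning to the AND-gate circuit of Figure 1, the claim that C is produced only when both inputs are present can be checked on a small example by explicit reachability analysis. The sketch below is a plain breadth-first search, not the SMT-based analysis of the framework. The concrete reactions are an assumption: the text states only that there are seven species and four reactions, two of them reversible, so the reaction list here is one plausible reading and may differ from the figure. The helper names `rxn`, `reachable`, `terminal_states` and `produces_c` are likewise hypothetical.

```python
from collections import Counter, deque


def rxn(reactants, products):
    """Build a (reactants, products) pair of multisets from species names."""
    return Counter(reactants.split()), Counter(products.split())


# ASSUMPTION: these six unidirectional reactions are a guess consistent with
# "four reactions, two of which are reversible"; they are not taken from the paper.
REACTIONS = [
    rxn("Gate A", "GateA"), rxn("GateA", "Gate A"),  # reversible binding of A
    rxn("Gate B", "GateB"), rxn("GateB", "Gate B"),  # reversible binding of B
    rxn("GateA B", "GateAB C"),                      # B arrives second, C released
    rxn("GateB A", "GateAB C"),                      # A arrives second, C released
]


def successors(state):
    """States reachable from `state` by firing one enabled reaction."""
    out = []
    for reactants, products in REACTIONS:
        if all(state[s] >= n for s, n in reactants.items()):
            nxt = Counter(state)
            nxt.subtract(reactants)
            nxt.update(products)
            out.append(+nxt)  # drop zero counts so equal states compare equal
    return out


def reachable(initial):
    """Breadth-first search; terminates only if the reachable set is finite."""
    start = +Counter(initial)
    seen = {frozenset(start.items())}
    queue, states = deque([start]), [start]
    while queue:
        for nxt in successors(queue.popleft()):
            key = frozenset(nxt.items())
            if key not in seen:
                seen.add(key)
                queue.append(nxt)
                states.append(nxt)
    return states


def terminal_states(initial):
    """Reachable states in which no reaction is enabled."""
    return [q for q in reachable(initial) if not successors(q)]


def produces_c(initial):
    """True if some reachable state contains at least one molecule of C."""
    return any(q["C"] > 0 for q in reachable(initial))


for inputs in ("", "A", "B", "A B"):
    init = Counter(("Gate " + inputs).split())
    print(inputs or "(none)", produces_c(init))
```

Under the assumed reactions, a single Gate molecule with one A and one B reaches four states, and the only terminal state holds GateAB and C. With only one input present, the system oscillates between the bound and unbound forms and never produces C.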
A Boolean network, a popular representation of a GRN, is given as a pair $(S, F)$ of a set of species and a set of update functions. We capture the behavior over