This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCNS.2017.2758966, IEEE Transactions on Control of Network Systems

Identification of Nonlinear State-Space Systems from Heterogeneous Datasets

Wei Pan, Ye Yuan, Lennart Ljung, Jorge Gonçalves and Guy-Bart Stan

Abstract: This paper proposes a new method to identify nonlinear state-space systems from heterogeneous datasets. The method is described in the context of identifying biochemical/gene networks (i.e., identifying both reaction dynamics and kinetic parameters) from experimental data. Simultaneous integration of various datasets has the potential to yield better performance for system identification. Data collected experimentally typically vary depending on the specific experimental setup and conditions. Typically, heterogeneous data are obtained experimentally through (a) replicate measurements from the same biological system or (b) application of different experimental conditions such as changes/perturbations in biological inductions, temperature, gene knock-out, gene over-expression, etc. We formulate here the identification problem using a Bayesian learning framework that makes use of “sparse group” priors to allow inference of the sparsest model that can explain the whole set of observed, heterogeneous data. To enable scale-up to a large number of features, the resulting non-convex optimisation problem is relaxed to a re-weighted Group Lasso problem using a convex-concave procedure. As an illustrative example of the effectiveness of our method, we use it to identify a genetic oscillator (generalised eight species repressilator). Through this example we show that our algorithm outperforms Group Lasso when the number of experiments is increased, even when each single time-series dataset is short. We additionally assess the robustness of our algorithm against noise by varying the intensity of process noise and measurement noise.

Dr Wei Pan gratefully acknowledges the support of Microsoft Research through the PhD Scholarship Program for his stay at Imperial College London. Dr Guy-Bart Stan gratefully acknowledges the support of the EPSRC grant EP/P009352/1 and of the EPSRC Fellowship for Growth EP/M002187/1. (Corresponding author: Prof. Ye Yuan)

W. Pan is with the Department of Bioengineering, Imperial College London, United Kingdom and with DJI Innovations, Shenzhen, China.
Y. Yuan is with School of Automation, Huazhong University of Science and Technology, China.
L. Ljung is with Division of Automatic Control, Department of Electrical Engineering, Linköping University, Sweden.
J. Gonçalves is with the Control Group, Department of Engineering, University of Cambridge, United Kingdom and with the Luxembourg Centre for Systems Biomedicine, Luxembourg.
G.-B. Stan is with the Department of Bioengineering, Imperial College London, United Kingdom.

The corresponding code is available at https://github.com/panweihit/BSID.

2325-5870 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

The problem of identifying biological networks from experimental time-series data is of fundamental interest in systems and synthetic biology [1]–[3]. Tools from system identification [4] can be applied for such purposes. However, most system identification methods produce estimates of model parameters based on data coming from a single experiment. The interest in identification methods able to handle several datasets simultaneously is twofold. Firstly, with the increasing availability of “big data” obtained from sophisticated biological instruments, e.g., large ‘omics’ datasets, attention has turned to the efficient and effective integration of these data and to the maximum extraction of information from them. Such datasets typically contain (a) data from replicates of an experiment performed on a biological system of interest under identical experimental conditions, or (b) data measured from a biochemical network subjected to different experimental conditions, for example, different biological inducers, temperature, stress factors, gene knock-out or gene over-expression. The challenges for simultaneously considering heterogeneous datasets during system identification are: (a) the system itself is unknown, i.e., neither the structure nor the corresponding parameters are known; (b) it is unclear how heterogeneous datasets collected under different experimental conditions influence the “quality” of the identified system; (c) each individual time series may be short. The second and third points are particularly important because biological experiments become increasingly costly in time and resources when long time-series datasets are required. Furthermore, repeat or perturbation experiments may be conducted over different time ranges, with different sampling frequencies, under various conditions, and in different laboratories, which likely affects the success of identification.

Another important consideration comes from the purpose of dynamic models. Highly detailed or complex models are typically difficult to handle using rigorous control design methods. Therefore, one typically prefers to use simple or sparse models that capture as well as possible the dynamics expressed in the collected data. The identification and use of simple or sparse models inevitably introduces model class uncertainties and parameter uncertainties [5], [6]. To assess these uncertainties, replicates of multiple experiments are typically necessary.

In the context of biology, the use of kinetic models to understand the function of biological systems has already been successfully illustrated in [7], [8]. Furthermore, the use of heterogeneous datasets during system identification has been proposed as a means to improve the accuracy of genetic regulatory network reconstruction methods [9]. Typically, biological experiments are accompanied by a set of corresponding reference control experiments, whose profiles are used to determine differential gene expression [10], [11]. Modern techniques try to harness the “wisdom of crowd” concept by integrating the predictions from multiple datasets into a single reconstructed network termed the “consensus gene regulatory network” [12], [13]. For instance, in [13], the authors grouped the algorithms by applying the Euclidean distance on the confidence scores of the links in the inferred networks. They showed that integration of diverse algorithms outperformed each individual inference method. The consensus network was obtained in three ways: average of the estimated coefficients over conditions, a priori biological knowledge, and pre-calculated coefficients obtained from the application of a Gaussian graphical model [14] on the combined datasets. However, the problem of accurate reconstruction of gene regulatory networks is far from fully resolved. Recent works [15], [16] advanced the state of the art by using new types of regularisation techniques. Unfortunately, the dynamical models considered so far have been mostly constrained to linear systems, an assumption that is rarely satisfied by biological systems.

Our approach is based on the concept of sparse Bayesian learning [2], [17], [18] and on the definition of a unified optimisation problem that allows model identification from heterogeneous datasets and whose solution is a model consistent with all datasets available for identification. The ability to consider various datasets simultaneously can potentially avoid non-identifiability issues arising when a single dataset is used [19].

The main contributions of this paper are as follows:
• Formulation of a nonlinear identification problem using datasets from heterogeneous experiments.
• Derivation of a sparse Bayesian formulation of this identification problem by introducing “sparse group” priors.
• Relaxation of the resulting non-convex optimisation prob-

$A \succeq 0$ means $A$ is positive semidefinite. A vector $\gamma \succeq 0$ means each element in $\gamma$ is non-negative.

II. PROBLEM FORMULATION

A. Model

We consider dynamical systems described by nonlinear differential/difference equations with additive process noise:

$$
\delta(x_{nt}) = f_n(x_t, u_t)\, v_n + \xi_{nt} = \sum_{s=1}^{N_n} v_{ns} f_{ns}(x_t, u_t) + \xi_{nt}, \qquad n = 1, \ldots, n_x, \tag{1}
$$

where $x_t$ is the state variable and $u_t$ is the external control input; $x_{nt}$ represents the $n$-th state variable at time $t$ (similarly for $u_{nt}$); $\delta(x_{nt}) = \dot{x}_{nt}$ for continuous-time systems; $\delta(x_{nt}) = x_{nt}$ or $x_{nt} - x_{n,t-1}$ or some known transformation of historical data for discrete-time systems; $v_{ns} \in \mathbb{R}$, and $f_{ns}(x_t, u_t): \mathbb{R}^{n_x + n_u} \to \mathbb{R}$ and $v_n$ are the basis functions and corresponding parameters, respectively, that govern the dynamics, where $n_x$ and $n_u$ are the dimensions of $x$ and $u$ respectively. The functions $f_{ns}(x_t, u_t)$ are assumed to be Lipschitz continuous. $\xi_{nt}$ represents additive process noise, which is assumed to be i.i.d. Gaussian. Note that we do not assume a priori knowledge of the form of the nonlinear functions appearing on the right-hand side of the equations in (1),