In silico cell biology and biochemistry: a systems biology approach

Diogo M. Camacho

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Bioinformatics and Computational Biology

Pedro Mendes, Chair
Ina Hoeschele
Reinhard Laubenbacher
Vladimir Shulaev
Brenda Winkel

June 1, 2007 Blacksburg, Virginia

Keywords: Systems biology, computer simulation, mathematical modeling, reverse engineering, computational biology, biochemistry

Copyright 2007, Diogo M. Camacho

Abstract

In silico cell biology and biochemistry: a systems biology approach

Diogo M. Camacho

In the post-‘omic’ era the analysis of high-throughput data is regarded as one of the major challenges faced by researchers. One focus of this data analysis is uncovering biological network topologies and dynamics. It is believed that this kind of research will allow the development of new mathematical models of biological systems as well as aid in the improvement of already existing ones. The work that is presented in this dissertation addresses the problem of the analysis of highly complex data sets with the aim of developing a methodology that will enable the reconstruction of a biological network from time series data through an iterative process. The first part of this dissertation relates to the analysis of existing methodologies that aim at inferring network structures from experimental data. This spans the use of statistical tools such as correlation analysis (presented in Chapter 2) to more complex mathematical frameworks (presented in Chapter 3). A novel methodology that focuses on the inference of biological networks from time series data by least squares fitting will then be introduced. Using a set of carefully designed inference rules one can gain important information about the system which can aid in the inference process. The application of the method to a data set from the response of the yeast Saccharomyces cerevisiae to cumene hydroperoxide is explored in Chapter 5. The results show that this method can be used to generate a coarse-level mathematical model of the biological system at hand. Possible developments of this method are discussed in Chapter 6.

This work was financially sponsored by the National Institutes of Health under grant R01-GM068947.

To my parents, to my sister

Acknowledgments

I would like to thank, first and foremost, Dr. Pedro Mendes, my principal advisor, for all his help and guidance throughout my whole Ph.D.: for providing his insight on the subject at hand, for inciting discussions that would lead to new ideas and new approaches, and for stimulating me to work harder to achieve the goals I set for myself. Secondly, I would like to thank my advisory committee: Dr. Brenda Winkel, Dr. Ina Hoeschele, Dr. Reinhard Laubenbacher and Dr. Vladimir Shulaev. A special thanks also to Dr. Ana Martins, who was always a good friend in this adventure to foreign lands to pursue a career in science. Many thanks to my fellow countrymen and countrywomen that I met here in Blacksburg. Angela, Inês, João, Polanah, Beatriz and the German-American-Portuguese crowd Renate, Romy and Katja, this thesis is also a bit yours, as your friendly manners and the joy you always bring to wherever we may be made my days brighter. A warm thank-you note for all my friends back home, in the US, or in other parts of the world, especially Pedro, Nuno and Catarina. A word of appreciation for everyone that, either in Pedro’s group or in other research groups at the VBI, worked or exchanged ideas with me. To avoid missing anyone, I’ll refrain from naming you all. A very special thanks to Wei Sha, a cubicle-mate since the day I came to the VBI, whose joy in life and interest in science made our frustrations seem insignificant and stimulated our interest in whatever problem we had to face in the research projects involved in both our dissertations. To Emily, for being you, for being here when I needed you. And last, but certainly not least, a big big thank you to my parents and to my sister, to whom I dedicate this work, for their unlimited love and support.

Attribution

Pedro Mendes, Ph.D. (Virginia Bioinformatics Institute), now also a faculty member of the Manchester Interdisciplinary Biocentre at the University of Manchester and a Professor at the Computer Science Department of the University of Manchester. Dr. Mendes is the primary advisor and committee chair. Dr. Mendes provided important guidance in all of the projects I was involved in during my research, from the conception of the ideas to the completion of the project. Dr. Mendes also provided funding that allowed me to pursue my research at Virginia Tech.

Chapter 2

Alberto de la Fuente, Ph.D. (Virginia Bioinformatics Institute), now coordinator of the RAGNO Group at the Center for Advanced Studies, Research and Development in Sardinia, was a graduate student at Vrije Universiteit in Amsterdam working under the guidance of Dr. Mendes. Dr. de la Fuente was responsible for the derivation of the mathematical link between metabolic control analysis and correlations.

Ana Martins, Ph.D. (Virginia Bioinformatics Institute) is a Research Associate working under Dr. Mendes. Dr. Martins was the lab coordinator and the person in charge of the experimental setups, from growth conditions to sample collection and preparation for the different techniques to be applied.

Wei Sha, Ph.D. (Virginia Bioinformatics Institute) was a graduate student of Dr. Mendes. Dr. Sha was responsible for the statistical analysis of microarray data.

Joel Shuman, Ph.D. (Virginia Bioinformatics Institute) is a Metabolomics Specialist and Laboratory Manager working under Dr. Vladimir Shulaev. Dr. Shuman was responsible for the metabolite profiling experiments.

Chapter 3

Paola Vera-Licona, Ph.D. (Department of Mathematics, Virginia Tech), now at the BioMaps Institute at Rutgers University, was a graduate student working under Dr. Reinhard Laubenbacher. Dr. Vera-Licona performed an analysis of Dynamic Bayesian networks and their performance under the conditions of the study performed in this Chapter.

Chapter 4

Abdul Jarrah, Ph.D. (Department of Mathematics, Virginia Tech), is a Research Associate working under Dr. Laubenbacher. Dr. Jarrah provided me with insightful pointers and discussions throughout the development of the method presented.

Brandy Stigler, Ph.D. (Department of Mathematics, Virginia Tech), now employed at the Mathematical Biosciences Institute at Ohio State University, was a graduate student working under Dr. Laubenbacher. Dr. Stigler contributed to this Chapter by discussing the problem of reverse engineering approaches and possible improvements that could be made in such approaches.

Chapter 5

Ana Martins, Ph.D. (Virginia Bioinformatics Institute) is a Saccharomyces cerevisiae expert with a specific emphasis on the oxidative stress response. Dr. Martins was instrumental not only in the design of experiments (as in Chapter 2) but also in discussing the results, their implications for yeast physiology, and their validity and impact in the community.

Contents

Acknowledgments

Attribution

1 Introduction
  1.1 Abstract
  1.2 Introduction
  1.3 Systems biology: biology in the post genomic era
    1.3.1 Holism and the birth of systems biology
    1.3.2 The complexity of biology unleashed
    1.3.3 Modules and parts lists
  1.4 Simulation and modeling in the life sciences
    1.4.1 Historical overview
    1.4.2 Bottom-up versus top-down modeling
    1.4.3 Data limitations and modeling
  1.5 Reverse engineering for the “omics”
    1.5.1 Inference of gene regulatory interaction networks
    1.5.2 Availability vs. applicability
  1.6 Data analysis for the “omes”
    1.6.1 Correlation analysis in “omics” research

2 On the origin of strong correlations in metabolomics data
  2.1 Abstract
  2.2 Introduction
  2.3 Methods
    2.3.1 Theoretical
    2.3.2 Computational
    2.3.3 Yeast model expansion
  2.4 Discussion
    2.4.1 Metabolic control analysis and correlations
    2.4.2 Scatter plots and correlation
    2.4.3 Simulations
  2.5 Yeast metabolism: model expansion and correlations
    2.5.1 Simulations and model validation
  2.6 Conclusions

3 Comparison of reverse engineering methods
  3.1 Abstract
  3.2 Introduction
    3.2.1 In silico networks
    3.2.2 Reverse engineering algorithms
    3.2.3 Benchmarking and reverse engineering
  3.3 Model
    3.3.1 Computational
    3.3.2 Genetic perturbations
    3.3.3 Environmental perturbations
    3.3.4 Adding noise
    3.3.5 Data requirements
    3.3.6 Method evaluation: measures of correctness
  3.4 Results
    3.4.1 Gene network
    3.4.2 Artificial biochemical network
  3.5 Discussion

4 Reverse engineering biological networks by least-squares fitting
  4.1 Abstract
  4.2 Introduction
    4.2.1 Reverse engineering gene networks by least squares fitting
  4.3 Methods
    4.3.1 Computational
    4.3.2 Model
    4.3.3 Fitting the model to the data
  4.4 Results and Discussion
  4.5 Conclusions

5 Beyond reverse engineering: applications to experimental data
  5.1 Abstract
  5.2 Introduction
    5.2.1 The yeast response to oxidative stress
  5.3 Methods
    5.3.1 Experimental setup
    5.3.2 Data preparation
    5.3.3 Computational approach
    5.3.4 Yeast regulatory network
  5.4 Results and Discussion
    5.4.1 Revamping the inference rules
    5.4.2 Yeast response to oxidative stress
    5.4.3 Coping with reality
  5.5 Conclusions

6 Looking ahead: Future research
  6.1 Abstract
  6.2 Back to the future: Part I
  6.3 Reverse engineering life
    6.3.1 Linear vs. non-linear worlds
    6.3.2 Speedy delivery
    6.3.3 Specifying the approximation
  6.4 Back to the future: Part II
    6.4.1 Modular biology and systems approaches
  6.5 Back to the future: Part III

Bibliography

List of Figures

1.1 Number of publications in systems biology and reverse engineering
1.2 The systems biology cycle
1.3 Complexity of biological networks
1.4 The Königsberg bridges problem
1.5 Random, small-world and scale-free networks
1.6 Increasing complexity in Lotka-Volterra models
1.7 Comparison between top-down and bottom-up modeling strategies
1.8 Effects of data availability in the construction of biological models

2.1 Metabolite scatter plots
2.2 Comparison of correlations between metabolite pairs
2.3 Relationship between metabolite scatter plots and co-response profiles
2.4 Changes in correlation and co-response profiles
2.5 Metabolite correlations in yeast exponential growth phase
2.6 Metabolite correlations in yeast post-diauxic growth phase

3.1 Gene network with 10 genes
3.2 Synthetic biochemical network
3.3 Receiver-operator characteristic (ROC) curve for the gene regulatory network
3.4 Accuracy of reverse engineering methods for gene regulatory network
3.5 True positives rate for gene regulatory network and its robustness to noise
3.6 True negatives rate for gene regulatory network and its dependence on noise
3.7 Rate of true positives for the synthetic biochemical network and its dependence on the noise level in the data

4.1 7-gene network to be used to demonstrate the network inference method
4.2 Network inference for 7-gene network
4.3 Improvement of reverse engineering method by addition of knockout data

5.1 Revamping the inference rules
5.2 Application of the reverse engineering method to the yeast response to CHP data
5.3 The yeast regulatory network for response to CHP insults when all available data is used

6.1 The knowledge iceberg

List of Tables

2.1 Changes in metabolite correlations (Spearman) between growth phases
2.2 Comparison of metabolite correlations between model and experiments

3.1 Perturbations applied to the gene network
3.2 Perturbations applied to the biochemical network
3.3 Data requirements in reverse engineering methods
3.4 Confusion matrix for gene networks
3.5 Choosing the right parameters
3.6 Effect of noise in the performance of reverse engineering methods

4.1 Improvement in the measures of correctness as knockout data is added

5.1 Data driven inferences on regulation of gene expression

Chapter 1

Introduction

“But one thing is certain: in order to understand the whole you must look at the whole.” - Henrik Kacser

1.1 Abstract

With the advancement of analytical technologies brought about by the boom in functional genomics, the focus of the life sciences has shifted from small-scale lab experiments to large-scale approaches in which a biological system is studied as a whole. These holistic approaches have revolutionized research in biology and demanded different takes on the analysis of complex data sets. This chapter analyzes the advent of systems biology, gives an overview of modeling and simulation in the life sciences, and tackles some of the problems that emerged with large-scale approaches.

1.2 Introduction

Biochemistry has undergone a profound transformation with the advent of the high-throughput technologies that have emerged from the ‘omics’ revolution. These techniques have caused the amount of experimental data collected to increase sharply, providing new challenges to the experimentalist in terms of what information is being obtained and what knowledge can be taken from these large data sets. An example of the complexity of the data sets being generated is found in a paper by Godon and co-workers (1) on the response of a strain of Saccharomyces cerevisiae, at the protein level, to hydrogen peroxide. The authors identified 115 proteins on a 2-D gel that are affected by the stimulus (the collection of these proteins being termed the ‘stimulon’) and drew some conclusions relating the response to the stress with metabolism. At the metabolite level, Weckwerth and co-workers (2) were able to quantify over 1,200 metabolites in potato leaf using a GC-TOF instrument, which represented an increase of two orders of magnitude in the number of metabolites measured in a single experiment over a similar study performed earlier by Roessner-Tunali and co-workers in potato tubers (3). These advances both in proteomics and metabolomics, allied to the already existent genomics data sets, illustrate the new challenges of biochemistry: how can we make sense of all these data and what information can we extract from them?

Scientific research is driven by the collection of data to solve a given problem. A clear example is the discovery of the double helix structure of DNA by Watson and Crick (4; 5; 6). By a careful analysis of data collected at several labs, and borrowing ideas presented earlier by Linus Pauling (7; 8), Watson and Crick concluded that the only possible structure for DNA would be a double helix with specific torsion angles for the major and minor grooves of the helix. 
But if the data provide the final answer (or bits and pieces of it), theory and experiment are essential before and after the data are collected. A good theoretical framework will lead to the proposal of experiments that will produce the data that will prove or disprove, and therefore improve, the initial theory. It is in this setting that research in the life sciences has proceeded for many centuries. A given problem is selected, a theory formulated, an experiment performed and data collected. The analysis of these data will confirm or deny the initial theory. This approach is known, in philosophy, as confirmation holism, which states that no single scientific theory can be tested in isolation, being always dependent on other theories and hypotheses (as in Wikipedia, http://www.wikipedia.org). A classical example is the discovery of the planet Neptune based on Newton’s gravitational law, where the study of the planet Uranus and its orbit led to the discovery of Neptune due to inconsistencies in the initial theory. The concepts of holism will be important for the discussion on systems biology later on (see Section 1.3).

The complexity of problems in the life sciences is astounding, from the proposal of mechanisms (9; 10; 11; 12; 13) to the identification of the gene that causes cystic fibrosis (14; 15; 16) or of how the p53 protein is involved in cancer (see 17, for a recent review on the diverse roles of p53). These studies, though intensive, deal with a sub-system of the whole biological system. A paper by Lazebnik (18) provides an amusing and yet interesting view on how a biologist would approach the problem of fixing a radio, and compares it with how an engineer would do it. The main conclusion that can be drawn is that the biologist has a keen interest in exploring each and every individual piece of the radio and how they work, and puts the radio together based on assumptions that are made along the way. 
This approach implies that we study the whole (the radio) by knowing what the parts (the components of the radio) do. However, the concept of modularity borrowed from engineering presents itself as possibly the best option in view of the large complexity that biology presents. A discussion on modularity in biology is left for later (see Section 1.3.3). Functional genomics (19; 20) brought a new level of complexity to the life sciences. The amount of data that microarrays (21), proteomics (1; 22), metabolite fingerprinting (23) and metabolite profiling (3; 24; 25; 26) generate allows one to study the biological system as a whole. This new look at the life sciences requires that the classical approaches be revised and revamped, bringing in new analysis methods for multi-dimensional data sets. Using the radio analogy again, functional genomics allows one to approach the life sciences from a new angle, where one must look at the whole (the radio) in order to understand how it works. The collection of information (data) about the radio will allow for some understanding of how its parts work.

1.3 Systems biology: biology in the post genomic era

1.3.1 Holism and the birth of systems biology

From the Greek holos, for all, holism is concerned with the study of systems as a whole rather than as a collection of parts (as in Webster-Dictionary online, http://www.m-w.com). As Aristotle would point out, “the whole is more than the sum of its parts” (27). In the life sciences holism has traditionally been abandoned in favor of a reductionistic approach in which one would study a single ‘part’ and make predictions as to how this part would fit in the ‘whole’. Cases in which one studies a particular gene or protein associated with a specific disease or systems response abound in the literature. However, some attempts to implement holistic rather than reductionistic approaches, and to direct the attention of researchers to the need for them, have been made (28; 29). This transition from reductionistic to holistic approaches was presented as one of the major challenges that the life sciences would face (30). With the increase in the amount of data collected came the need to improve techniques for their analysis, and a desire to move from a reductionistic biology to a holistic one. This new look at biology, which has become known as systems biology, attempts to give a more generalized, but nevertheless complete, view of a biological system, its goal being the assessment of how all of its elements interact and relate to each other (31). The understanding of the biological system at a systems level is fundamental in systems biology (32) and can be achieved by analysis and insight of properties of the system (33), namely (i) how the system is structurally organized, (ii) how the system behaves dynamically, (iii) what the control mechanisms of the system are and (iv) how a system so complex can be improved by careful design. To achieve this level of understanding, systems biology sets its foundation on multiple disciplines (32; 33). 
These disciplines span the computational, theoretical and experimental realms, and their interaction, making use not only of the data collected in the lab but also of mathematical models and statistical analyses of the data, is the backbone of systems biology. Although systems biology is being touted as a new and emerging field, with direct implications in biomedical engineering and drug targeting (32; 33; 34), the concepts are not new. As an example, under the concepts of metabolic control analysis, Henrik Kacser (29) proposed that a full realization of how a biological system behaves can only be obtained if one looks at the system as a whole rather than focusing on specific parts. Technical difficulties, allied to rudimentary computers, made the implementation of these holistic views infeasible. The study of the system as a whole requires (i) the ability to measure all of the components of a given organism, (ii) the ability to store this information for analysis and (iii) the computational power to perform the analyses and carry out simulations of mathematical models. These issues could not be addressed when Kacser introduced his holistic view of the life sciences. However, developments in experimental and technical capabilities of the past decade have allowed, for the first time, the collection of large data sets. This has in turn created the grounds for a holistic approach to the study of biological


Figure 1.1: Number of publications in systems biology and reverse engineering. In the past 16 years there has been an almost exponential growth in the number of publications that are indexed as systems biology by PUBMED. Between January 2005 and August 2006, the number of publications (in English) in this database with the keyword “systems biology” alone was 604. The number of publications in reverse engineering, another area of intense research in biology in the advent of “omics”, is also depicted for comparison.

systems and has enabled the field of systems biology to expand. The boom in systems biology coincided with the announcement of the opening of the Institute for Quantitative Systems Biology in 1999 by Leroy Hood (35), followed by the 1st International Conference in Tokyo in 2000 (36). A paper dating back to 1993 (37) is the first in PUBMED to use the keyword “systems biology” in its abstract (and for several years the only one). Figure 1.1 shows how the number of publications in systems biology has grown in the past 16 years. The rate at which publications in systems biology emerge makes it one of the hottest topics in the life sciences today. Inter-disciplinarity is not a new concept in science, and the life sciences rely heavily on quantitative disciplines like mathematics and statistics. In fact, a special issue of the journal Methods in Enzymology, published in 1992, in which several issues on the use of statistics and statistical analysis are discussed (see, e.g., 38; 39), highlights the importance of such disciplines in the life sciences. The predictive power of computer simulations is also not new: the first computational model of a biochemical event was described by Chance (40) for the peroxidase enzyme. The concept of a systems analysis of a given problem has been attributed to Bertalanffy (41), who showed that the solution to certain ‘systems’ relies on the study of the ‘wholeness’ rather than the parts, be it in biology, economics or social science. Therefore, one cannot isolate systems biology from general systems theory. The revamped systems biology emerged as a consequence of the large amounts of data collected through ‘omics’ projects (42; 43). These large data sets are obtained from a biological

Figure 1.2: The systems biology cycle. In systems biology there is a close interaction between experimental biology and theoretical/computational biology. A theory drives the experimental biologist to confirm or reject it through a set of experiments. These experiments are used to formulate mathematical models of the system, which will eventually lead to an improvement or reformulation of the starting theory. Of course, the theoretical framework may also lead to computational studies and simulations that will, in turn, be confirmed or refuted by a set of appropriately designed experiments. These experimental results will help in confirming or refuting the original theoretical framework.

system that illustrates some condition of interest. The observations, consisting of large-scale data, reflect how this system reacts as a whole to this condition, therefore hinting at regulatory and organizational aspects of the system itself. Because the focus is on the whole rather than on the parts, there is a need to depart from the previous concept of ‘key players’ of any given system. Instead, there needs to be an unbiased analysis of the system. This is one of the major contributions of systems biology: there are no a priori assumptions of what the important components of the system could be, as these will emerge from the analysis of the global system’s response as a whole to perturbations. The ultimate goal is the generation of a predictive model that can be used to approach problems such as drug discovery (44), improve metabolic engineering (45) or help understand secondary metabolism (46; 47). Systems biology makes use of quantitative disciplines such as statistics and mathematics, the predictive power of computer simulations and modeling, and the experimental expertise of biochemistry and biology. This inter-relationship was popularized by the yin-yang depiction of systems biology made by Kitano (32) (see Figure 1.2). This relationship between all disciplines has no particular order to be followed other than that of the interest and focus of the researcher. Nevertheless, if one wishes to solve a problem under the “systems biology” scope, one must realize that, more than ever, science must be undertaken in a collaborative manner (42). The choice of which of the main branches of systems biology (experimental, computational, or theoretical) should constitute the basis of a solid and fruitful research project is left to the researcher, keeping in mind that the foundations for successful experiments and computational models are normally set on a strong theoretical framework.


Figure 1.3: Complexity of biological networks. A. Traditional representation of glycolysis; B. Abstracting from the flow-like structure of the pathway, the actual representation of glycolysis and the pentose phosphate pathway is much more complex. Enzymes are painted grey. Dashed arrows are representative of processes that are not depicted. Key: HK, Hexokinase; PGI, Phosphoglucoisomerase; PFK, Phosphofructokinase; ALD, Aldolase; GAPDH, Glyceraldehyde 3-phosphate dehydrogenase; G6P, Glucose 6-phosphate; F6P, Fructose 6-phosphate; FBP, Fructose 1,6-bisphosphate; GAP, Glyceraldehyde 3-phosphate; DHAP, Dihydroxyacetone phosphate; PYR, Pyruvate; EtOH, ethanol.

1.3.2 The complexity of biology unleashed

Biology is, without a doubt, highly complex. From the organization of proteins in a cell to the social interactions found in populations of individuals, the complexity of biology is staggering. Nevertheless, this complexity is, more often than not, misrepresented in order to facilitate the understanding of the processes that are taking place. One example is the representation of metabolic networks as linear schemes, where one metabolite is transformed into a second metabolite through the action of a protein and, if needed, some co-factor. Figure 1.3 shows a common depiction of the glycolytic and pentose phosphate pathways, as one would find in a textbook. Certain metabolites, like ATP, ADP, NAD, NADH, NADP or NADPH, are co-substrates in countless reactions. Therefore, their role in metabolism is more “central” than that of simple side metabolites needed for those reactions. Freeing oneself from the simplicity exemplified by the metabolic map of Figure 1.3A, and explicitly representing the co-factors as metabolites in their own right, the picture of metabolism becomes quite different (see Figure 1.3B). Abstracting from such a high level of complexity and simplifying the problem is helpful in some instances but deceiving in others. While we perceive a pathway as a sequence of events, be it in terms of carbon flow as in the depiction of glycolysis in Figure 1.3A or in terms of proton transfer as in the depictions of photosynthesis or respiration, the human brain has a harder time deciphering the intricacies of a highly complex network such as that presented in Figure 1.3B. A similar conclusion was highlighted by Garfinkel (48), who emphasizes that a mathematical model of a biological process should be kept as simple as possible, as complex models will lead to fuzzy conclusions (see also 49; 50; 51, for a discussion on the need of mathematical models to explore complex systems). 
Nevertheless, one must accept that simplification does not always provide the most evident answer: applying any variation of Occam’s razor in biology is dangerous, as one will, more often than not, fail to see the “big picture” by carefully examining highly detailed “frames” of a particular event. Systems biology attempts to address this issue by collecting snapshots of the “big picture”, as described earlier.

How are we organized?

The study of the complexity of networks has met with increasing interest in the life sciences. The studies of Albert-László Barabási’s group on the topology of networks (52) are among the most highly cited in network analysis. Research on network topology, and attempts at maximizing the efficiency of metabolic processes for metabolic engineering applications, have been addressed before (53). The interest in the topology of metabolic networks for the improvement of metabolic engineering applications has led to the study of the robustness of networks to the parameters of the biochemical processes (54) or to the deletion of nodes of the network (55). These networks give some indication of the complexity of biology, where it can be observed that metabolites such as pyruvate are highly connected (56), therefore participating in a large number of reactions. A discussion on the meaning of robustness is beyond the scope of this work. However, for the purpose of completeness, robustness is defined, with respect to networks, as the capacity of a given network to withstand structural damage to its topology. Therefore, a network is said to be robust if deleting some nodes does not affect the overall properties of the network, and is said to be weak otherwise. Discussing networks leads to a brief discussion on the organization of those networks and what kind of topology best reflects the observations from biological systems. Barabási’s book Linked (57) popularized the concept of networks by explaining popular concepts such as the Bacon distance and the six degrees of separation. But how are networks organized? The organization of complex networks, such as the Internet, social networks, food webs or metabolic networks, is highly diverse. 
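This working definition of robustness can be made concrete with a small numerical experiment, in the spirit of the node-deletion studies cited above. The sketch below is a minimal illustration using only the Python standard library (the function names and the ring-graph example are ours, not taken from any cited work): it deletes a random fraction of nodes and reports what fraction of the original network remains in the largest connected component, one simple "overall property" to monitor.

```python
import random
from collections import Counter

def giant_component_fraction(nodes, edges, total):
    """Fraction of `total` nodes that sit in the largest connected
    component of the (sub)graph given by `nodes` and `edges`."""
    if not nodes:
        return 0.0
    parent = {v: v for v in nodes}
    def find(v):                       # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v in edges:                 # merge the two endpoints' sets
        parent[find(u)] = find(v)
    return max(Counter(find(v) for v in nodes).values()) / total

def robustness_curve(nodes, edges, fractions, seed=0):
    """For each deletion fraction f, remove a random f of the nodes
    (and their incident edges) and measure the surviving giant component."""
    rng = random.Random(seed)
    curve = []
    for f in fractions:
        kept = set(rng.sample(sorted(nodes), round(len(nodes) * (1 - f))))
        sub = [(u, v) for u, v in edges if u in kept and v in kept]
        curve.append(giant_component_fraction(kept, sub, len(nodes)))
    return curve

# A 100-node ring: random deletions quickly fragment it.
ring_nodes = set(range(100))
ring_edges = [(i, (i + 1) % 100) for i in range(100)]
print(robustness_curve(ring_nodes, ring_edges, [0.0, 0.1, 0.3]))
```

Running such a curve for a sparsely connected lattice versus a hub-dominated network would illustrate the point made by the cited studies: topologies with highly connected nodes tolerate random deletions quite differently from regular ones.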
Originally developed by Euler in 1736 to solve the Königsberg bridge problem (58) (which asked whether one could cross the seven bridges of Königsberg without crossing the same bridge twice – see Figure 1.4), graph theory has been extensively developed in mathematics and its concepts have been used to study the properties of networks. In its applications to sociology, biology and computer science, the discussion lies in which structure best defines the problem at hand. A paper by Dorogovtsev and Mendes (59) describes in some detail how networks evolve, from networks of scientific citations to communication networks.

Figure 1.4: The Königsberg bridges problem. The town of Königsberg had 7 bridges crossing the Pregel river (A), connecting the river banks and islands (A, B, C and D). Euler's solution (B) showed that one could not cross all the bridges in a single trip without crossing at least one of them more than once: such a walk is possible if and only if the graph has exactly zero or two nodes of odd degree. This is now known as an Eulerian path or Euler walk. Since the Königsberg graph has 4 nodes of odd degree, it cannot have an Eulerian path.

Three kinds of network topologies are usually described: random networks, small-world networks and scale-free networks. The concept of random networks was introduced by Erdős and Rényi (60) (Figure 1.5A). These networks are characterized by a fixed number of nodes, N, in which any two nodes are connected with a probability p, yielding an average number of edges equal to pN(N − 1)/2 (59). For large numbers of nodes this type of network exhibits a Poisson distribution for the connection degree of each node. Similar to random networks are small-world networks. These networks, first proposed by Watts and Strogatz (61), can be constructed from a regular lattice by randomly rewiring some of the links between nodes in the network (Figure 1.5B). Scale-free networks (Figure 1.5C) have been proposed by Barabási et al. (62; 63; 64; 65; 52; 56) and have been shown to match biological networks. Scale-free networks show a power-law distribution for the node degree and exhibit a property known as preferential linking: any new link in the network has a higher probability of being drawn to a node with a high degree than to one with a low degree. Examples of this kind of network are the Internet and the network of airport hubs: a popular web-site or a busy airport will have more “links” than a personal web-page or a regional airport. The discussion on how biological systems are organized spans numerous publications. While some believe that metabolism shows a small-world organization (66; 67), others believe that the organization obeys a scale-free type of structure (see 52; 56). To add to the mix, a recent paper by Tanaka et al. (68) highlights the possibility of biological networks being “scale rich”, contrary to the “scale free” concept, basing the analysis on protein-protein interaction networks that fail to show the power-law node degree distribution characteristic of scale-free networks.
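Going back to Figure 1.4, Euler's criterion is simple enough to check in a few lines of code. In this sketch (pure Python; the exact bridge-to-bank assignment is an illustrative reconstruction consistent with the classical degrees of one land mass of degree 5 and three of degree 3), an Eulerian path exists only if the multigraph has exactly zero or two odd-degree nodes:

```python
# Euler's criterion applied to the Königsberg multigraph; land masses
# are labeled A-D as in Figure 1.4, one tuple per bridge.
from collections import Counter

bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd_nodes = [node for node, d in degree.items() if d % 2 == 1]
# An Eulerian path requires a connected graph with 0 or 2 odd-degree
# nodes (connectivity holds here by inspection).
has_eulerian_path = len(odd_nodes) in (0, 2)
print(dict(degree), has_eulerian_path)
```

Since all four land masses have odd degree, the program confirms that no Eulerian path exists.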
The answer remains open, but the bottom line is that the organization of biological, sociological and ecological networks, to mention just a few, is


Figure 1.5: Random, small-world and scale-free networks. Examples of random (A), small-world (B) and scale-free (C) networks. Each network has 50 nodes and 100 edges. As described in the text, A and B are very similar in their properties. In the scale-free network (C), a small number of nodes with a large number of connections to other nodes emerges, while the majority of the nodes have only a few links.

more complex than may be believed at first, and only intense research will allow us to unveil the real topology of such networks. One must realize that whichever network(s) is drawn to represent a biological system is a mere abstraction aimed at simplifying the underlying complexity, which spans several levels of organization (69).
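As an aside, the pN(N − 1)/2 expectation quoted above for random networks is easy to verify numerically; the sketch below uses arbitrary illustrative values for N and p:

```python
# Numerical check of the expected edge count pN(N-1)/2 for an
# Erdős-Rényi random graph.
import random

random.seed(1)
N, p = 400, 0.02
edges = [(i, j) for i in range(N) for j in range(i + 1, N)
         if random.random() < p]

expected_edges = p * N * (N - 1) / 2      # 1596 for these values
mean_degree = 2 * len(edges) / N          # should approach p(N - 1)
print(len(edges), expected_edges, mean_degree)
```

The empirical edge count and mean degree land close to the analytical values, and for large N the node degrees follow the Poisson distribution mentioned above.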

1.3.3 Modules and parts lists

The study of the biological system as a whole has brought challenges that, even when apparent, were not immediately tackled. Our ability to deal with highly complex problems is innately limited (70). Nevertheless, we are able to understand small complex systems and build even larger systems based on those small ones. An example would be the construction of a radio, as illustrated by Lazebnik (18). What allows us to construct such complicated apparatuses is the knowledge of how the smaller parts work together and how they can be assembled into a larger, more complicated device that will eventually play some music. Therefore, an electrical engineer will have a list of the “parts”, some of which may be combined into “modules”, that are necessary to build a complex radio. If one could translate the electrical engineering jargon into biology, then one could “assemble” (or engineer) biological systems with the same ease as an electrical engineer builds a radio, provided that we have all the necessary “parts” as well as a schematic diagram of how these “parts” and “modules” interact with each other. I will come back to this issue later when discussing reverse engineering in the life sciences. Though this may seem an obvious problem to tackle, it is not an easy one to solve. Endy (71) argues that the difficulty in successfully engineering biological systems may come from the fact that biological design was not optimized to suit human understanding, tinkering and engineering. Biological systems pose added challenges that are not met by the electrical engineer, in the sense that (i) they are much more complex, (ii) they present variation in their behavior and (iii) organisms evolve. Nevertheless, successful circuits can be constructed. Hasty et al. (72) show how one can engineer small gene circuits based on ideas and concepts from electrical engineering.
If the engineering assumptions hold, these simple gene circuits (or gene modules) can be integrated to construct a more complex system, with important consequences of such technologies for biomedical applications. Although these approaches may hint at the need for some reductionism, thereby contradicting the need for holistic approaches to biology as seen with systems biology, this is not the case. The tools and concepts that stem from systems biology allow for devising experimental setups that focus on the understanding of biological systems as a whole. The biological engineering approaches allow for a mapping of the “parts list” that can be found in biological systems, thereby reducing the complexity of the system and introducing the notion of modules comprised of parts, without hindering the holistic approach. The current state of the art in computational tools for modeling and simulating dynamical systems, as well as the currently available tools for the analysis of complex data sets, allied to the aforementioned human inability to deal with large numbers of variables, indicates that our understanding of a biological system is most likely to be resolved in terms of modules, with an educated guess as to what the function of each module is likely to be. The holistic view of systems biology will provide an unbiased analysis of the system's response to a perturbation, while a modular approach will provide an overall understanding of that response. Subsequent studies on individual modules will then aid in detailing the systemic response at a finer scale. As pointed out by Hasty et al. (73) while discussing gene network modeling, it is necessary to characterize small subsystems on which hypotheses will be formulated, to generate mathematical models of the interaction of these subsystems and, finally, to devise experimental procedures that will help in the process of formulating the hypotheses.
These are the foundations of systems biology, set at a larger scale.

1.4 Simulation and modeling in the life sciences

The use of computers in biological research has evolved to more than the mere use of word processors and spreadsheets. Powerful computations to solve complex biological problems have been pursued since the first use of an analog computer to simulate the peroxidase reaction, back in 1943 (40). Though the complexity of the system studied by Chance only concerned a single enzymatic reaction, the development of mathematical models of biochemical and biological processes as an extra tool in life sciences research started to develop on a larger scale. This led to an increase in the complexity of the models. In over 60 years of research using computers to model and simulate biological systems, many subjects have been studied. Examples are the study of population dynamics (74) based on the Lotka and Volterra mathematical formulations (75; 76), simulations of the periodic behavior of biochemical networks (77; 78; 79), epidemiology studies (80; 81) and studies on the control of metabolic networks (82), to name only a few. These studies have greatly benefited from the steady increase in computer power, but also from increasing acceptance by top journals. As noted by David Garfinkel (48), the lack of support experienced by researchers interested in studying biological processes with the aid of mathematical models and computer simulations led to a rocky start and shaky acceptance of this kind of work within the community. Only with the realization that computer simulations and modeling can help explain the avalanche of ‘omics’ data, and the complex behaviors therein, has simulation become more accepted. In the systems biology view, mathematical modeling and computer simulation play a crucial role in the identification and comprehension of a biological system, on a par with the experimental work done in laboratories (see Section 1.3).
Therefore, it is important to understand what has been done in the past and what issues have already been tackled. We do not need to re-invent the wheel: we just need to make the wheels turn in the same direction.

1.4.1 Historical overview

Before discussing the applicability of modeling to the biological sciences, one should discuss the concepts of modeling. A model is defined as a representation of a system, often mathematical, that reflects its relevant properties. The mathematical frameworks that can serve as the basis for models are diverse, ranging from ordinary differential equations (see 83; 84; 85, as examples) to partial differential equations (e.g. 86) and discrete mathematics (see, e.g., 87). Any of these frameworks should explain the same system in a similar fashion in order for a particular model to be considered valid. As will be addressed later, it is also important to realize that the modeling framework used, as well as the simulation engine, should produce results similar to those observed under experimental conditions. Inconsistencies between model results and experimental observations mean inconsistencies in the model(s) and require careful examination to fully understand why the results differ. As a brief example, when studying the organization of genetic networks, Kauffman (88) used a Boolean approach, in which genes can be in one of only two states (“on” – when the state of the gene is 1 – or “off” – when the state of the gene is 0), much like a switch in an electric circuit. As in a digital circuit, the connections between the genes will determine the states that each gene may take. However, because the system is discretized both in time and in the values that the variables may take, the dynamical behavior observed may not correspond accurately to the actual observations made under experimental conditions. Nevertheless, several publications can be found in which conclusions on the behavior of the dynamical system are drawn from Boolean models of that system (89; 90; 91). Mathematical models have been steadily used in the life sciences.
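As an illustration of the Boolean framework described above, the following sketch simulates a tiny synchronous Boolean network; the three-gene wiring is invented purely for illustration and is not taken from Kauffman's work:

```python
# Kauffman-style Boolean gene network: three genes, states 0/1,
# synchronous updates; the rules below are invented for illustration.
def update(state):
    a, b, c = state
    return (
        b and not c,   # gene A: activated by B, repressed by C
        a,             # gene B: follows A
        a or b,        # gene C: needs A or B
    )

def normalize(state):
    return tuple(int(bool(x)) for x in state)

# Iterate from an initial state until a state repeats, then read off
# the attractor (here the trajectory settles into the fixed point 0,0,0).
state, seen = (1, 0, 0), []
while state not in seen:
    seen.append(state)
    state = normalize(update(state))
attractor = seen[seen.index(state):]
print(seen, attractor)
```

Because the state space is finite (2^3 = 8 states here), every trajectory must eventually revisit a state and settle into a fixed point or a cycle, the attractors on which conclusions about the dynamics are drawn.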
The population dynamics models of Lotka and Volterra were mathematical models based on experimental observations

on how two interacting species (typically a predator and a prey) develop over a period of time. The Lotka-Volterra equations, as they are known, are extremely simple. However, these equations can still explain a highly complex dynamical system found in ecology. These are given by (74):

dA/dt = αA − βAB    (1.1a)

dB/dt = γAB − δB    (1.1b)

where A represents the prey species, B the predator species, and where α, β, γ and δ represent the rate of birth of the prey, the rate of death of the prey, the rate of birth of the predator and the rate of death of the predator, respectively. Such a simple model explains the periodicity observed in interactions of predator-prey communities. Notice that external factors are not included (nor, in fact, needed) in order to explain the periodic behavior. Also, the food chain is resolved to only 2 species. The model can, of course, be expanded to include such factors as the periods of inactivity of the predator, the rate of birth of the prey being influenced by the presence of the primary food source, or its capture by superpredators. An example would be:

dS/dt = k_growth S − k_1 SA − k_2 S    (1.2a)

dA/dt = k_bA A − k_dA1 AB − k_dA2 AC − k_dA3 A    (1.2b)

dB/dt = k_bB1 B + k_bB2 AB − k_dB1 BC − k_dB2 B    (1.2c)

dC/dt = k_bC1 C + k_bC2 AC + k_bC3 BC − k_dC1 C    (1.2d)

where S represents the primary food source (e.g., grass), A the primary predator species (e.g., an herbivore), B a secondary predator (a carnivore) and C a superpredator (e.g., a necrophage), and where the k's represent the various rates of birth (k_bi) and death


Figure 1.6: Increasing complexity in Lotka-Volterra models. The original model of predator-prey interactions (see 74) only accounts for the presence of those 2 interacting species (A), as presented in equations 1.1a-1.1b. However, one can expand the model of the population dynamics by including more factors (or species), as given by equations 1.2a-1.2d (B), adding to the complexity of the system.

(k_di) of each of the species, and their interaction rates. A study similar to this one has been presented by Garfinkel and Sack (92), emphasizing the levels of complexity that can be accomplished with simple manipulations and extensions of the Lotka-Volterra model. Some variations of models like this one have been used to describe epidemic outbreaks (see, e.g., 93). Garfinkel argues that complex models are necessary, owing to the high complexity of biological systems (51). He argues that if one attempts to model a complex system by assembling simple systems together, it may take the same amount of work as it would to build the complex model in the first place. However, a model cannot be as complex as the system under study, due to experimental procedures that do not focus on every single detail (50). Using Garfinkel's analogy, simulation and computer modeling resembles solving a crossword puzzle: a model will have bits and pieces about which we have no information, and these gaps can only be filled by analyzing the results of the model, which may hint at processes that were overlooked in the course of the experimental design. However, the level of complexity should be adjusted to the problem being addressed. If one is studying how enzymes bind metabolites and drug targets, the complexity should be set at the level of atomic interactions and complex molecular dynamics. If the object of study is the response of an entire population to a vaccine, the complexity cannot be set at the same level and should be kept at population-related measures (e.g., number of births, number of deaths, or survival rate). The use of mathematical models to predict, analyze and describe events in biological systems became practical with the advent of computers. A paper by Britton Chance (40) was the first to describe the study of the kinetic parameters of an enzyme (peroxidase) with the aid of an analog computer.
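Returning to equations 1.1a-1.1b, their periodic behavior is easy to reproduce numerically. The sketch below integrates the system with a fourth-order Runge-Kutta step; the parameter values, initial populations and step size are illustrative choices, not taken from the text:

```python
# Numerical integration of the two-species Lotka-Volterra model
# (equations 1.1a-1.1b) with a fourth-order Runge-Kutta step.
def lotka_volterra(state, alpha=1.0, beta=0.5, gamma=0.2, delta=0.6):
    A, B = state
    return (alpha * A - beta * A * B,     # prey:     dA/dt
            gamma * A * B - delta * B)    # predator: dB/dt

def rk4_step(f, state, h):
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + h * k for s, k in zip(state, k3)))
    return tuple(s + h / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

h, steps = 0.01, 5000
trajectory = [(2.0, 1.0)]                 # initial prey and predator levels
for _ in range(steps):
    trajectory.append(rk4_step(lotka_volterra, trajectory[-1], h))

prey = [a for a, _ in trajectory]
# The prey population oscillates around the coexistence level δ/γ = 3
# rather than settling to a fixed value.
print(min(prey), max(prey))
```

No external forcing is present: the periodicity emerges purely from the interaction terms, which is exactly the point made in the text.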
Though modern computers can handle systems of significantly larger complexity (see 79), this paper opened the door to the use of computational tools to solve mathematical models of biological systems. Chance would publish several papers highlighting studies of enzyme kinetics using analogue and digital computers (see, e.g., 94; 95), setting the groundwork for the development of computational life sciences research. Following Chance's pioneering work, David Garfinkel was one of the major driving forces of the field of computational biology. Showing a great interest in the problems of metabolic compartmentation (see, e.g., 96; 97; 98), his work also focused heavily on the development of good modeling techniques (48; 50; 49), parameter estimation (99), artificial intelligence (100), the development of modeling software (101; 102) and on establishing the need for computational biology approaches in modern life sciences research (103). His work is often overlooked in modern computational systems biology, with re-discoveries of issues that have long been addressed. As an example, one of the continued struggles, the devising of standardized modeling practices (51), has recently been addressed by the MIRIAM consortium (104), a group of researchers in computational biology concerned with the standardization of modeling practices in much the same way as the MIAME standards (105) are followed for microarray experimental design. One of the areas of research that has relied heavily on the use of mathematical models and computer simulations has been the study of periodic behavior in metabolism, namely in the glycolytic pathway (106), along with the study of pattern formation in developmental biology (107) and the analysis of the innate periodicity underlying circadian rhythms (108).
An incidental observation made by Duysens and Amesz (109) of an oscillatory behavior of the NAD+/NADH pool was met with interest by the community, with Britton Chance (110), Benno Hess (78; 111), Joe Higgins (112) and Ilya Prigogine (113) leading a vast number of researchers in the pursuit of an explanation for the origin of periodic behavior in biological systems. The oscillatory behavior observed hinted at the possibility of a metabolic state in which the levels of glycolytic metabolites would show periodicity. These predictions were confirmed experimentally (see 110; 77; 114) and gave rise to a detailed analysis of the phenomenon of periodicity in biochemical networks, both experimentally and computationally. Aside from models aimed at explaining the overall emergence of oscillatory patterns in glycolysis (115), mathematical models allowed an in-depth study of the transitions between oscillatory states of glycolysis (116; 117), namely the transition into chaotic behaviors (118; 119; 120). The study of periodicity in biological systems still sparks a great deal of interest. The distribution of control over oscillating pathways (121; 122) and analyses of the response of cell cultures to perturbations applied during the oscillatory regime (123), together with studies on how cultured cells propagate the oscillatory signal to neighboring cells (124), led to the proposition of very finely detailed models of glycolysis under oscillatory conditions (79).

Why is it necessary?

Skeptics of the use of computational approaches in biology ask why modeling and simulation are necessary and what can actually be gained from these approaches that could not be obtained as easily in the lab. The answer has two parts. On the one hand, not all conditions can be tested in the lab. The requirements to perform experiments in a lab, from manpower to the cost of equipment and materials, hinder the possibility of testing all the possible outcomes that may be attained from a biological system. The use of modeling and computational approaches solves this problem. Not only is it cheaper to perform simulations, but one can also simulate conditions that are otherwise unattainable under experimental conditions. Substrate concentrations, toxic chemical effects, or the deletion of species from the system can all be simulated, and the results confirmed, or denied, in experimental settings (48). The modeling approach will provide hypotheses that will be tested and verified or disproved in the biological system. This side of computational biology is crucial in the spirit of systems biology (see Section 1.3). On the other hand, the analysis of complex data requires the use of computational approaches. Psychological studies show that the human mind can only handle a small number of variables (70). As the complexity of a problem rises, the informational content that we can assimilate decreases (70; 125). This limitation in our innate ability to deal with complexity demands the use of tools that will allow us to simplify complexity and make it understandable. This simplification does not mean reductionism of the problem, though. As emphasized by Anderson (126), simplification to fundamental laws does not imply that we will become able to reconstruct the Universe starting from those same laws. The use of computational tools, in particular simulations, allows us to overcome the limitations of the human mind (50).
Making the bridge to the amount of data generated by “omics” projects, one can easily see that only with computer simulations and mathematical models can we hope to achieve our goals in understanding biological systems. However, one must always keep in mind that a model of a system will never be the actual system but only an approximation (127) and, also, that the predictions generated by the model, and the model itself, are only as good as the data used to generate that model (48). Simply put, if the data used to generate the model are junk, the model will be junk.

1.4.2 Bottom-up versus top-down modeling

Modeling of biochemical networks or biological systems can be done with an assortment of different methodologies. These methodologies range from continuous-time models (such as those of 86; 83; 84, for example) to discrete-time models (as in 87). The aim of any of these approaches is the construction of a mathematical model that can explain, predict and simulate the behavior of a biological system. Even if mathematical modeling has been around for quite some time, it was only recently, with the advent of systems biology (see Section 1.3), that mathematical modeling of biological systems gained widespread interest. There are two distinct schools of thought on how mathematical models can be built (Figure 1.7). Nevertheless, the end result (the model) should behave in the same way and produce the same results. The first modeling approach that can be distinguished is what is called “bottom-up” modeling. In this approach the researcher constructs the model by using a priori knowledge about the known parts of the system, then assembling the complete model by combining the various parts. Essentially, this goes from relatively simple models of the isolated parts (enzymatic reactions) to larger and more complex models of pathways. The type of data required for this modeling approach consists of enzyme kinetic assays done individually, which are used to determine the kinetic parameters of enzymes and obtain a very detailed description of particular reactions. Adding each model component (or set of components) – which can be comprised of one or a few very detailed reactions and mechanisms – at each modeling step will allow the construction of a very accurate model of the biological system. This type of approach is exemplified by the cell cycle model built by John Tyson and his collaborators (see 128, for a review). On the other end of the spectrum is what is called “top-down” modeling.
This approach uses information (and data) such as that obtained from functional genomics experiments: it is assumed that not much is known about the parts of the system, but that the combined body of data reflects the complete workings of the organism and will be sufficient to deduce how it is organized internally. Small, relatively simple models are constructed initially, to which more components and more pieces of information are added later, increasing the detail of the model. Examples of this approach are the models of gene regulation generated by reverse engineering methodologies (see Section 1.5).

1.4.3 Data limitations and modeling

Any given model is only as good as the data that are available (48), but it should also be emphasized that the model can only be built based on those data. This means that only the variables that are observed are eligible to be present in the model. In light of functional genomics and systems biology, this implies that a mathematical model of the biological system under study can only account for the variables that are measured experimentally. Figure 1.8 gives a detailed example of this problem. Let us assume one wants to recover a gene network with 7 genes, as depicted in Figure 1.8A, which is affected by 2 perturbations, P1 and P2. In the ideal case, experimental procedures would allow us to measure the relative intensities of the mRNA levels for each of the genes. However, if the experimental setup does not allow for an accurate determination of these transcript levels (say, for example, that the mRNA levels are below the detection limit of the instrument), then cases like those exemplified by Figures 1.8B-C can occur. Looking carefully at the example, one can notice that the failure to identify a gene product will result in the identification of a regulatory network that can be regarded as a subset of the original one, where indirect interactions between genes are identified due to the absence of measurements of another gene. The perceived causal interactions between the genes change as a result of the loss of measurement of a gene product. Therefore, and as discussed previously, the model is an approximation of the real system that can only be improved by expanded experimental observations. This problem in system identification will play a crucial role in the problem of reverse engineering biological

Figure 1.7: Comparison between top-down and bottom-up modeling strategies. Different research groups have adopted different modeling strategies. Classical modeling methods (the so-called bottom-up models) consist of starting from well-known systems, characterizing these small subsystems in great detail and gradually increasing the complexity of the system (in the case presented, by adding extra nodes to the network). Top-down modeling does not assume any knowledge of the system at hand, and the first step consists of obtaining a rough approximation of the system. The model is refined as more information is gathered, clarifying the interactions between entities in the system. The end result should be the same with whichever approach is chosen.


Figure 1.8: Effects of data availability on the construction of biological models. An accurate model of a biological system is dependent on the data available during the model construction step, irrespective of the modeling approach used. If one wishes to build an accurate model of the gene regulatory network in A but lacks information on some of the gene levels due to experimental impediments (B and C), then the model built will correspond to a subset of the real system, to the extent that the data allows.

networks from experimental data (see Section 1.5 and future chapters).

1.5 Reverse engineering for the “omics”

Data collected from experiments must be analyzed properly, so that answers to the hypotheses developed initially can be, at least partially, found. The analysis of these data should result in an increase of our understanding of the biological system. One of the most compelling issues in functional genomics data analysis is the reconstruction, or reverse engineering, of biological network structures from the data. This has been labeled the “holy grail” of functional genomics (67). Doing the same type of PubMed search as before (see Figure 1.1), one can see that at about the same time as the boom in systems biology, interest in solving the inverse problem in the life sciences seems to have caught on as well. Even though not as trendy as systems biology, the need to interpret the data brought on by the functional genomics projects may bring a boom in reverse engineering methods. The success rates of reverse engineering methods vary considerably, depending on the type of data being analyzed and on the amount of data available. The application of these methods under experimental conditions may, therefore, be hindered by these limitations. Nevertheless, such methods help the advancement of new data analysis approaches and encourage bold experimentalists to push the boundaries and apply these new methodologies. All reverse engineering methods rely on the fact that the variables in a biological system are related to each other. The goal, therefore, is to identify the relationships between these variables and, in doing so, identify the network. In more classical cases, these relationships can be studied by analyzing the Jacobian matrix of the system. The Jacobian matrix represents a linearization of a dynamical system around a steady state, in which each entry shows how each variable is affected by each other variable. Generalizing,

 ˙ ˙ ˙  ∂X1/∂X1 ∂X1/∂X2 ··· ∂X1/∂Xn ˙ ˙ ˙  ∂X2/∂X1 ∂X2/∂X2 ··· ∂X2/∂Xn  J =   (1.3)  . . .. .   . . . .  ˙ ˙ ˙ ∂Xn/∂X1 ∂Xn/∂X2 ··· ∂Xn/∂Xn

where the partial derivative ∂Ẋi/∂Xj quantifies how the rate of change of variable Xi depends on variable Xj. In the study of dynamical systems, the analysis of the Jacobian matrix, namely of its eigenvalues, provides important information about the system (see 127, for a detailed review of the mathematical concepts). One of the early experimental settings that allows the reconstruction of a dynamic chemical system, proposed by Bar-Eli and Geiseler (129), shows how small perturbations of the variables of the system around a steady state can help estimate the Jacobian, by assessing the slopes of each of the variables (how they change) when the perturbation is applied. The authors successfully reconstruct the Jacobian matrix and provide a valid experimental setup to achieve that reconstruction. This approach was later exploited and expanded for the reconstruction of chemical systems (130), where the authors apply a perturbation to the system (e.g., an externally applied chemical that is known to affect the network) to reconstruct the mechanism of a chemical reaction system. This method, relying heavily on statistics to determine the causal network between the variables, requires amounts of data that are, to some extent, hard to obtain in biological experimental setups (Adam Arkin, personal communication). An improvement on the method was later proposed by Samoilov et al. (131), in which a different measure of relatedness between the variables is used to assign causality, but this method, as pointed out by the authors, requires even more data than the first one. Therefore, even though the results shown in the respective publications are promising, the application of these methods to biological systems in the current state of experimental capability seems hindered by the data requirements. A short discussion on the amounts of data produced by “omics” experiments is necessary at this point.
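Before that, the perturbation scheme just described can be illustrated on a known linear(ized) system: near a steady state the deviations x obey ẋ ≈ Jx, so perturbing one variable at a time and reading the initial slopes of all variables yields one column of J per experiment. The 3×3 Jacobian below is invented for illustration; in a real setting the slopes would come from noisy measurements rather than an exact rate function:

```python
# Perturbation-based Jacobian recovery on a toy linear system.
J_true = [[-1.0,  0.5,  0.0],
          [ 0.3, -0.8,  0.2],
          [ 0.0,  0.4, -0.5]]

def slopes(x):
    """Instantaneous rates dx/dt = J·x for a deviation vector x."""
    return [sum(J_true[i][k] * x[k] for k in range(3)) for i in range(3)]

eps = 1e-3
J_est = [[0.0] * 3 for _ in range(3)]
for j in range(3):                        # perturb variable j only
    x = [0.0, 0.0, 0.0]
    x[j] = eps
    rates = slopes(x)                     # "measured" initial slopes
    for i in range(3):
        J_est[i][j] = rates[i] / eps      # column j of the Jacobian
print(J_est)
```

For a truly linear system the recovery is exact; for a nonlinear one it holds only for perturbations small enough that the linearization is valid, which is precisely why the experimental data requirements discussed above become so demanding.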
It is commonly emphasized that large amounts of data are now being obtained in labs due to the advancement of high-throughput technologies, be it in genomics experiments with microarrays or metabolomics experiments with chromatography-MS approaches. However, this is a common misconception. The fact that one can obtain expression levels for thousands of genes (21), that hundreds of proteins can be identified on a 2-D gel (1), or that thousands of metabolites can be measured in a GC-MS experiment (2), means that one is able to obtain information about many variables; it does not equate to large amounts of data. Suppose we want to study the response of yeast cells to osmotic stress at the gene expression level. We would be interested in following such a response along a time series, so as to distinguish the genes that could act as early responders from late responders. This could allow, in the end, the proposal of a candidate regulatory gene network for the osmotic stress response in yeast. A literature review would find a few genes that are involved in the stress response of the organism, at either a general level or in a specific response to osmotic stress. Since funds are not unlimited, a subset of these genes would be selected as candidate mutants, and a set of strains (wild type and mutant) would be obtained (let's say 10). Financial constraints would then determine a collection of 15 time points and 5 replicates, for which we would measure gene expression levels with Affymetrix gene chips. We would also need control experiments under the same conditions, but without the added stress. So, now we have 10 strains, 15 time points, 5 replicates, control and perturbation: 1,500 arrays, each with approximately 6,000 genes.
Although this may seem a large amount of data, the information content that can be extracted from each of these arrays for the purpose of our initial study is limited. In the end there are only 150 time points, which is likely insufficient for the goal established. The data have great dimensionality, but not a great number of data points. This toy example may seem exaggerated, but similar studies have nevertheless enabled the generation of mathematical models of osmotic stress response in Saccharomyces cerevisiae (132). Some reverse engineering approaches have tried to make use of the kind of data that came out of "omics" efforts, namely from microarray experiments. As in the toy example presented above, these approaches try to recover a genetic regulatory network or, more accurately, a network of causal interactions between genes that may reflect, with some abstraction, the regulation of gene activity. Where does this abstraction come from? Since we are only looking at a gene regulatory network, nothing can be said about the emergence of protein complexes that activate or inhibit the expression of a given gene. Also not considered are the cases in which a metabolite affects the expression level of a given gene. The interactions that are present in the gene network representation are a projection of all levels of cellular organization (gene, protein and metabolite organization) into the gene space. Therefore, some of the interactions present in the gene network are merely indirect interactions between variables (for a discussion see 133). Another factor that may influence the gene network is the fact that some variables (i.e., genes) may not have been measured, originating spurious links between genes that should not be present.
If gene A activates gene B, which in turn activates gene C, and if the levels of gene B are so low that they are not detected, then the analysis would infer that gene A activates gene C – recall Figure 1.8. Though not entirely wrong, this connection should not be present. However, as I mentioned before, our models can only perform as well as the data provided to them and, therefore, there is always a limitation on what information can be extracted from the data. I will return to this topic later on.

1.5.1 Inference of gene regulatory interaction networks

Gene regulatory networks represent a coarse level of interaction between all the variables of the biological network by a projection into the gene space. Therefore, the reverse engineering of these networks from the data can be seen as a first approach to the modeling of the biological system as a whole. One such method was proposed by de la Fuente et al. (134; 135) and makes use of concepts borrowed from metabolic control analysis (82). More specifically, the method recovers the interaction network between genes by calculating an inverse of the co-response matrix. The co-response matrix O is composed entirely of co-response coefficients, which represent how two independent variables of the system respond to a perturbation of a system parameter. For the purpose of the method, the perturbation corresponds to a variation of the gene expression rate. The co-response is determined from experimental data by calculating the ratio of expression levels between a reference state (or wild type organism) and the perturbed system. In mathematical terms (134; 135):

{}^{v_m}O^{i}_{j} = \frac{\Delta \mathrm{mRNA}_i / \mathrm{mRNA}_i^0}{\Delta \mathrm{mRNA}_j / \mathrm{mRNA}_j^0} \qquad (1.4)

represents a co-response element, where \Delta mRNA_i reflects the change in expression level for mRNA_i between the perturbed condition and the reference state (mRNA_i^0). By inverting the co-response matrix one can obtain the interaction network, R (135):

R = O^{-1} \qquad (1.5)
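The inversion step of Equation 1.5 can be sketched with a synthetic example; `R_true` is an arbitrary invertible toy network, not data from any of the cited studies.

```python
import numpy as np

# Toy illustration of Equations 1.4-1.5: if a full co-response matrix O
# can be measured (n perturbation experiments for n genes), the
# interaction network follows by matrix inversion.
R_true = np.array([[-1.0,  0.0,  0.8],
                   [ 0.5, -1.0,  0.0],
                   [ 0.0,  0.4, -1.0]])

# Co-response matrix implied by this network.
O = np.linalg.inv(R_true)

# Equation 1.5: invert the measured co-responses to recover the network.
R_recovered = np.linalg.inv(O)
print(np.allclose(R_recovered, R_true))  # → True
```

In practice O is estimated from noisy ratios of expression levels, so the inversion is far less well conditioned than in this noise-free sketch.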

The implication is that n experiments are needed in order to recover a network of n genes. Even if one deems that unattainable for the roughly 6,000 genes of Saccharomyces cerevisiae, a network of apparent interactions can be recovered for whichever genes are measured, as shown by de la Fuente et al. (134). Very similar results, with a slightly different approach, were obtained by Kholodenko et al. (136). Following the method of de la Fuente et al., Gardner et al. (137) proposed another method that recovers the gene interaction network by linear regression of the data. The authors recognize that one of the pitfalls in de la Fuente et al. lies in the number of experiments that would be needed to recover the regulatory network, and therefore assume that the network is not fully connected (i.e., the network is sparse). By imposing a pre-defined maximum number of interactions per gene, k, the number of interactions that need to be estimated is reduced. Nevertheless, even with this reduction, the method requires that all of the variables in the system be perturbed. Tegnér et al. (138) explore the ideas behind this method to propose a methodology aimed at maximizing the amount of information that can be retrieved from any given genetic perturbation. Assuming perturbations around a reference steady state, this method attempts to reconstruct the network by carefully selecting the genes that must be perturbed in order to maximize the amount of information that can be used in the inference process. The need to include sparsity constraints on the gene network in the inference method had been addressed before by Yeung et al. (139), where the authors propose a methodology to reverse engineer gene networks using singular value decomposition (SVD)

and robust regression. Assuming that the system is close to a steady state, the dynamics can be approximated by:

\dot{x}_i(t) = -\lambda_i x_i(t) + \sum_{j=1}^{N} W_{ij} x_j(t) + b_i(t) + \xi_i(t) \qquad (1.6)

where xi are the levels of the mRNA for gene i, λi are the self-degradation rates for those mRNAs, bi are external perturbations and ξi represents noise. Wij represents the interaction between genes i and j, more precisely indicating the strength of the regulation that gene j exerts on gene i. Using SVD to overcome the lack of observations compared to the number of variables typically found in experimental conditions, together with the imposition of sparseness on the gene network (thereby limiting the number of interactions that need to be determined, as in (137)), the authors reconstruct a few networks to illustrate the method. These reconstructions are obtained when the perturbations applied to the system (the bi's) do not take it very far from its reference steady state; the perturbations need to be small enough that the relaxation of the system can be interpolated by the linear approximation made. The methods described previously rely on steady state measurements of gene expression levels after some perturbation is applied to the system. However, several functional genomics projects are focused on the immediate response of the system to the perturbation and, therefore, a time series of the response is recorded (e.g., 140). These kinds of experiments require a different kind of inference algorithm, one that takes the time response into consideration in order to recover the causal network of interactions. As mentioned previously, the interaction between species in a dynamical system can be approximated by a Jacobian matrix (see Equation 1.3). If one keeps the system close enough to the reference steady state, the Jacobian matrix will not change. Therefore, perturbations that yield a response from the system but remain in the vicinity of this reference steady state will not cause the Jacobian to change.
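The steady-state reading of Equation 1.6 can be sketched as follows. For simplicity the self-degradation rates are folded into the diagonal of a single connectivity matrix, noise is omitted, and as many perturbation experiments as genes are assumed, so the recovery is exact; with fewer experiments the pseudoinverse (computed via SVD) yields only the minimum-norm member of the solution family, from which Yeung et al. additionally select for sparseness. All matrices here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of genes in the toy system

# Hypothetical connectivity matrix combining -lambda_i on the diagonal
# with the W_ij couplings of Equation 1.6; values are arbitrary.
A_true = -np.eye(n) + 0.2 * rng.standard_normal((n, n))

# At steady state, x' = A x + b = 0, so A X = -B with one column of
# X (responses) and B (perturbations) per experiment.
B = np.eye(n)                    # n independent perturbations
X = np.linalg.solve(A_true, -B)  # measured steady-state responses

# SVD-based recovery: A = -B X^+, the pseudoinverse handling the
# general (underdetermined) case as well.
A_est = -B @ np.linalg.pinv(X)
print(np.allclose(A_est, A_true, atol=1e-6))  # → True
```

With real data, X has fewer columns than genes and is noisy, which is exactly why the sparseness constraint and robust regression are needed on top of the SVD.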
If one can make such perturbations then the reconstruction of the Jacobian is simple and the system can be recovered. Several examples of successful reconstruction of chemical reaction structures have been reported (see 130; 131; 141; 142; 143), in which the Jacobian matrix elements for a chemical reaction system were determined by successive perturbations to the system around a reference steady state. However, the application of such assumptions to biological systems is hard to achieve due to the magnitude of the problem. In theoretical terms, provided that one can perturb all of the variables being modeled in a biological system, Sontag et al. (144) show that the reconstruction of a small signaling network from time series data is possible. In more realistic terms, however, it must be acknowledged that the inference of biological networks from data will only provide a relatively coarse-grained view of the network (145).

Correlation based approaches to network inference

Several reverse engineering techniques are focused on reconstructing the biological network using statistical measures. Correlation has been a favored measure for this purpose, and several methods have been proposed that use correlation as a metric for determining the relationships between variables in a biological system. One of the first uses of correlation for network reconstruction was the work of Arkin et al. (130), which uses time-lagged correlations to reconstruct a chemical reaction network from time series data.

The complexity of biological systems is staggering, hindering our understanding of the organism as a whole due to our inability to deal with such complexity. However, several decades of information on these systems have been collected, and any attempt at the reconstruction of biological systems could (and probably should) take this information into account. As pointed out by Förster et al. (146), integration of different levels of information will be important to realize how the system is orchestrated.
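A minimal sketch of the time-lagged correlation idea on synthetic data (the signal, noise level, and 3-sample delay are illustrative assumptions, not Arkin et al.'s actual procedure): the lag at which the correlation between two species peaks suggests the direction and delay of their causal relationship.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic time series: species b tracks species a with a 3-sample
# delay, plus measurement noise.
a = rng.standard_normal(200)
b = np.roll(a, 3) + 0.1 * rng.standard_normal(200)

def lagged_corr(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag)."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    return np.corrcoef(x, y)[0, 1]

# Scan lags and pick the one with maximal correlation.
corrs = {lag: lagged_corr(a, b, lag) for lag in range(10)}
best = max(corrs, key=corrs.get)
print(best)  # → 3
```

The peak at lag 3 recovers the delay that was built into the synthetic data; in the correlation metric construction, such lagged correlations over all species pairs feed the reconstruction of the reaction network.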

Reverse engineering biological modules

Our understanding of biological systems may be limited to a list of "modules" that are assembled together to compose a highly complex system (see Section 1.3.3). Therefore, reverse engineering of biological systems may be limited to reconstructing highly coarse-grained networks of biological modules, which can then be deconstructed and analyzed in greater detail. Reverse engineering a network with thousands of genes is a very hard task. However, if this could be reduced to a set of modules or clusters one or two orders of magnitude smaller, then the problem may become solvable. It is important to acknowledge that all of the available techniques applied to a data set are able to retrieve some information from it, no matter how minimal. The reduction of the problem's complexity by reverse engineering networks of clusters of genes may represent, in the end, a large gain in the informational content that is extracted. Some of the methods used in data analysis (see Section 1.6) are aimed at finding similarity in patterns among variables. Popular methods such as hierarchical clustering or principal components analysis, when applied to high-throughput data, aggregate variables into groups that show a similar pattern. One can then interpret these variable groupings as modules, and a coarse network of interactions between the modules can be derived, reducing the complexity of the network to be analyzed considerably. Approaches such as this one have been tried in the context of metabolomics research in our laboratory, in which large data sets are reduced to a small set of variables on which the reverse engineering methodology is applied (Adaoha Ihekwaba, unpublished work). Approaches like this can provide initial estimates of regulatory networks which may hint at which clusters or groupings should be targeted for a better understanding of the organismal response to some stress or perturbation.
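The coarse-graining idea can be sketched as follows: genes are grouped by correlation, each group is averaged into a module profile, and a small module-level correlation network is computed. The greedy threshold grouping below stands in for hierarchical clustering or PCA, and the data, the 0.8 threshold, and the module count are all synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# 12 synthetic genes driven by 3 hidden module signals; each gene is a
# noisy copy of its module's signal (all values are illustrative).
T, genes_per = 50, 4
signals = rng.standard_normal((3, T))
data = np.vstack([s + 0.2 * rng.standard_normal((genes_per, T))
                  for s in signals])

# Greedy single-pass grouping: each unassigned gene founds a module and
# claims every unassigned gene correlating with it above 0.8.
C = np.corrcoef(data)
labels = np.full(len(data), -1)
k = 0
for i in range(len(data)):
    if labels[i] == -1:
        labels[(labels == -1) & (C[i] > 0.8)] = k
        k += 1

# Average each module into a profile, then correlate the modules to get
# a coarse module-level network instead of a 12x12 gene network.
profiles = np.vstack([data[labels == m].mean(axis=0) for m in range(k)])
module_net = np.corrcoef(profiles)
print(k)
```

The 12-gene problem collapses to a network over 3 module profiles, which is the kind of reduction that makes reverse engineering tractable for larger systems.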

1.5.2 Availability vs. applicability

The scope and range of available reverse engineering methods is very wide. This diversity of methodologies illustrates not only the complexity of the reverse engineering problem but also the variety of data types resulting from different functional genomics experiments. Each method tends to emphasize its success rate on a given set of experimental conditions, with no attempt made to generalize to different kinds or volumes of data. Methods based on purely statistical approaches (as in 147) depend heavily on the amount of data that is presented, while other methods (134; 135; 137; 136) rely on the possibility of performing very small perturbations on all of the system variables. Application of reverse engineering methods to experimental data must satisfy two simple requirements: (i) ease of use and (ii) ease of interpretation of the results. Ease of use means that the method should be applicable by an experimental researcher with no need to tweak parameters or delve deep into the theory behind the method. Ease of interpretation means that the experimental biologist should be able, following the method's criteria, to interpret the results cleanly. The methods available often meet only one of these conditions. On the one hand, the correlation metric construction method (130) is easy to apply if the data requirements are met, but interpretation of the results requires some expert analysis. On the other hand, the methods by Gardner (137) or de la Fuente (134; 135) give easy-to-analyze results, but the optimal parameters vary from problem to problem, and the search for them can be time consuming and not always intuitive. If these methods are to be applied, then the two conditions should be met. In the post-"omic" era the need for more general reverse engineering methods is pressing.

1.6 Data analysis for the “omes”

With the increase in the amounts of data collected, the need for appropriate statistical and mathematical methodologies has grown tremendously. As pointed out by Bittner et al. (148), the set of analytical tools available to analyze large data sets is meager compared to the data that is expected to be analyzed. The type of data collected from microarray experiments (see 21; 149; 150), proteomics (as in 1) and metabolomics (151; 152; 3; 2) demanded a new look at data analysis tools that could handle these data. A whole collection of multivariate statistics tools has been used in functional genomics data analysis. The most prominent have been clustering (153) and principal component analysis (e.g., 3; 154; 155), but methods such as discriminant function analysis (see 156; 157) or biplot displays (158; 159; 160) are becoming valuable tools for the analysis of multivariate data. Tools such as factor analysis have been scarcely used (161), though the analysis of these data will surely entail the testing of all of the available methods and the emergence of new ones to tackle such a herculean task.

Data analysis is aimed at extracting information from the data and, thereby, helping in its understanding. In order to perform an accurate analysis of the data the researcher must understand (i) the complexity of the problem and (ii) the expected results from the analysis. If these requirements are overlooked or ignored, any analysis method will yield bogus results. Applying an analysis method without care for its capabilities, strengths and weaknesses is a matter of concern, and the results obtained should be viewed with skepticism. The methodologies available for data analysis are often based on solid mathematical grounds and contain assumptions that must be met in order for the analysis method to produce acceptable results. Therefore, cases can be found in the literature in which the data have been "transformed" in some way in order to satisfy the assumptions of the favored method of analysis. As pointed out by Beechem (38), data transformation should be kept to a minimum, other than linear operations such as multiplying, dividing by, or adding a constant. Our understanding of a data set is related to the amount and quality of information that can be extracted from it. Applying techniques like correlation analysis, analysis of variance (ANOVA), principal components analysis, clustering, factor analysis or discriminant function analysis will help uncover structures in the data sets that provide valuable information for the purpose of building mathematical models of the biological system under study.

1.6.1 Correlation analysis in “omics” research

The analysis of correlation between variables in a data set has been a focus of research in "omics", with a great emphasis in metabolomics for the purpose of inference of metabolic networks (see Section 1.5 for a discussion on reverse engineering of biological networks). The construction of "metabolic correlation networks" has been discussed thoroughly in the literature (3; 154; 155; 162), where it is assumed that the high correlation observed between variables in a metabolomics sample must be due to closeness of these metabolites in the metabolic map. Under this assumption, a high correlation value between two metabolites would imply that they are close together in a pathway (e.g., glucose-6-phosphate and fructose-6-phosphate), while a low correlation would imply that they are far apart (e.g., glucose-6-phosphate and pyruvate). Beyond the fact that "metabolic distance" alone can be misleading (see Figure 1.3), a paper published in 2003 by Steuer et al. (163) shows that the "correlation network" cannot be used to infer the structure of the real biochemical network underlying the data (nevertheless, the terms and concepts are still applied, as a recent paper addresses the issue of metabolic network inference from "metabolic correlation networks" – see 2). However, the relationships observed between metabolites, in either simulated or real data sets, are unlikely to occur by chance and, thus, there has to be some connection between these relationships and the biochemical network. Results show that the relationships between metabolites can be an indication of the control of those metabolites by the enzymes present in the system (see (164) and Chapter 2).
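A "metabolic correlation network" of the kind discussed above can be sketched on synthetic replicate data, in which one metabolite pair is forced to co-vary; the 0.9 Spearman threshold is an arbitrary choice, and, as Steuer et al. showed, the resulting edges should not be read as the underlying reaction network.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy replicate data for 4 metabolites across 30 biological replicates:
# m0 and m1 share a common driver (e.g. a near-equilibrium pair), the
# others vary independently. All values are synthetic.
n_rep = 30
shared = rng.standard_normal(n_rep)
data = np.vstack([shared + 0.1 * rng.standard_normal(n_rep),
                  shared + 0.1 * rng.standard_normal(n_rep),
                  rng.standard_normal(n_rep),
                  rng.standard_normal(n_rep)])

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# Edge between two metabolites if |Spearman r| exceeds the threshold.
n = len(data)
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if abs(spearman(data[i], data[j])) > 0.9]
print(edges)
```

Only the engineered pair survives the threshold, mirroring the observation that most metabolite pairs show little correlation while a few are strikingly correlated.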

Correlation analysis can provide information as to how the variables in a system are related to one another, in linear terms and under assumptions of normality of the data. Distortions in the data due to noise or deviations from normality will necessarily imply that any conclusions or assumptions made after the analysis of the data are potentially erroneous. Nevertheless, correlation analysis can provide an initial guess at the relationships between variables.

Chapter 2

On the origin of strong correlations in metabolomics data

This chapter is based on published material:

Camacho, D., de la Fuente, A. and Mendes, P. (2005), The origin of strong correlations in metabolomics, Metabolomics, 1, 53-63

Martins, A. M., Camacho, D., Shuman, J., Sha, W., Mendes, P. and Shulaev, V. (2004), A systems biology study of two distinct growth phases of Saccharomyces cerevisiae cultures, Curr. Genomics, 5, 649-663

“It is extremely difficult to exactly analyze the mathematical behavior of biochemical systems which are complex enough to have biological significance. This is primarily because of the nonlinearity and complexity of the systems.” - David Garfinkel

2.1 Abstract

A phenomenon observed early in the development of metabolomics as a systems biology methodology consists of a small but significant number of metabolites whose levels are highly correlated between biological replicates. Contrary to initial interpretations, these correlations are not necessarily only between neighboring metabolites in the metabolic network. Most metabolites that participate in common reactions are not correlated in this way, while some non-neighboring metabolites are highly correlated. Here we investigate the origin of such correlations using metabolic control analysis and computer simulation of biochemical networks. A series of cases is identified which lead to high correlation between metabolite pairs in replicate measurements. These are 1) chemical equilibrium, 2) mass conservation, 3) asymmetric control distribution, and 4) unusually high variance in the expression of a single gene. The importance of identifying metabolite correlations within a physiological state, and changes of correlation between different states, is discussed in the context of systems biology.

2.2 Introduction

Large-scale molecular profiling methods, often referred to as "omics", are becoming predominant in molecular biology. They have facilitated the appearance of observation-driven hypothesis generation experiments, where the emphasis is predominantly on identifying new phenomena, rather than investigating a known one in detail. These technologies are also becoming important to a systems biology approach, where they are applied in the context of a solid theoretical background and through computational models (32). Global transcript analysis through microarrays (165) is the most commonly used technique, followed closely by large-scale protein identification and quantification (166), and analysis of protein-protein interactions (167; 168). There are also several approaches for metabolite identification and quantification (169; 170; 171), commonly known as metabolomics or metabolite profiling (but see 162, for a clarification of terms). The techniques used in metabolomics are predominantly based on chromatography and mass spectrometry (152; 172), although nuclear


Figure 2.1: Metabolite scatter plots. Metabolite pair scatter plots of replicate samples of wild type potato tubers, measured with GC-TOF-MS (data from Weckwerth et al., 2004). Plots show Z-transformed intensities (i.e. mean centered and scaled by standard deviation). A: glutamate-glutamine pair, which are uncorrelated (r = 0.0243, Spearman) despite being metabolic neighbors through glutamine synthetase (EC 6.3.1.2). B: valine-methionine, which do not participate in common reactions but have a strong correlation (r = 0.9510, Spearman).

magnetic resonance (169; 170), Fourier-transform infra-red spectroscopy (173), and capillary electrophoresis (174; 175) are also commonly used. Like the other "omic" techniques, metabolomics data are usually in the form of ratios of concentrations; absolute concentrations are rarely obtained except in targeted analyses that cover only a small number of metabolites. Metabolomics is a crucial tool in systems biology because it monitors the ultimate products of gene expression (173): organic molecules that are not directly encoded in the genome and are synthesized by a diversity of enzymes. Metabolites are produced from other metabolites, resulting in a level of interdependence between their concentrations that does not exist between transcripts or proteins. These constraints result from the structure of the metabolic network (stoichiometry) and, when known, can be used to derive structural biochemical properties of those networks (e.g. (176)). But currently knowledge of the structure of metabolic networks is limited to the primary metabolism of microbial model organisms and some mammalian tissues. Very little is known about secondary metabolism or even the primary metabolism of many organisms; on the order of one hundred thousand natural products have no known synthetic pathway. Thus, stoichiometry-based analysis of metabolomics data is currently limited to a handful of cases. What remains to be seen is whether metabolomics data can actually be used to uncover those unknown metabolic networks. Metabolomics data can be analyzed with the same methods used in transcriptomics and proteomics, such as clustering (3), principal component analysis (177; 173), or machine learning (178; 179). Additionally, it may be possible to develop novel analyses by exploring the vast body of existing theory on metabolism and its regulation (180; 28; 181; 182; 82; 183).
The present text describes an interpretation of certain metabolomics data structures using concepts from metabolic control analysis (28; 182; 82). Such a use of theoretical concepts is expected to result in analyses that yield a greater understanding of the underlying processes, absent from most of the methods in current use. Even though metabolomics approaches are still sparsely reported, they are already revealing very interesting phenomena. Perhaps the most striking one was the observation, in a comparison between four different Solanum tuberosum genotypes, that a small number of metabolite pairs displayed a remarkably high correlation among biological replicates, even though the large majority of metabolite pairs showed little or no correlation (3). Subsequent studies confirm the ubiquity of this phenomenon with different techniques (2) and different organisms (25; 184; 46). Figure 2.1 illustrates this phenomenon through two metabolite scatter plots, one with nearly no correlation and another with high correlation. Since the large majority of metabolite pairs do not show high correlation, and cases like the valine-methionine pair of Figure 2.1B are rare, it becomes even more imperative to understand why such correlations exist. A naïve interpretation of this phenomenon of metabolite correlations could be that the pairs with high correlation are neighbors in the underlying metabolic network. If so, then observations of this phenomenon could help resolve unknown metabolic networks. Unfortunately this does not resist simple scrutiny, because there are many pairs of metabolites that are neighbors in the metabolic map yet have low correlation (e.g., Figure 2.1A), and others that are not neighbors but have high correlation (e.g., Figure 2.1B). This has indeed been shown by theoretical and computational analyses that point to the correlations being shaped by a combination of stoichiometric and kinetic effects (163).
This helped establish that not all neighboring metabolites have high correlations (as most do not), but it did not go as far as to explain what originates the high correlations. Knowing this would allow us to infer valuable knowledge about the biochemical organization of cells. The aim here is to further investigate the origin of these high correlations. This will be done using the established principles of metabolic control analysis and computer simulation of example biochemical networks. This analysis extends the utility of metabolite profiles to diagnosing global regulatory phenomena transcending the metabolite level, which emphasizes the important role of metabolomics in systems biology approaches.

2.3 Methods

2.3.1 Theoretical

In this work concepts from metabolic control analysis (28; 185; 182; 82) are used, in particular co-response analysis (186; 187).

2.3.2 Computational

Base models All simulations were carried out with the Gepasi software (188; 189; 190), version 3.30, on a Pentium Centrino 1.4 GHz computer (Dell Corp., Round Rock, TX) running Windows XP (Microsoft Corp., Redmond, WA). A mathematical model of yeast glycolysis is used as an example of a metabolic network. The model of Teusink et al. (191) was adopted, with adjustments proposed by Pritchard and Kell (192), which provides a fairly accurate model of yeast glycolysis in the presence of 50 mM glucose in the medium. A second set of parameters for this model, corresponding to a different state of the system, was sought. Unfortunately none is available at this time for yeast, so one was created artificially. While this does not correspond to any real instance of yeast glycolysis, it illustrates what happens when there is a change in regulation. In order to obtain this second "physiological" state, the glucose concentration in the medium was reduced to 20 mM and the enzyme limiting rates (Vmax) were reduced, representing a change in enzyme concentrations. In this new state, the concentrations of the enzymes in the upper part of glycolysis were reduced to 50%, while the concentrations of the enzymes in lower glycolysis were reduced to 75% of their original values. The fixed rates of the succinate, glycogen, and trehalose branches were also reduced to 75%, while the fixed rate of the glycerol branch was reduced to 50%. This new state represents an example of changes in environment and regulation through gene expression; it does not attempt to simulate an actual physiological condition.

Biological replicates The phenomenon discussed here is based on correlation between components of biological replicate samples. Thus it is necessary to introduce variability in the simulations, but in a way that matches the differences between different organisms of the same species. In the related work of Steuer et al., variability was introduced by adding intrinsic noise to some metabolite concentrations, which was achieved through stochastic ordinary differential equations (163). In this way, random perturbations are constantly affecting the system, perhaps similar to what would happen in the presence of thermal noise (though the magnitude of the noise used in those simulations is arguably much larger than the results of thermal noise). A different strategy is used here to simulate variability between organisms: each sample is set to have slightly different enzyme concentrations in the initial state, which then remain constant in time (i.e. the variability is not noise). Such different enzyme concentrations would reflect differences in expression levels and, if kept to a small relative magnitude, would represent individuals (or cultures) that are almost the same but with small differences. In the present case, each "biological" replicate differs from the base model by random deviations of enzyme concentrations between 90% and 110% of the base value (pseudo-random numbers drawn from a uniform distribution). In this way "replicates" containing biological variation can be simulated.
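The replicate-generation scheme can be sketched on a deliberately tiny stand-in model rather than the full glycolysis model: a hypothetical linear pathway S → M1 → M2 → (sink) with mass-action steps, whose steady states are M1 = e1·S/e2 and M2 = e1·S/e3. Only the sampling scheme is taken from the text; the pathway and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy pathway S -> M1 -> M2 -> (sink) with mass-action rates
# v_i = e_i * substrate, giving steady states M1 = e1*S/e2 and
# M2 = e1*S/e3 (an illustrative stand-in for the glycolysis model).
S, n_rep = 1.0, 200

# Each "biological replicate" draws its three enzyme levels uniformly
# from 90%-110% of the base value and then keeps them constant.
e = rng.uniform(0.9, 1.1, size=(n_rep, 3))
M1 = e[:, 0] * S / e[:, 1]
M2 = e[:, 0] * S / e[:, 2]

# Shared control by e1 induces correlation between the two metabolites
# across replicates, even though no noise acts during each simulation.
r = np.corrcoef(M1, M2)[0, 1]
print(round(r, 2))
```

Even this minimal scheme produces a clearly positive metabolite-metabolite correlation across replicates, illustrating how small, static differences in enzyme levels alone can generate the correlations studied in this chapter.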

Co-response profiles While biological replicates are assumed here to differ by random (but small) differences in all enzyme concentrations, it is important to investigate how each enzyme concentration alone affects the system. In order to do this, simulations were carried out where each enzyme concentration is varied from 90% to 110% of its base value, while keeping the other enzyme concentrations fixed at their base values. Results are plotted as co-response profiles, showing how two metabolite concentrations are affected by the enzyme concentrations, in log-log plots.
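The single-enzyme scan can be sketched in the same style. The steady_state function below is a stand-in for the model solver (in the original work, Gepasi computed steady states of the glycolysis model); its closed form and the metabolite names are invented for illustration only:

```python
import numpy as np

def steady_state(enzymes):
    """Toy stand-in for the model solver; the real steady states came
    from numerical solution of the glycolysis model in Gepasi."""
    return {"G6P": 0.20 * enzymes["HXT"] / enzymes["PGI"],
            "F6P": 0.15 * enzymes["HXT"] / enzymes["PGI"]}

base = {"HXT": 1.0, "PGI": 0.5}

def co_response_profile(enzyme, n_points=11, rel_range=0.10):
    """Vary one enzyme from 90% to 110% of its base value, keeping the
    others fixed; return log10 concentrations of two metabolites."""
    profile = []
    for factor in np.linspace(1 - rel_range, 1 + rel_range, n_points):
        e = dict(base)
        e[enzyme] = base[enzyme] * factor
        s = steady_state(e)
        profile.append((np.log10(s["F6P"]), np.log10(s["G6P"])))
    return profile

profile = co_response_profile("HXT")
```

In this toy model both metabolites are proportional to HXT, so the log-log co-response profile is a straight line of slope 1; in the full model each enzyme traces its own segment.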

2.3.3 Yeast model expansion

The application of this approach to experimental data was performed on yeast metabolomics data representing two different stages of the growth of the organism, the exponential growth phase and the diauxic shift phase. The yeast glycolysis model of Teusink et al. (191) was combined with a model for yeast glycerol biosynthesis (193). The strategy was to combine the two by substituting the fixed rate step that represented glycerol synthesis in the glycolytic model with the full model of Cronwright et al. In addition, the other fixed rate steps in the glycolytic model (trehalose, pyruvate, and glycogen branches) were substituted by first order reactions, and glycerol transport was explicitly added to the model. The latter is required if the intracellular glycerol is to become a variable in the model. The other branches need to be sensitive to substrate concentration, or the model may not converge to steady states when the glucose concentration is reduced. In order to effect these changes, some parameter values representing the enzyme concentration (limiting rates or first order rate constants) had to be calibrated before the model could reproduce existing data. Non-linear optimization algorithms (190; 194) were used to achieve the following model properties: 1) the intermediate concentrations of the main glycolysis branch were as in (191), 2) the internal concentration of glycerol was as in (193), 3) the ratio of internal to external glycerol concentration was 2.5 (195; 196). For points 2 and 3 the fixed external glycerol concentration is first calculated from the data of Cronwright et al. (using the ratio of 2.5), and then the original internal glycerol concentration becomes part of the objective function. The parameters adjusted by the optimizer were the limiting rates of the various enzymes of glycolysis and glycerol synthesis, plus the first order rate constant of passive glycerol transport (keeping the equilibrium constant of that step equal to unity).
These steps were carried out for a first model that represents the exponential growth phase, with high external glucose concentration (191). An attempt was then made to create a new model mimicking the post-diauxic phase by repeating the process, but now with lower external glucose concentration. The original glycerol biosynthesis model was published with parameters for an early stationary phase, which were adopted for a simulated post-diauxic phase. Target concentrations of the glycolytic intermediates in this state were needed in order to re-adjust the glycolytic enzyme limiting rates (Vmax). Because no published values for these internal concentrations in post-diauxic phase were found, the values calculated from the paper by Teusink et al. (191) were multiplied by a factor that was the ratio between average values of these metabolites in the present metabolomics profiles. Unfortunately, the data include only 4 glycolytic intermediates. A model was obtained that satisfied these constraints, but no validation was possible.

2.4 Discussion

2.4.1 Metabolic control analysis and correlations

A collection of measurements of metabolite concentrations obtained from multiple observations of replicate biological samples (see Figure 2.1) can be considered as perturbations of a mean state. The variance observed in these measurements can be considered to arise from slight differences in internal and external parameters between individuals, such as enzyme concentrations, kinetic constants, and environmental conditions. In order to express this quantitatively it is useful to introduce the concentration response coefficient, which is defined as the relative change in the steady state level of a biochemical concentration in response to a relative change in some parameter (185; 82). Formally, this is expressed by the dimensionless coefficient:

R^{X_i}_{p_k} = \frac{p_k^0}{X_i^0}\,\frac{dX_i}{dp_k} = \frac{d\ln X_i}{d\ln p_k} \qquad (2.1)

Here $X_i$ is the concentration of interest, with $X_i^0$ its reference value; $p_k$ is the parameter that changed (e.g., an enzyme concentration), with $p_k^0$ its reference value. One can estimate the displacement of the concentration of interest from its reference steady state, when caused by a known parameter change:

\Delta\ln X_i \approx R^{X_i}_{p_k} \Delta\ln p_k \qquad (2.2)

The displacement from the reference state is proportional to the parameter change, with the proportionality constant being the response coefficient. In most cases of interest this is, in fact, an approximation because the responses are nonlinear. Expanding from the previous situation, where a single parameter was changed, to the general case when n parameters are changed, the displacement in concentration can be written as a sum of n terms, each corresponding to the effect of a single parameter:

\Delta\ln X_i \approx \sum_{k=1}^{n} R^{X_i}_{p_k} \Delta\ln p_k \qquad (2.3)

When the concentrations of two metabolites, Xi and Xj, that suffered multiple perturbations, are plotted against each other, the coordinates in the logarithmic plane of each observation relative to the reference are determined by:

\frac{\Delta\ln X_i}{\Delta\ln X_j} \approx \frac{\sum_{k=1}^{n} R^{X_i}_{p_k} \Delta\ln p_k}{\sum_{k=1}^{n} R^{X_j}_{p_k} \Delta\ln p_k} \qquad (2.4)

In the special case where a single parameter has been changed Equation 2.4 reduces to:

\frac{\Delta\ln X_i}{\Delta\ln X_j} \approx \frac{R^{X_i}_{p_k} \Delta\ln p_k}{R^{X_j}_{p_k} \Delta\ln p_k} = \frac{R^{X_i}_{p_k}}{R^{X_j}_{p_k}} \equiv {}^{p_k}O^{X_i}_{X_j} \qquad (2.5)

where ${}^{p_k}O^{X_i}_{X_j}$ is the ratio of two response coefficients, known as a co-response coefficient (186; 187). In this special case of a single parameter change, the metabolite scatter plots discussed here are the same as co-response profiles. All observations lie approximately on a straight line with slope ${}^{p_k}O^{X_i}_{X_j}$, and the correlation coefficient would approximate +1 or −1, depending on the slope being positive or negative, respectively. Technical variance, reflected as error in the measurements, would reduce these values, though this is expected to be a small effect. The length of the line defined by the observations depends on the size of the parameter perturbations and on the sensitivity of the metabolites towards that parameter. That length can be calculated using the Pythagorean theorem:

d = \sqrt{\left(R^{X_i}_{p_k}\Delta\ln p_k^{max} - R^{X_i}_{p_k}\Delta\ln p_k^{min}\right)^2 + \left(R^{X_j}_{p_k}\Delta\ln p_k^{max} - R^{X_j}_{p_k}\Delta\ln p_k^{min}\right)^2} \qquad (2.6)

where $\Delta\ln p_k^{max}$ and $\Delta\ln p_k^{min}$ correspond to the two extreme perturbations of the parameter. In general, the underlying differences between replicate samples arise from multiple parameters, and thus the metabolite scatter plots obtained from them are not the same as co-response profiles, unlike what had been suggested earlier (197). Because the difference in parameter values between each pair of replicates is expected to be random, the slope defined by the concentrations of two metabolites in each pair of samples is expected to be different, generating a cloud of points. The shape of that cloud is determined by axes whose slopes correspond to the individual co-response profiles for each of the parameters, and whose lengths depend on the size of each parameter's fluctuations and the sensitivity of the metabolites towards them (see Equation 2.6); this is demonstrated below with simulations (Figures 2.3 and 2.4). When many parameters vary there will be many such axes, potentially with widely different slopes, and in general the correlation is not expected to be high. However, there are several special cases in which many parameters can vary and still yield high correlation between two variables. A few of these cases are examined here.
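The qualitative argument — one dominant axis yields high correlation, while several comparable axes produce a diffuse cloud — can be checked numerically with Equation 2.3. The response coefficients below are made up for illustration; they are not values from the yeast model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scatter(R_i, R_j, n_samples=200, spread=0.05):
    """Log-deviations of two metabolites as in Equation 2.3:
    each sample perturbs all parameters at random, and
    dln X = sum_k R_k * dln p_k for each metabolite."""
    dlnp = rng.uniform(-spread, spread, size=(n_samples, len(R_i)))
    return dlnp @ np.asarray(R_i), dlnp @ np.asarray(R_j)

# Case a: one parameter dominates both responses -> high |r|.
xi, xj = simulate_scatter(R_i=[5.0, 0.1, 0.1], R_j=[4.0, -0.1, 0.1])
r_dominant = np.corrcoef(xi, xj)[0, 1]

# Case b: several comparable parameters in different directions -> low |r|.
xi, xj = simulate_scatter(R_i=[1.0, -1.0, 0.8], R_j=[0.9, 1.1, -0.7])
r_mixed = np.corrcoef(xi, xj)[0, 1]
```

With the dominant parameter, |r| is close to 1; with comparable parameters of mixed sign it falls well below.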

One parameter dominates

One special case occurs when only one term in Equation 2.4 dominates. This occurs when

$R^{X_i}_{p_k} \Delta\ln p_k \gg R^{X_i}_{p_l} \Delta\ln p_l$ and $R^{X_j}_{p_k} \Delta\ln p_k \gg R^{X_j}_{p_l} \Delta\ln p_l$ for all $l \neq k$. This special case can occur when a) the variability of one parameter is much higher than that of any of the other parameters, or b) the responses of the two concentrations towards one parameter are much higher than towards any other (a third case combining these two for the same parameter would result in an even greater effect, except if they happen for two different parameters, when their effects would cancel). In these circumstances the contribution of one parameter to the concentration of the two metabolites dominates and the other parameters become negligible. One of the axes is much longer than the others, dominating the shape of the scatter plot; most points will be aligned with this axis, yielding a high correlation. Effectively only one parameter varies, reducing Equation 2.4 to Equation 2.5. An interesting case may arise here: if two concentrations, A and B, are in this situation (high correlation due to dominance of a single parameter), and one of them, say A, is also highly correlated with a third concentration C, then by necessity B and C will also have high correlation. This is because if a single parameter dominates the correlation of A with B, then that same parameter must be the one that correlates A with C, and therefore also B with C. Such a phenomenon may lead to groups of metabolites whose concentrations are tightly correlated, forming a clique. An analysis of cliques of correlated metabolites (154) is likely to uncover such cases.

Equilibrium and mass conservation

Another special case occurs when all co-response profiles lie on top of each other, i.e. both variables respond in the same direction towards all parameters. In this case all co-response coefficients are equal:

\frac{R^{X_i}_{p_k}}{R^{X_j}_{p_k}} \equiv {}^{p_k}O^{X_i}_{X_j} = \alpha \qquad (2.7)

for all $k$. We can write $R^{X_i}_{p_k} = \alpha R^{X_j}_{p_k}$ for all $k$, and express Equation 2.4 solely in terms of responses of variable $j$:

\frac{\Delta\ln X_i}{\Delta\ln X_j} \approx \frac{\sum_{k=1}^{n} \alpha R^{X_j}_{p_k} \Delta\ln p_k}{\sum_{k=1}^{n} R^{X_j}_{p_k} \Delta\ln p_k} = \alpha \qquad (2.8)

All observations will be approximately aligned on a line with slope $\alpha$, equal to all the co-response coefficients, and consequently the correlation will be high. This situation can happen in two different cases: a) when two metabolites are in (or close to) chemical equilibrium, or b) when they share a conserved moiety (198). In the case of equilibrium $\alpha$ is a positive constant, and in the case of moiety conservation it is negative, with the correlations being close to +1 and −1, respectively.

Different physiological states

It is important to note that this interpretation of correlation in metabolite scatter plots based on response coefficients is only valid for replicates of a single physiological state. This is because response coefficients are linear approximations around a reference state, but biochemical systems are governed by non-linear interactions. If samples from two different physiological states are used, they will likely not align, resulting in a low correlation, even if each of the states alone showed high correlation. On the other hand, if the mean concentrations of the two states differ by a large amount, the points in the scatter plot will form two widely separated clusters, and this may result in a spurious high correlation. This will happen even if there was no correlation within either state. Thus, one should be careful to only combine data from a single physiological state. More relevant than combining samples of different states in a single scatter plot is to compare scatter plots obtained from each single state. This will be particularly important when one physiological state shows high correlation but the other does not, or when the correlations are both high but of opposite sign. Examples of this state-dependent correlation are shown below with simulations. Although it is not usually possible to decompose a metabolite scatter plot into the independent contributions of the parameters (co-response profiles), their interpretation in terms of response and co-response coefficients yields insight into the regulation of the system, since these coefficients are indeed global measures of regulation.

2.4.2 Scatter plots and correlation

Scatter plots have traditionally been displayed using metabolite levels obtained directly from profiling experiments (151; 3; 2). These levels are peak areas corrected by the sample mass and by the area of an internal standard (152). Figure 2.2A and C demonstrate plots using metabolite levels directly; it should be obvious that these plots are best used to compare the magnitudes of the levels between states and between the two metabolites. These plots can, however, be hard to interpret in terms of correlation, due to the inherent problem of scale. In Figure 2.2A the collection of points in the two states lie along the same regression line, and one could be tempted to think that the two metabolites are correlated in all of the genotypes. As it turns out, in the wild type the correlation among replicates is low (0.26), and it only appears to be in line with the transgenic data because the average levels of the two metabolites lie on the same line delineated by the transgenic samples. The reason why this plot is inappropriate for judging correlation is that correlation is a measure independent of scale.


Figure 2.2: Comparison of correlations between metabolite pairs. Comparison of correlations between two metabolite pairs in wild type (wt) and transgenic potato tubers, measured with GC-MS (data from (3)). Pearson correlations are calculated for each set of replicate samples from the same genotype (rank correlation estimates would be very inaccurate for this small sample size). A and B: Fructose 6-phosphate and glucose 6-phosphate in wt (empty triangles) and INV-42 (filled circles); the two metabolites have low correlation in the wt (r = 0.2597) but high in INV-42 (r = 0.9947), with p < 9.3 × 10−4 for the comparison. C and D: Ornithine and sucrose in wt (empty triangles) and INV-33 (filled circles); the correlation changes from r = 0.9875 in the wt to r = -0.8965 in the transgenic (p < 1.05 × 10−6). Note the change in sign of the correlation, indicating a considerably different regulation mode in the transgenic. Plots A and C represent metabolite levels as determined from GC-MS (corrected for sample size and internal standard). Plots B and D represent the same data after subtracting the mean and dividing by the standard deviation. In A one can easily recognize that the metabolite levels in INV-42 are larger than in the wt, but B allows for a better assessment of the low level of correlation in the wt. Both transgenic lines express a yeast invertase; further details in the original publication.

Figure 2.2B presents the same data after applying the transformation of Equation 2.9:

X_i' = \frac{X_i - \bar{X}}{\sigma_X} \qquad (2.9)

where $X_i'$ are the values plotted, $\bar{X}$ is the mean, and $\sigma_X$ is the standard deviation of the concentration values of metabolite $X$. After this transformation, all linear relationships have a slope of +1 or −1. These plots are dimensionless, but their units can be interpreted in terms of standard deviations. Figure 2.1, Figure 2.2B and Figure 2.2D demonstrate their use. Note that from Figure 2.2B it is much more obvious that the wild type has lower correlation. Additionally, the plot also makes obvious that the low correlation value depends mostly on a single replicate that has a high value of fructose 6-phosphate (2 standard deviations away from the mean). An important point to make, then, is that more data would be required to confirm that these two metabolites have low correlation in the wild type potato tuber. Nevertheless, the probability that the true correlations of the two samples of Figure 2.2B are the same is around 10−4 (i.e., about 1 in 10,000 such correlation comparisons would fail to identify two equal correlations, and in this data set there are only 4,900 comparisons). The idea that these two sugar phosphates are more correlated in the genotype overexpressing an invertase gene than in the wild type does seem to make sense, giving some support to this observation. In the case when a high correlation originates from the dominance of one parameter, it would be interesting to construct the plot in log-log space, as this would match the way in which co-response profiles are usually plotted. Below we present simulation results using such log-transformed plots (Figures 2.3 and 2.4). A word of caution is warranted for the estimation of correlations from small sample sizes, since correlation is very dependent on sample size (small samples provide bad estimates). The contrast of Figure 2.1A (n = 43) and the wild type data in Figure 2.2B (n = 6) should make this obvious.
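The transformation of Equation 2.9 is a plain z-score; a minimal sketch with invented levels (not the potato-tuber measurements):

```python
import numpy as np

def standardize(x):
    """Equation 2.9: subtract the mean and divide by the standard
    deviation; linear relationships then have slope +1 or -1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

# Illustrative metabolite levels on very different scales (made up).
f6p = np.array([0.10, 0.12, 0.15, 0.20, 0.22, 0.25])
g6p = 3.1 * f6p + 0.02  # exactly linear in f6p, different scale

z_f6p, z_g6p = standardize(f6p), standardize(g6p)

# Correlation is scale-free, so standardization leaves it unchanged,
# but the standardized plot makes the unit slope directly visible.
r_raw = np.corrcoef(f6p, g6p)[0, 1]
r_std = np.corrcoef(z_f6p, z_g6p)[0, 1]
```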
A rule of thumb would be that correlation should not be calculated with n < 10, though only a few studies (151; 2) present sample sizes larger than this. Another consideration concerns the use of Pearson correlation, which assumes linearity and will provide bad estimates if the relationship between variables is curved, as is common in biochemical data. Spearman (rank) correlation provides better estimates, as it does not depend on linearity, only on monotonicity, though it seems even more dependent on large sample sizes than Pearson correlation. The problem of small sample size can be partially overcome by adopting a tight significance level when comparing correlations. But beware that with sample sizes as small as those of Figure 2.2B, even correlations as large as r = 0.95 are most likely not significant.
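The Pearson-versus-Spearman point is easy to verify on synthetic data: for a curved but strictly monotone relationship, rank correlation is exactly 1 while Pearson falls below it. The sketch below uses only NumPy, and the exponential relationship is invented, not biochemical data:

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (No ties here, so double-argsort ranking is sufficient.)"""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

# A curved but strictly monotone relationship, as is common in
# biochemical variables.
x = np.linspace(1.0, 5.0, 20)
y = np.exp(x)

r_p = pearson(x, y)   # below 1: the linearity assumption is violated
r_s = spearman(x, y)  # 1: monotonicity is all that rank correlation needs
```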


Figure 2.3: Relationship between metabolite scatter plots and co-response profiles. Triangles correspond to "biological" replicates obtained by simulation of a yeast glycolysis model where each replicate differs from the others by a small random change in all enzyme concentrations (10%, see Methods). Lines correspond to co-responses towards single enzymes, and were determined by simulation, changing the concentration of each enzyme (10%) while keeping all others constant. Values plotted are the logarithm (base 10) of the concentrations. Solid lines: hexose transporter (HXT); dashed lines: glycerol 3-phosphate dehydrogenase (G3PDH); dotted lines: enolase (ENO); dashed-dotted lines: alcohol dehydrogenase (ADH). A: glucose 6-phosphate and fructose 6-phosphate (r = 0.9988), which are near equilibrium. Note that the co-response for phosphoglucoisomerase (PGI), their isomerase, is very small but perpendicular to HXT; all others are superimposed with HXT. B: NAD+ and NADH (r = -0.9997), which form a moiety conservation cycle. Co-response to all enzymes is overlapping and in the same direction as the scatter. C: 2-phosphoglycerate and phosphoenolpyruvate (r = 0.7741), which are linked by a single enzyme (ENO). D: 1,3-bisphosphoglycerate and NADH (r = -0.0398), both products of the same enzyme (glyceraldehyde 3-phosphate dehydrogenase, GAPDH).

2.4.3 Simulations

The use of computer simulations complements the analytical results presented above, and enables the investigation of scenarios that would be difficult to probe experimentally. A well-described model of the yeast glycolytic pathway (191) was used to investigate the relationship between co-response and correlation in biological replicates. Figure 2.3 depicts simulation results corresponding to the four special cases discussed above. Figure 2.4 illustrates how correlation between biological replicates reflects changes in overall system regulation.

Equilibrium

Figure 2.3A corresponds to the case in which two metabolites are near chemical equilibrium (mass-action ratio of 0.20 for an equilibrium constant of 0.29). The majority of the co-response lines caused by each of the enzymes are aligned on top of each other, and the "biological" replicates do align along the same line. The protein towards which the concentrations of glucose 6-phosphate and fructose 6-phosphate respond most strongly is the hexose transporter (HXT), reflected in the plot as the longest of the segments. Interestingly, the co-response of the metabolites towards the enzyme that interconverts them, phosphoglucoisomerase (PGI, EC 5.3.1.9), is in a direction almost perpendicular to the dominant direction and is a weak co-response (the line segment is very short). This is in contrast to the explanation provided for the correlation of the same metabolite pair in Cucurbita maxima (151). The fructose 6-phosphate:glucose 6-phosphate pair has been observed with very high correlation in a wide variety of metabolomics studies (25; 151; 46; 3; 2). This kind of correlation appears when there is a high glycolytic flux and the reaction is near equilibrium. For lower glycolytic flux, this correlation seems to be lost, at least in wild type potato tubers (3) and yeast cultures in post-diauxic phase (46).

Mass conservation

Mass conservation relations are ubiquitous in metabolism, most commonly represented by moiety-conserved cycles. These cycles are formed by a small number of molecules that carry a common moiety whose degradation or synthesis is much slower than the reactions of the cycle. Examples of such cycles are those formed by the NAD+ moiety to carry reducing equivalents and by the adenosine moiety to carry free energy. As discussed above, members of these moiety-conserved cycles are expected to be highly correlated, and at least one of them should have negative correlation with the others (due to the mass conservation constraint). In the yeast glycolysis model there are two such cycles: NAD+-NADH and AMP-ADP-ATP. Figure 2.3B displays the scatter plot of the NAD+-NADH pair, with a rank correlation of -0.9997. All of the co-response lines are superimposed in the same direction as the replicates but, unlike the previous case, this pair is not in equilibrium. The co-response profile is dominated by the steps that interconvert the NAD+ moiety between oxidized and reduced forms, namely glycerol 3-phosphate dehydrogenase (G3PDH, EC 1.1.99.5), alcohol dehydrogenase (ADH, EC 1.1.1.1), glyceraldehyde 3-phosphate dehydrogenase (GAPDH, EC 1.2.1.12), and the succinate branch. It is important not to confuse the profiles of equilibrium and mass conservation: in the former the correlation is always positive, while in the latter there has to be a negative correlation. And while there may be reasons for negative correlations other than mass conservation, their presence in a metabolite profile should raise the question of whether those metabolites are in a mass conservation relation.

Moderate correlation

Let us turn to the case of moderate correlations, that is, where 0.6 < |r| < 0.8. In the simulation results there are many cases of moderate correlation; Figure 2.3C displays the case of 2-phosphoglycerate and phosphoenolpyruvate, which happen to be linked by the enzyme enolase (ENO, EC 4.2.1.11), but which have only a moderate correlation (r = 0.7741, rank). In this case the co-response profile has three enzymes with decreasing strength and different directions: HXT has the strongest effect, followed by G3PDH, and then ENO (the enzyme that links them). The angle formed by HXT and G3PDH is marked and defines the scatter in the "biological" replicates, with the largest scatter along the direction of HXT, and less scatter in the direction of G3PDH. Indeed, the scatter is generated by variance in each of the enzymes, with the direction set by the co-response of the metabolites towards the enzyme, and the amount of scatter proportional to the response of each of them to that enzyme. If a pair of metabolites is controlled essentially by a single enzyme, then their replicates will be correlated no matter how close they may be in the metabolic map; this correlation would originate from an amplification of the variance for that enzyme. Due to the summation theorem for concentration control, which states that the sum of all concentration control coefficients towards the same metabolite is zero (199), for one enzyme to dominate (e.g., be 10-fold larger than the others), the others must be small and have opposite sign. It can be predicted that this situation is not common, although possible. Another possibility for the high correlation is for the concentration of one of the enzymes to vary widely between replicates, which would only happen if there were no tight control of its expression. In this case, the variance is still due to that enzyme, but not by amplification.
In summary, metabolite correlation among replicates originates either from large variation of a single enzyme or through differential amplification of the variance of a single enzyme. In either case the observed correlations are properties of the whole system, not of any particular metabolite, enzyme, or reaction.

Low correlation

The final special case is the most common: when two metabolites are poorly correlated. Figure 2.3D displays the pair 1,3-bisphosphoglycerate and NADH, both products of the reaction catalyzed by GAPDH. Despite this metabolic constraint, they display a very low correlation in this model (r = -0.0398, rank). The co-response profile has four enzymes with relatively similar strength and different angles. According to the stated hypothesis, the scatter of "biological" replicates is anisotropic, resulting in the low correlation. Even though the co-response is equally dominated by four enzymes (HXT, ADH, G3PDH, and ENO), two of them alone would likely have been able to generate low correlation, as long as they formed an angle close to 90°, as is the case with the pairs HXT:ADH or G3PDH:ENO. This example strengthens the notion that neighboring metabolites may have little or no correlation not because they are unrelated, but because the variance in the enzymes that control them affects them in equal amounts and different directions. Overall, this is what happens to the majority of metabolite pairs, and is a consequence of the systemic nature of metabolic control (82). Indeed it is another manifestation of the same principle that results in dominant mutations being rare (200).

Comparing correlations

One of the important aspects that can be pursued, based on the thesis that correlation of replicates and associated scatter plots reveal aspects of regulation, concerns comparing correlations between two distinct physiological states or genotypes. Early experimental data from potato tubers (3) already displays this phenomenon, as depicted in Figure 2.2, even though it was not highlighted in the original publication. To demonstrate how these changes relate to the co-response profiles, an example from simulations is analyzed. Figure 2.4 depicts the relationships between ADP and phosphoenolpyruvate in the yeast glycolysis model in two distinct states (see Methods for details). In terms of correlation alone, the pair is moderately correlated in the high glucose state (r = -0.9079), but in the low glucose state the correlation decreases considerably (r = 0.4113). Inspection of the scatter plots (Figure 2.4) reveals that the co-response profile is dominated by HXT in the high glucose state; in the low glucose state, even though HXT still has the largest magnitude of control, several other enzymes now also have comparable control levels over the two metabolites. The correlation becomes low in the low glucose state due to some spread of the replicates but also to the nonlinearity of the relationships (here even rank correlation fails to identify the relation because there is a change in derivative along the curve). The change in regulation between the high and low glucose states (see Methods section) is well reflected both in the scatter plots and in the rank correlation. Note that the two correlations are very significantly different (p < 3 × 10−21). When this principle is applied to different genotypes, it becomes similar to the method that has become known as FANCY (169; 201), which relies on similarities in co-response to identify mutants that act on similar parts of metabolism.
However, in FANCY one is interested in the actual values of co-response, which would be determined from the mutant phenotypes, while here we are restricted to observing correlations of replicate samples that originate from biological variation filtered by the complete co-response profile.


Figure 2.4: Changes in correlation and co-response profiles. Changes in correlation reflect changes in the co-response profile of the pair ADP and phosphoenolpyruvate (both are substrates of pyruvate kinase, PYK). Triangles correspond to “biological” replicates, simulated as in Figure 2.3 (see also Methods). Lines correspond to changes in each enzyme, while keeping all others constant. Solid lines: hexose transporter (HXT); dashed lines: ATPase; dotted lines: PYK; dashed-dotted lines: glycogen branch. The plot is of the logarithm (base 10) of the concentrations. A: yeast model with high glucose concentration in the medium (model from 191), r = 0.9079. B: same yeast model with lower glucose concentration and changed basal concentration of enzymes (see Methods for further details). Rank correlation is now r = 0.4113, but note the nonlinear nature of the co-response profile. Interestingly PYK, the enzyme that has the two metabolites as substrates, only affects the concentration of phosphoenolpyruvate.

In this case there is no attempt at determining the co-response profile, which is arguably hard to do experimentally and is why we have resorted to simulation here. The two approaches are related and could be complementary.

2.5 Yeast metabolism: model expansion and correlations

The interpretation of metabolite scatter plots in terms of regulatory aspects of the biological system was further investigated in the model system Saccharomyces cerevisiae in two different stages of the organism’s growth curve. The experimental conditions of this work are described in (46) and the reader is referred to that publication for more detail. Given that metabolite profiles from all 10 biological replicates of each state were collected, statistical analysis can be extended beyond testing for different levels of metabolites, allowing estimation of correlations between metabolite pairs. Early in the development of metabolomics an interesting phenomenon was identified, where, in a small number of cases, replicate measurements of a metabolite pair display a strong correlation (3).

Figure 2.5: Metabolite correlations in yeast exponential growth phase. Metabolites are represented as nodes, with edges connecting them whenever there is a significant correlation (p < 0.0001). Positive correlations are represented with blue edges, negative ones with red edges.

These unexpected correlations are relevant because, in the presence of random variation (noise, or biological variation), there should be no correlation at the level of replicates (i.e., the levels of these metabolites should change randomly among equivalent samples). The existence of such correlations implies that there are factors that filter random fluctuations in a deterministic way, be they technical artefacts or the biochemical network itself. As was shown before, these correlations are system properties that originate from global biochemical regulation. In particular, it is hypothesized that the high correlations are due to one, or the combined effect, of: a) a single factor (enzyme) that dominates the control of both correlated metabolites, or b) fluctuations of the concentration of a single enzyme that are considerably larger (by orders of magnitude) than those of any other enzyme with significant concentration control over the two metabolites. In both cases it is assumed that the largest source of variance is biological diversity, not intrinsic or technical noise. This assumption is supported by the results described above, where the technical variance is smaller than the biological variance. Metabolite correlations within replicate measurements of the same state in our data set were analyzed. The important cases are those with very high correlations; these should be at least |r| > 0.9. Correlations that changed significantly between the two phases are also important, even if not above the stringent criterion of |r| > 0.9, as they reveal changes in global regulation. 4-Aminobutyric acid (GABA) and lysine were the only pair that maintained an extremely high correlation in both phases. The correlation is so remarkably high that it could be due to an identification artefact (one of them possibly being a derivatization product of the other).
But this is partly contradicted by the fact that the two have different patterns of correlation with all other metabolites, whereas identical patterns would be expected if one were an artefact of the other. This observation deserves further investigation; if it is a true correlation then it implies a very strong control of the concentrations of these two amino acids by a single enzyme. Figures 2.5 and 2.6 display all other strong significant correlations (p < 0.0001 and |r| > 0.95) as graphs, where two metabolites are connected if they are highly correlated. It is evident that there are many more high correlations in the exponential phase, and that in both cases the graphs appear to be “scale-free” (62). While graph representations express a great deal of information succinctly, it is not clear how they should be interpreted, nor the relevance of some of their features (e.g., the disconnected sub-graphs); further theoretical analysis is needed. Table 2.1 lists all metabolite pairs for which there was a significant change in correlation (p < 0.0001) between the two growth phases. Here the emphasis is on the change in correlation, rather than its values. It is our thesis that these changes reflect changes in the overall regulation of the system (see Section ??). A lack of correlation, however, does not imply that the metabolites are not related (or even interacting); it merely reflects that their concentrations depend on several factors (enzymes) in similar proportions. Clear cases of metabolite pairs that were highly correlated in the exponential phase, but not in the post-diauxic phase, are glucose-galactose, fructose-glucose, arginine-glutamic acid, glutamic acid-homoserine, and 8 additional pairs. The opposite case, of high correlation in the post-diauxic phase

Table 2.1: Changes in metabolite correlations (Spearman) between growth phases. Correlations were classified using the following symbols: +1, strong positive correlations; -1, strong negative correlations; +, positive correlations; -, negative correlations; 0, zero correlations. Each correlation pair (rexponential, rpost-diauxic) is significantly different (p < 0.0001 and n > 5)

Metabolite i | Metabolite j | rexponential | rpost-diauxic | Type
Histidine | VS-GC1-P1-036830-213000 | 0.93 | -0.81 | +1 / -1
Trehalose | VS-GC1-P1-011700-075000 | 0.81 | -0.87 | +1 / -1
Citric acid | VS-GC1-P1-021810-191000 | 0.95 | -0.57 | +1 / -
2-Aminoethanol | VS-GC1-P1-009650-152000 | 0.94 | -0.55 | +1 / -
Histidine | VS-GC1-P1-009650-152000 | 0.96 | -0.54 | +1 / -
Arginine | Glycine | 0.93 | -0.49 | +1 / -
2-Aminoethanol | VS-GC1-P1-018900-191000 | 0.94 | -0.44 | +1 / -
Arginine | VS-GC1-P1-030550-218000 | 0.95 | -0.43 | +1 / -
Arginine | Glutamic acid | 1 | -0.25 | +1 / 0
Histidine | Pinitol | 1 | -0.21 | +1 / 0
Arginine | Phenylalanine | 1 | -0.21 | +1 / 0
Fructose | Glucose | 0.99 | -0.15 | +1 / 0
Arginine | Homoserine | 0.99 | -0.09 | +1 / 0
Glutamic acid | Phenylalanine | 1 | -0.07 | +1 / 0
Tyrosine | VS-GC1-P1-024890-129000 | 1 | 0.04 | +1 / 0
Glutamic acid | Homoserine | 0.99 | 0.07 | +1 / 0
Galactose | Glucose | 0.99 | 0.09 | +1 / 0
Lysine | Pinitol | 1 | 0.15 | +1 / 0
4-Aminobutyric acid | Pinitol | 1 | 0.22 | +1 / 0
Histidine | Ornithine | 1 | 0.25 | +1 / 0
Lysine | Ornithine | 1 | 0.39 | +1 / +
Benzoic acid | VS-GC1-P1-018790-241000 | 1 | 0.42 | +1 / +
4-Aminobutyric acid | Ornithine | 1 | 0.43 | +1 / +
4-Aminobutyric acid | Histidine | 1 | 0.61 | +1 / +
Histidine | Lysine | 1 | 0.64 | +1 / +
Fumaric acid | α-Ketoglutaric acid | -0.18 | 1 | 0 / +1
Phosphoric acid | VS-GC1-P1-014060-188000 | 0.42 | 0.99 | + / +1
Glycerol | Malic acid | 0.54 | 1 | + / +1
Citric acid | VS-GC1-P1-024590-233000 | 0.67 | -0.93 | + / -1
VS-GC1-P1-012000-142000 | VS-GC1-P1-024130-120000 | -0.9 | 0.55 | -1 / +

Figure 2.6: Metabolite correlations in yeast post-diauxic growth phase. Metabolites are represented as nodes, with edges connecting them whenever there is a significant correlation (p < 0.0001). Positive correlations are represented with blue edges, while negative ones with red edges.

but no correlation in the exponential phase, was only observed for two members of the TCA cycle: α-ketoglutaric acid and fumaric acid. Note that the α-ketoglutaric acid concentration did not change significantly between the two stages, although its correlation with fumarate was strong in the post-diauxic phase. This highlights the advantage of studying correlations in parallel with the analysis of concentration changes. Perhaps the most interesting correlation pattern is that of two metabolite pairs which displayed high positive correlation in the exponential phase and high negative correlation in the post-diauxic phase. This change can be interpreted as one enzyme dominating the control of their concentrations in one phase, and a different enzyme taking that role in the other phase. Note that a negative correlation implies that the controlling factor affects the two metabolites in opposite directions. The two pairs that show this pattern are histidine and an unknown peak (most abundant ion mass of 213), and trehalose and another unknown peak (most abundant ion mass of 75).
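The kind of analysis performed in this section can be sketched as follows. The function names are illustrative, and the Fisher z-test shown for comparing a correlation across the two growth phases is one standard choice, not necessarily the exact procedure used in the original analysis:

```python
import numpy as np
from math import atanh, erfc, sqrt

def spearman(x, y):
    """Spearman rank correlation between two vectors of replicate
    measurements (no tie handling; adequate for continuous metabolite data)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def correlation_change_p(r1, n1, r2, n2):
    """Two-sided p-value for a change in correlation between two growth
    phases, via Fisher's z-transformation (assumes independent samples)."""
    z = (atanh(r1) - atanh(r2)) / sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return erfc(abs(z) / sqrt(2.0))

# e.g. for histidine vs. the unknown mass-213 peak (values from Table 2.1),
# correlation_change_p(0.93, 10, -0.81, 10) falls far below 0.0001
```

The rank transform makes the measure robust to the nonlinear co-response profiles discussed earlier, which is why Spearman rather than Pearson correlation is used throughout this chapter.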

2.5.1 Simulations and model validation

Before a model can be trusted, it needs to be validated against independent data (i.e., data that were not used to build it). In this study this was done using the replicate metabolite measurements. Given the premise that the variance in metabolite profiles originates mainly from biological variance, an attempt was made to simulate this variance with the two models. Because gene expression is not represented in these models, biological variance is reflected in the enzyme concentrations, which would all differ from an average by small random amounts. This was simulated in the software Gepasi (188; 189; 190), using its scanning facility, by which several simulations were generated in which (uniform) random values are added to the average enzyme concentrations. The random factors were of a magnitude calibrated to produce CVs in the 4 measured glycolytic intermediates and glycerol in agreement with the experimental measurements. Given this, validation of the model depends on it producing a pattern of metabolite correlations similar to the experimental one. This is independent of the model-building process, given that the large majority of model parameters were obtained from in vitro enzyme kinetics (193; 191). Simulation results were mixed: the exponential phase model was able to reproduce the observed correlations very well, but the post-diauxic model was not. As such, the post-diauxic model is deemed invalid and will not be discussed further. A new attempt at constructing that model will require quantitative measurements of the glycolytic intermediates, most likely needing targeted assays rather than GC-MS profiles. Table 2.2 contains the metabolite correlations obtained with the exponential phase model contrasted with those obtained in the experiment. Only the correlations of phosphoenolpyruvate (PEP) were not in agreement.
However, the level of this metabolite in the exponential phase was very low, possibly close to the limit of detection of the GC-MS assay (it was indeed detected in only 6 of the 10 replicates). It is plausible that the variance in PEP is, in this case, dominated by noise rather than biological variance. This would explain the lack of agreement between model and experiment, and

Table 2.2: Comparison of metabolite correlations between model and experiments. This table shows the comparison of metabolite correlations obtained from the model and from the experiments (exponential phase). Variance was introduced in the model by adding small random factors to the concentration of the enzymes, simulating biological variability.

Metabolite i | Metabolite j | rexperiment | n | rmodel | n
Fructose-6-phosphate | Glucose-6-phosphate | 0.933 | 9 | 0.986 | 15
Glucose-6-phosphate | Glycerol | -0.850 | 9 | -0.764 | 15
Glucose | Glucose-6-phosphate | 0.817 | 9 | 0.561 | 15
Fructose-6-phosphate | Glycerol | -0.783 | 9 | -0.761 | 15
Glucose | Glycerol | -0.767 | 9 | -0.396 | 15
Fructose-6-phosphate | Glucose | 0.650 | 9 | 0.575 | 15
Glucose | Phosphoenolpyruvate | -0.543 | 6 | 0.157 | 15
Glucose-6-phosphate | Phosphoenolpyruvate | -0.543 | 6 | 0.121 | 15
Fructose-6-phosphate | Phosphoenolpyruvate | -0.371 | 6 | 0.154 | 15
Glycerol | Phosphoenolpyruvate | 0.086 | 6 | -0.182 | 15

is encouraging. However, the model needs to be challenged by further validation with new data. If it withstands those challenges, then it can become a building block for the in silico yeast.
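The validation procedure, perturbing the mean enzyme concentrations by small uniform random factors and checking whether the simulated metabolite correlation pattern matches the experimental one, can be sketched as below. The `steady_state` callable is a hypothetical stand-in for solving the kinetic model (done with Gepasi's scanning facility in the actual study); the linear toy model is purely illustrative and is not the yeast glycolysis model:

```python
import numpy as np

def simulate_replicates(steady_state, e_mean, rel_spread, n_rep, rng):
    """Mimic biological replicates: add uniform random factors to the mean
    enzyme concentrations and compute the metabolite steady state for each
    perturbed enzyme vector."""
    reps = []
    for _ in range(n_rep):
        factors = rng.uniform(-rel_spread, rel_spread, size=e_mean.shape)
        reps.append(steady_state(e_mean * (1.0 + factors)))
    return np.array(reps)                      # (n_rep, n_metabolites)

# toy stand-in for the kinetic model: metabolite levels as a linear
# response to the enzyme levels (illustrative only)
response = np.array([[1.0, 0.8],
                     [1.0, -0.5]])
model = lambda e: response @ e

rng = np.random.default_rng(0)
data = simulate_replicates(model, np.array([1.0, 1.0]), 0.05, 15, rng)
model_corr = np.corrcoef(data, rowvar=False)   # pattern to compare with experiment
```

In the real procedure `rel_spread` would be calibrated so that the simulated CVs of the measured metabolites match the experimental ones, and `model_corr` would then be compared with the correlations in Table 2.2.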

2.6 Conclusions

Measures of correlation between metabolites in replicate profiles can be very informative about the underlying biological system. An earlier study established that metabolite correlations do not necessarily correspond to proximity in the biochemical network (163) and, therefore, cannot be used for the reconstruction of metabolic networks. The analysis here described four different regulatory configurations that are expected to be the origin of metabolite correlations. As was shown, simulations suggest that when the correlations are very strong, they are likely due to chemical equilibrium. An interesting prediction, still to be confirmed, is that metabolites sharing conserved moieties should have high correlations, with at least one of them negatively correlated with the others. However, most high correlations may be due to either 1) stronger mutual control by a single enzyme, or 2) variation of a single enzyme level much above the others. In both cases it is impossible to identify the responsible enzyme from these data alone, though hints can be obtained from the set of metabolites forming correlation cliques (154). The analysis of experimental data from different Saccharomyces cerevisiae growth stages confirmed that this approach can be used to identify changes in the regulatory aspects of metabolism through the identification of significant changes in correlation between metabolites (see Table 2.1). These regulatory hypotheses need to be followed up by collecting more data, with proteomics being the most promising for this. It is hoped that the concepts introduced here will enable better analysis of metabolomics data in the context of systems biology.

Chapter 3

Comparison of reverse engineering methods

Part of the work presented in this chapter has been submitted for publication: Camacho, D., Vera-Licona, P., Mendes, P. and Laubenbacher, R. (2007), Comparison of reverse-engineering methods using an in silico network, Ann. N. Y. Acad. Sci., submitted

“The exact sciences also start from the assumption that in the end it will always be possible to understand nature, even in every new field of experience, but that we may make no a priori assumptions about the meaning of the word understand.” - Werner Heisenberg

3.1 Abstract

The reverse engineering of biochemical networks is a central problem in systems biology. In recent years several methods have been developed for this purpose, using techniques from a variety of fields. A systematic comparison of the different methods is complicated by their widely varying data requirements, making benchmarking difficult. Also, due to the lack of detailed knowledge of real networks, it is not easy to use actual experimental data for this purpose. This work contains a comparison of three reverse engineering methods using data from two artificial networks: a gene regulatory network and an artificial biochemical network that is sufficiently realistic and complex to include many of the challenges that data from real networks pose. In particular, the artificial biochemical system includes proteins and metabolites, in addition to mRNA transcripts, as nodes. To test the efficiency of the reverse engineering methods being compared, common measures of system identification correctness, such as accuracy, the rate of true positives and the rate of true negatives, were determined. Since experimental data contain intrinsic noise, the study also examined the degradation of the results obtained by each method as noise was added to the data, sampled from a normal distribution with mean 0 and standard deviation varying between 1% and 50%. The results show that, on the data set used here, methods based on genetic perturbations of the network (RSA and NIR) tend to perform slightly better than methods that rely on statistical measures for network identification.

3.2 Introduction

A major challenge of computational systems biology is the reconstruction of biological networks directly from data, particularly from large-scale ‘omics’ studies. Several methods have been proposed using a variety of theoretical frameworks, such as statistical analyses (130; 147), machine learning (see 202; 203; 145), chemical kinetics (139; 144), metabolic control analysis (204; 134; 135; 136), or algebra (87), among others (see 205, for a review). These algorithms differ widely in terms of the type and amount of data that they need as

Figure 3.1: Gene network with 10 genes. This network was generated using the software package by Mendes et al.(206). It contains 3 environmental perturbations (P1, P2 and P3) that directly affect the expression rate of some genes (G1, G2 and G5, respectively) and all of the other genes indirectly. Arrow ends mean activation and blunt ends inhibition of the transcription rate.

input, many of them requiring completely different experiments from the others. The algorithms also differ in terms of what information they provide. Some aim to reconstruct only the wiring diagram of the network, i.e., information about which network nodes influence which others. Other algorithms provide dynamical models, including information about variable dependencies. Typically, the performance of each method is demonstrated using available experimental data or simulated data. To date, no systematic comparison of available methods has been done, in part because such a comparison faces several challenges. It would, of course, be desirable to use an in vivo or at least in vitro network to generate the data. The two main obstacles to doing so are the difficulty of performing all the needed experiments on a realistic-size network to fulfill the differing requirements of the various methods, and the lack of detailed knowledge of the network to be reconstructed. A further complication is the absence of a rigorous understanding of the data requirements of the different methods and of a universal method to measure the quality and information content of a given data set. If a simulated network is to be used, then it is important to incorporate several realistic features, such as realistic size, presence of noise, different molecular species, or different time scales. Mendes et al. (206) describe a software package to generate artificial gene networks that can be used for the comparison of reverse engineering algorithms. These networks, however, are simple gene-gene interaction networks and do not satisfy some of the requirements for a realistic comparison. They are particularly ill-suited for those reverse engineering algorithms that are focused on networks including metabolic reactions (i.e., when some network interactions are based on mass transfer).
Unfortunately, metabolic networks have numerous constraints of a chemical and thermodynamic nature, and no generator equivalent to that of Mendes et al. (206), but for general biochemical networks, seems to exist that conforms to those constraints. Nevertheless, I will use a gene network with 10 nodes, generated using the software of Mendes et al. (206), to illustrate how a comparison can be performed (Figure 3.1). I will then demonstrate how these methods perform on a more realistic biological network model (Figure 3.2) that incorporates several realistic features needed for the assessment of reverse engineering algorithms. This model network integrates metabolic, signaling and gene regulatory interactions. It is of a modest size (20 genes, 23 protein forms, and 16 metabolites; a total of 59 molecular components) but still larger than most of those used previously to determine reverse engineering algorithm performance. The network contains structures that are common in real biochemical systems but usually absent from test networks, including moiety-conserved cycles of co-factors, an electron transport chain, a substrate that fulfills two roles (energy and reducing equivalents), proteins that are substrates, receptors and transcription factors (not just enzymes), and a small regulatory network (see Figure 3.2). Such a level of complexity provides a more realistic platform to accurately test reverse engineering algorithms and identify their strengths and weaknesses when dealing with complex networks. To compare the three different reverse engineering methods, the data used relate to the gene transcript levels only. Two of the methods rely on data obtained by perturbing the system around a reference steady state (134; 135; 137) and the other method uses time series data (147). An important criterion for the particular choices made was the availability and relatively easy usability of a software implementation of the algorithm.
Therefore, other methods, with different mathematical frameworks and different modeling techniques, though valuable in the context of reverse engineering, will not be used, because they are not easily applied by an experimentalist.

3.2.1 In silico networks

In Section 1.3 I discussed the use of in silico models of biological systems as an important tool in life sciences research. Many of the methods that attempt to reverse engineer biological networks from data, such as the ones presented here, benefit greatly during their development from the use of synthetic systems against which the results of the methodology being introduced can be accurately compared. In this work I will focus on two different kinds of networks that were used to compare these methods: (i) gene regulatory networks, in which the interactions between genes are phenomenological and represent the result of the effect of transcription and translation on the regulation of the genes in the network, and (ii) artificial biochemical networks, which represent reactions or events that can be found


Figure 3.2: Synthetic biochemical network. A: A realistic synthetic biochemical network that includes not only the regulatory interactions between genes but also protein translation and metabolic reactions, showing how metabolites are converted into one another and how they can influence gene expression. Genes (circles, light grey) are transcribed into proteins (squares, grey) which convert metabolites (diamonds, white). Arrow ends mean activation or end product; blunt ends mean inhibition; dashed interactions represent actions (gene transcription or metabolite conversion). B: For the purpose of the comparison being performed, only the known gene regulatory network (which does not account for the indirect effects that may be present due to the projection of the protein and metabolite spaces into the gene space) will be compared and used as a measure of accuracy.

in biochemical systems. For the purpose of the comparison of reverse engineering methods that I will tackle in this chapter, only gene regulatory interactions will be discussed, even though in one case (the artificial biochemical network) these interactions are a consequence of gene regulation by proteins encoded by other genes as well as the effect of metabolites on gene regulation.

Gene networks

Gene networks are phenomenological abstractions of the interactions and regulation between genes, translated into the gene space (133). As discussed in Section 1.5, gene regulatory networks do not contain information on protein complexes that activate or inhibit the expression of a given gene, or on the cases in which a metabolite affects the expression level of a given gene. Therefore, the interactions that are present in the gene network representation are a projection of all of the levels of cellular organization into the gene space, which implies that some of the interactions present in the gene network are merely indirect interactions between variables (133). Biological networks (and, therefore, gene networks) can present different topologies (see Section 1.3.2). Mendes et al. (206) have presented a software package that can generate gene regulatory networks with any number of nodes and with random, scale-free or small-world topologies. Using this software package, a network with 10 genes and a scale-free topology (as proposed by Barabási et al.; 62; 63; 64; 65; 52; 56) was generated (see Figure 3.1). This network is affected by environmental perturbations, which affect the transcription rate of the gene on which they act directly and whose effect is propagated through the network by the interactions between the genes.
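As a rough illustration of how a scale-free topology arises, the following is a minimal preferential-attachment sketch. It is not the generator of Mendes et al., which additionally assigns kinetic laws and activation/inhibition signs to the interactions; it only produces the wiring:

```python
import random

def preferential_attachment(n_genes, seed=0):
    """Grow a network one gene at a time; each new gene connects to an
    existing gene chosen with probability proportional to its degree,
    yielding an approximately scale-free degree distribution."""
    rng = random.Random(seed)
    edges = [(0, 1)]          # seed network: two connected genes
    pool = [0, 1]             # each gene appears once per edge endpoint
    for new in range(2, n_genes):
        target = rng.choice(pool)
        edges.append((target, new))
        pool += [target, new]
    return edges
```

Hubs emerge because high-degree genes appear more often in `pool` and are therefore more likely to attract connections from newly added genes.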

Artificial biochemical system

The artificial biochemical system that I present here consists of 3 different levels of biological organization: gene transcription, protein translation, and metabolite synthesis and degradation (see Figure 3.2A). Even though this is an in silico network, it aims at mimicking structures commonly found in biology. The network has 2 sources of material that enter the system, metabolites M1 and M23. One of these metabolites (M1) is responsible for the activation of one of the genes in the network (G15), which, in turn, acts as a transcription factor for several other genes. Another of the metabolites in the system (M3) acts as an activator of G17. At the protein level, post-translational modifications are introduced into the model, with 2 proteins (P15 and P18) that are modified by 2 metabolites (M1 and M3, respectively), thereby originating 2 “new” protein forms, P22 and P23, respectively. Also incorporated into the model is the possibility of a protein having two different states (similar to what happens with phosphorylation and dephosphorylation of proteins in signaling cascades), with P20 being “activated” by P2 in the presence of M4; its active form, P21, catalyzes the conversion of M9 to M10. This level of

Table 3.1: Perturbations applied to the gene network. After the system is allowed to reach a steady state an experiment is performed in which an environmental perturbation is applied to the system.

Experiment | P1 | P2 | P3
1 | 0 | 0 | 0.1
2 | 0 | 0.5 | 0
3 | 0 | 0.5 | 0.1
4 | 0.1 | 0 | 0
5 | 0.1 | 0 | 0.1
6 | 0.1 | 0.5 | 0
7 | 0.1 | 0.5 | 0.1

complexity in models to be used for reverse engineering approaches should be included in the formulation of good benchmarks for problems of this sort. Because the methods that I will be comparing work best for gene networks, I will only be considering the gene regulatory network presented in Figure 3.2B. This is not an accurate depiction of the gene network for this system, but the interactions represented are true interactions and will be the basis of comparison across the methods. For a description of how to project the protein and metabolite spaces into the gene space I refer the reader to the work of Alberto de la Fuente (207).

3.2.2 Reverse engineering algorithms

The range of reverse engineering methods is very large, from statistical applications to more complex mathematical frameworks (see Section 1.5). In this chapter I will compare three of the existing methods (Section 3.3), using software provided by the algorithms’ developers. In particular, I will analyze an approach based on correlations of the system variables (147), a methodology that uses singular value decomposition combined with robust regression (137), and an approach that borrows concepts from metabolic control analysis (134; 135). Each of these methods has different requirements in terms of data type and amount. Correlation-based methods need either time course data resulting from the response of the dynamical system to a perturbation (130), with enough time points sampled to cover the response accurately, or a large number of independent samples (147). The methods by Gardner et al. (137) and de la Fuente et al. (134; 135) require steady state data resulting from small perturbations, made to each of the variables in the system to be inferred, around a reference steady state. For details on how these methods are implemented and where software can be obtained please see Section 3.3.

Regulatory strength analysis

This method, proposed by de la Fuente et al. (134; 135), is similar to the one proposed by Kholodenko et al. (136). The method relies on data obtained from experiments that perturb the expression rate of each gene, preferably by lowering it. One way this can be achieved is by generating heterozygous populations, with as many heterozygous populations as there are genes of interest in the network to be inferred. Using a simulated network one can perform such experiments easily. For this study, the rate of transcription of each gene in both the gene regulatory network and the artificial biochemical network was set to half of its reference steady state value and a new steady state (one for each perturbed gene) was obtained. The method determines how the genes interact with one another by borrowing concepts from metabolic control analysis. First, the co-control coefficients are calculated by determining the ratios between the different mRNA levels. By inverting the co-control coefficient matrix one then obtains the regulatory strengths matrix, which provides information on how the variables interact with each other. Because all of the ratios between all of the variables are calculated numerically, the resulting regulatory strengths matrix is highly dense (in fact, there are no zero entries as a result of the calculations). However, biological networks can be expected to be sparse. Therefore, one can set a threshold value, assumed to reflect the noise level, below which entries are discarded. This is discussed by de la Fuente et al. (134).
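The matrix bookkeeping can be sketched as follows. This is a schematic simplification in which the co-control coefficients are approximated by ratios of log-changes; the exact definitions and normalization are those of de la Fuente et al. (134; 135):

```python
import numpy as np

def regulatory_strengths(x_ref, x_pert, threshold=0.05):
    """Sketch of regulatory strength analysis.

    x_ref:  (n,) reference steady-state mRNA levels
    x_pert: (n, n) matrix; column j holds the steady state measured after
            halving the transcription rate of gene j
    """
    # log-response of gene i in the experiment that perturbed gene j
    dlnx = np.log(x_pert) - np.log(x_ref)[:, None]
    # co-control: response of gene i relative to the perturbed gene j
    # (dividing column j by its diagonal entry dlnx[j, j])
    co_control = dlnx / np.diag(dlnx)
    strengths = np.linalg.inv(co_control)
    # the inverse is dense; enforce the expected sparsity by zeroing
    # entries below the assumed noise level
    strengths[np.abs(strengths) < threshold] = 0.0
    return strengths
```

The thresholding step reflects the point made above: the numerically computed matrix has no exact zeros, so sparsity has to be imposed by treating small entries as noise.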

Reverse engineering by multiple regression

The method proposed by Gardner et al. (137) is very similar to the regulatory strengths methods of de la Fuente et al. (134; 135), as well as to the method developed by Yeung et al. (139), which reconstructs genetic networks by linear regression of the data. The method takes perturbations made to the variables in the system (here, the genes) around a reference steady state and, assuming sparsity of the biological system, recovers the network by multiple regression of the data against a generalized model of a Jacobian approximation. The sparsity of the interaction matrix is an essential assumption of the method, as it considerably reduces the number of interactions to be estimated. As an example, in a network with 20 variables the total possible number of interactions is 400 (N², where N is the number of variables). If one limits the average number of outgoing connections per variable (thereby limiting the number of possible targets of each variable) to l, with l << N, then the total number of connections present in the network will be 20 × l. This implies that 400 − (20 × l) interactions are considered 0, making the problem much simpler to solve. The assumption of sparsity in biological networks has been introduced before for both genetic (208) and protein networks (209). The work by Thieffry et al. (208) shows that the transcriptional regulatory network of E. coli has a relatively low connectivity, being composed essentially of small interconnected subnetworks, with the mean connectivity (the number of connections exhibited by each gene) being between 2 and 3. These results may hint at a scale-free topology for E. coli, as is the case for yeast and Helicobacter pylori (209). As with Gardner et al. (137), Tegnér et al. (138) and Yeung et al. (139) also assume a sparse biological network in their reverse engineering approaches.

Table 3.2: Perturbations applied to the biochemical network. After the system is allowed to reach a steady state an experiment is performed in which an environmental perturbation is applied to the system.

Experiment | M1 | M23
1 | 0.2 | 0.4
2 | 1 | 0.4
3 | 0.2 | 0.01
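The core of this sparse-regression idea, exhaustive least-squares fits over small regulator subsets for each gene, can be stripped down to the following sketch. It is a schematic version of the NIR idea; the published algorithm of Gardner et al. additionally handles measurement noise and significance filtering:

```python
import itertools
import numpy as np

def sparse_row(dx, u, l):
    """Recover one row of the interaction matrix A in A·dx ≈ −u at steady
    state, assuming at most l nonzero entries in the row.

    dx: (n, m) steady-state expression changes in m perturbation experiments
    u:  (m,)  perturbation applied to this gene in each experiment
    """
    n, _ = dx.shape
    best_row, best_err = None, np.inf
    # try every size-l regulator subset; keep the best least-squares fit
    for subset in itertools.combinations(range(n), l):
        regressors = dx[list(subset), :].T          # (m, l)
        coef, *_ = np.linalg.lstsq(regressors, -u, rcond=None)
        err = float(np.sum((regressors @ coef + u) ** 2))
        if err < best_err:
            best_row = np.zeros(n)
            best_row[list(subset)] = coef
            best_err = err
    return best_row
```

Restricting each row to l regulators is exactly what reduces the problem from N² unknowns to N × l, as described above, at the cost of an exhaustive search over subsets.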

Partial correlations

One method proposed for the reconstruction of genetic networks was presented by de la Fuente et al. (147) and uses the concept of partial correlations, itself based on ideas from path analysis (210; 211). Methods for the inference of biological network structures based on correlations have been proposed before in the context of metabolomics (see Chapter 2 and 163; 164). As discussed earlier, one cannot determine causality from correlation analysis alone (163), with the observed correlations being linked to the regulation of the system (164). Partial correlations identify interactions between species by calculating all of the possible correlations and eliminating “false” correlations by conditioning the calculation of the correlations on each of the variables. As with other correlation-based approaches, the number of data points should be large, to allow for good statistical inference. This requirement for many independent samples makes it somewhat unfeasible to apply the method well within a single experiment. However, since this method can take any kind of observational data, one can collect any data (from the same system) and use all available data to perform better inferences. These ideas have been explored by de la Fuente et al. (147).
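The conditioning step, computing each pairwise correlation given all remaining variables, can be obtained compactly from the inverse of the correlation matrix. This is a sketch of full-order partial correlations only; de la Fuente et al. also examine lower-order partial correlations:

```python
import numpy as np

def partial_correlations(data):
    """Full-order partial correlations from observational data.

    data: (n_samples, n_vars). With P = inv(R), the partial correlation of
    variables i and j given all others is -P[i, j] / sqrt(P[i, i] * P[j, j]).
    """
    R = np.corrcoef(data, rowvar=False)
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    pcor = -P / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# in a chain X -> Y -> Z, corr(X, Z) is high, but the partial correlation
# of X and Z given Y is near zero, removing the "false" X-Z edge
```

This is also where the large-sample requirement shows up: the correlation matrix must be estimated well enough to be stably inverted, which is difficult with the replicate counts of a single experiment.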

3.2.3 Benchmarking and reverse engineering

In order to effectively and honestly compare reverse engineering methods, there is the need for a benchmark network against which the results of each method's application can be compared. Benchmarks have been proposed for other problems, such as the estimation of kinetic parameters (see 212; 194). The purpose of a benchmark is to independently, without favoring any particular application, objectively assess the performance of

Table 3.3: Data requirements in reverse engineering methods. The different reverse engineering methods available have different data requirements, both in the type and in the amount of data needed. This makes a comparison between methods difficult. RSA: Regulatory strengths analysis; NIR: Multiple regression; PC: Partial correlations.

Method   Data requirements
RSA      Small perturbations, around a reference steady state, of all system variables
NIR      Small perturbations, around a reference steady state, of system variables; hypothetically not all variables need to be perturbed (see 137)
PC       Any kind of data, provided it is sufficient for good statistical estimation of correlations between variables

the application being tested, thereby enabling a means of comparison among different methods. In the particular case of the reverse engineering of networks with biological significance, a benchmark provides a good test for any methodology being proposed. The benchmark should be realistic and broad enough that any method's prerequisites and specifications can be easily met without losing realism. In silico networks provide the best choice for benchmark development, since any kind of experimental setup can be devised to meet the intended specifications. The artificial biochemical network presented in this chapter has been proposed as a valid benchmark for reverse engineering problems (Mendes, unpublished work). The work that follows illustrates how the kind of synthetic networks proposed here can be used to assess the strengths and weaknesses of each of the methods analyzed.

3.3 Model

3.3.1 Computational

All simulations were performed using the COPASI software package (213), run under Linux (Slackware distribution), Windows 2000 (Microsoft Corp., Redmond, WA) and Mac OS X (Apple Inc., Cupertino, CA). MATLAB (The MathWorks, Natick, MA) was run under Windows 2000. The software for each of the methods was obtained either from the authors (Alberto de la Fuente kindly provided the software for the regulatory strengths method; Tim Gardner kindly provided MATLAB code for the multiple regression algorithm) or from publicly available websites (Ometer, for the partial correlations algorithm: http://mendes.vbi.vt.edu).

Table 3.4: Confusion matrix for gene networks. A confusion matrix is a good way to visualize a given method's efficiency. A confusion matrix for a gene network is proposed here, identifying the entries of the matrix that correspond to the estimated numbers of true and false positives and negatives. For the definitions of true positives, false positives, true negatives and false negatives in the context of gene networks, please refer to the text. TP: True positive; FP: False positive; TN: True negative; FN: False negative.

                Predicted
Actual      +      0      -
  +         TP     FN     FP
  0         FP     TN     FP
  -         FP     FN     TP

3.3.2 Genetic perturbations

The methods by de la Fuente et al. (134; 135) and Gardner et al. (137) require that the system be perturbed at the gene level, by over-expression or repression of gene expression. For this study I opted to repress gene expression by manipulating the rates of synthesis of the mRNAs, setting them to half of their original value. Each rate was perturbed independently and the corresponding steady state for the new phenotype was recorded. This procedure was performed for all of the genes in each model network (10 for the gene network and 20 for the artificial biochemical network). In the end, a total of 1 + N steady states, where N is the number of genes in the network, is recorded.
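The perturbation protocol can be sketched on a toy kinetic model (the rate law, parameters and network below are hypothetical, chosen only for illustration; the actual simulations were performed with COPASI): each mRNA synthesis rate is halved in turn and the new steady state recorded, yielding 1 + N steady states.

```python
import numpy as np

def steady_state(v, k, W, dt=0.05, steps=5000):
    # Crude Euler integration of a toy gene network,
    #   dx_i/dt = v_i * sigmoid((W x)_i) - k_i * x_i,
    # run long enough to reach an (approximate) steady state.
    x = np.ones_like(v)
    for _ in range(steps):
        x = x + dt * (v / (1.0 + np.exp(-(W @ x))) - k * x)
    return x

rng = np.random.default_rng(0)
n = 5                                   # hypothetical 5-gene network
v = rng.uniform(0.5, 1.5, n)            # mRNA synthesis rates
k = rng.uniform(0.1, 0.5, n)            # mRNA degradation rates
W = 0.3 * rng.normal(size=(n, n))       # regulatory weights (made up)

states = [steady_state(v, k, W)]        # reference steady state
for i in range(n):                      # repress each gene in turn
    v_p = v.copy()
    v_p[i] *= 0.5                       # halve the synthesis rate
    states.append(steady_state(v_p, k, W))
# states now holds the 1 + N steady states consumed by RSA/NIR
```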

3.3.3 Environmental perturbations

Environmental perturbations were also performed, in which external parameters of each of the network systems were varied. After an initial steady state was achieved with these parameters set to 0, they were perturbed as a constant input (for the gene regulatory network) or as a pulse (for the artificial biochemical network). This type of perturbation attempts to mimic experimental cases in which the time response of a system is followed after a relatively large perturbation is applied. Tables 3.1 and 3.2 show the different perturbations applied to the gene regulatory network and the artificial biochemical network.

Table 3.5: Choosing the right parameters. The effectiveness of the methods discussed here is very dependent on the parameter value chosen for each of these methods. The best parameter values are chosen as those that yield the largest area under the ROC curve (see ROC curves, Figure 3.3). RSA: Regulatory strengths analysis; NIR: Multiple regression; PC: Partial correlations; NCD: Number of connections discovered; ACC: Accuracy; TPR: True positive rate; TNR: True negative rate. For parameter descriptions see text.

Method   Parameter   NCD   ACC (x100%)   TPR (x100%)   TNR (x100%)
RSA      0.01        49    0.7647        0.5510        0.9623
         0.1         36    0.8037        0.6111        0.9014
         1           17    0.8276        0.7647        0.8384
NIR      2           20    0.771         0.550         0.816
         3           30    0.746         0.500         0.833
         4           40    0.652         0.359         0.813
PC       0.01        14    0.709         0.167         0.779
         0.1         16    0.706         0.214         0.782
         1           56    0.463         0.154         0.738

3.3.4 Adding noise

To test the efficiency of each method and how it degrades in the presence of experimental error, noise sampled from a normal distribution N(0, σ) was added to the data, with the standard deviation σ ranging from 1% to 50%. The sampling of the distribution was done using R (214), run under Mac OS X (Apple Inc., Cupertino, CA).
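The noise model can be sketched as follows (a minimal numpy version of what was done in R; I assume here that the stated percentages are relative standard deviations applied multiplicatively, so that 5% noise means σ = 0.05 of the signal):

```python
import numpy as np

rng = np.random.default_rng(1)

def add_noise(data, sigma):
    # Multiply each measurement by (1 + e), with e ~ N(0, sigma),
    # so sigma is the relative standard deviation of the error.
    return data * (1.0 + rng.normal(0.0, sigma, size=data.shape))

clean = np.linspace(1.0, 10.0, 100)     # hypothetical steady-state data
noisy_5 = add_noise(clean, 0.05)        # 5% noise
noisy_50 = add_noise(clean, 0.50)       # 50% noise
```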

3.3.5 Data requirements

As discussed previously, the data requirements of the methods compared in this study are as different as the methods themselves. This makes a comparison of the accuracy of the methods hard to perform without resorting to mathematical models of biological systems. Table 3.3 provides an overview of the types of data required by each method. With in silico biological and biochemical models one can perform the simulations that fulfill the requirements of each of the methods being compared. I refer the reader to Sections 3.3.2 and 3.3.3 for a description of how each of these simulations was performed.

3.3.6 Method evaluation: measures of correctness

In order to assess how a method performed with respect to recovering the gene regulatory networks, some measure of the method's accuracy must be employed. A confusion matrix was used for this purpose. The confusion matrix shows where a method was confused in solving a given problem by recording the numbers of false positives and false negatives estimated. For this study, these measures are somewhat complex and require care in their definition. In general, true positives are defined as elements uncovered as positive that are positive, true negatives as elements uncovered as negative that are negative, false positives as elements uncovered as positive that are negative, and false negatives as elements uncovered as negative that are positive. In the context of a gene network these definitions need to be re-stated, as three types of discoveries can be made: activations, inhibitions and no interactions. Therefore, for gene networks and in terms of gene-gene interactions, true positives are defined as activations and inhibitions identified that are correct, true negatives as the number of non-interactions correctly identified, false positives as activations and inhibitions identified that are not present in the network, and false negatives as the number of activations and inhibitions that were not identified. As in the general case, these discoveries can be visualized with a confusion matrix (Table 3.4), which illustrates how many of the discoveries predicted by any given reverse engineering approach fall under each of the classifying categories described.

Figure 3.3: Receiver operating characteristic (ROC) curves for the gene regulatory network. The best parameter values are selected based on a high sensitivity and specificity, corresponding to the largest area under the ROC curve. (A) Regulatory strengths analysis: threshold = 0.1; (B) Multiple regression: K = 3; (C) Partial correlations: p-value = 0.1.
Because the numbers of true and false discoveries are counted, the confusion matrix allows a quick and easy way to calculate measures of correctness for the reverse engineering methods analyzed. These include (i) the accuracy, given by the percentage of true positives and true negatives in the total number of interactions discovered, (ii) the true positive rate, which gives the percentage of true positives among the positives uncovered, and (iii) the false positive rate, which gives the percentage of false positives among the positives discovered. Similar measures can be computed for the true negative and false negative rates.
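These measures can be computed directly from signed predictions. The sketch below follows the gene-network conventions above (nonzero entries encode activations/inhibitions by sign, zero encodes no interaction) and uses the standard forms ACC = (TP+TN)/total, TPR = TP/(TP+FN) and TNR = TN/(TN+FP):

```python
def confusion_counts(actual, predicted):
    # Entries are +1 (activation), -1 (inhibition) or 0 (no interaction).
    # Per the gene-network definitions in the text: a correctly signed
    # edge is a TP, a correct non-edge a TN, a missed edge a FN, and any
    # predicted edge that is absent or wrongly signed a FP.
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 0 and a == 0:
            tn += 1
        elif p == 0:
            fn += 1
        elif p == a:
            tp += 1
        else:
            fp += 1
    return tp, tn, fp, fn

def correctness(actual, predicted):
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    total = tp + tn + fp + fn
    return {"ACC": (tp + tn) / total,
            "TPR": tp / (tp + fn) if tp + fn else 0.0,
            "TNR": tn / (tn + fp) if tn + fp else 0.0}
```

For example, `correctness([1, -1, 0, 0, 1], [1, 0, 0, 1, -1])` counts one TP, one TN, two FPs and one FN.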

3.4 Results

3.4.1 Gene network

The methods being compared have been shown to work well in the inference of gene regulatory networks. Each method depends on a single parameter which, based on the method's assumptions, defines the interactions between the genes and, therefore, commands the success or failure of the method. For the regulatory strengths method (RSA) this parameter is a threshold value that defines which interactions between variables should be considered zero (or non-existent). The underlying assumption is that regulatory strengths with a value below that threshold fall below the noise level and can, therefore, be considered zero (134). For the multiple regression method (NIR) this parameter, K, represents the maximal number of incoming interactions (i.e., effects that other variables have on a particular variable) allowed for each variable, in order to satisfy the assumption of sparseness of the gene network (137). Finally, in the case of the partial correlations (PC) approach, this parameter is the p-value for considering a correlation significant (147). These parameters are problem specific and therefore require a search for their best value for the problem at hand. For the gene network used in this comparison, the parameters were set to 0.1 for the RSA method, 3 for the NIR method and 0.1 for the PC method. Table 3.5 shows the accuracy for some of the parameter values searched. The parameter values are also selected by analyzing how the specificity and the sensitivity of each method change as the corresponding value is changed (or, similarly, one can plot the true positive rate against the false positive rate as the parameter value is changed). The resulting curve is termed the receiver operating characteristic (ROC) curve – see Figure 3.3.
In machine learning applications, the area under the ROC curve is usually used as a summary statistic: it equals the probability that, when a positive and a negative example are picked at random, the discrimination will give a higher score to the positive one. The best parameters will be used throughout this study, not only for assessing how accurate

Table 3.6: Effect of noise on the performance of reverse engineering methods. Noise was added to the data by sampling a normal distribution with mean 0 and varying standard deviation. RSA: Regulatory strengths analysis; NIR: Multiple regression; PC: Partial correlations; NCD: Number of connections discovered; ACC: Accuracy; TPR: True positive rate; TNR: True negative rate. For parameter descriptions see text.

Method   Noise (%)   NCD   ACC (x100%)   TPR (x100%)   TNR (x100%)
RSA       0          36    0.8037        0.6111        0.9014
          1          38    0.8019        0.6053        0.9118
          5          80    0.4112        0.2821        0.8148
         10          80    0.4000        0.2533        0.8333
         20          88    0.3043        0.1772        0.7778
         30          87    0.3304        0.1842        0.8571
         40          73    0.3782        0.1538        0.7609
         50          54    0.5167        0.1915        0.8030
NIR       0          30    0.7456        0.5000        0.8333
          1          30    0.7456        0.5000        0.8333
          5          30    0.7304        0.4667        0.8235
         10          30    0.7328        0.4643        0.8372
         20          30    0.7094        0.4138        0.8161
         30          30    0.7179        0.4286        0.8276
         40          30    0.7478        0.5000        0.8471
         50          30    0.7034        0.3929        0.8182
PC        0          16    0.7063        0.2143        0.7818
          1           6    0.7422        0.1667        0.7705
          5          14    0.6929        0.1429        0.7611
         10          12    0.6977        0.0000        0.7692
         20          12    0.6977        0.0000        0.7692
         30          12    0.7109        0.1000        0.7759
         40          12    0.6899        0.0000        0.7607
         50          16    0.6797        0.0714        0.7679


Figure 3.4: Accuracy of reverse engineering methods for the gene regulatory network. The accuracies of the reverse engineering methodologies analyzed here are intimately dependent on the presence or absence of noise. Since the accuracy takes into account the number of true positives and true negatives discovered in the total number of interactions found, and since the number of true negatives is much larger than the number of true positives, the methods that identify the fewest connections will show an overall better accuracy, as can be seen in panel (C). (A) Regulatory strengths analysis: threshold = 0.1; (B) Multiple regression: K = 3; (C) Partial correlations: p-value = 0.1.

a method is, but also for studying the effect of noise on the performance of each method. It is assumed that a parameter that works well in a given network in the absence of noise is also the most suitable parameter when noise is present.
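The probabilistic reading of the area under the ROC curve mentioned above can be checked with a brute-force sketch over hypothetical scores (ties count half, following the usual convention):

```python
def roc_auc(pos_scores, neg_scores):
    # Fraction of (positive, negative) pairs in which the positive
    # example receives the higher score; ties contribute 1/2.
    # This quantity equals the area under the ROC curve.
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = roc_auc([0.9, 0.8, 0.4], [0.3, 0.2, 0.1])   # perfect separation
```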

Regulatory strengths

The method by de la Fuente et al. (134; 135) relies on small perturbations of the system's variables around a reference steady state in order to reconstruct the underlying interaction network between these variables. As described in Section 3.3.2, in the case of gene networks this is achieved by changing the rates of synthesis of the different genes independently. The mathematical framework of metabolic control analysis that is the basis for the method implies that perturbations applied to a system should be small; de la Fuente et al. (134) use perturbations of the gene expression rate no greater than 10%. For this study, a decrease of 50% in the rate of synthesis of the gene products was performed for each of the genes, after which the system is allowed to reach a new steady state. This perturbation, though much larger than that tried by the authors, seems more realistic in experimental settings, where one can change the rates of expression of genes by RNA interference (see 215 for a short description of the effects and uses of siRNAs, a special case of RNAi). In the absence of noise, this method shows a relatively high accuracy (slightly over 80%) with a relatively small number of connections (about 4 per gene in this case). This implies that the connections it finds are, in their majority, correctly identified as either present or absent. Table 3.6 shows how the method behaves as noise is added to the data. As can be seen, there is a sharp decrease in the accuracy of the method (Figure 3.4A), coupled with the discovery of many more spurious connections. This is also reflected in the decrease of true positives identified (Figure 3.5A), and a consequent increase in the number of false positives discovered. The drop in the rate of true negatives discovered is not as pronounced as for the overall accuracy of the method (Figure 3.6A), but it still reflects the fact that a considerable number of interactions are being discovered (as can be seen in Table 3.6). The interaction matrices calculated for each of the different levels of noise (results not shown) seem to indicate that a new threshold must be determined for each of these instances. This presents a major drawback of the method, as a threshold determined under one experimental condition may prove invalid under similar conditions.

Multiple regression

The method to reverse engineer genetic regulatory networks from data based on multiple regression was proposed by Gardner et al. (137). This method works in a similar fashion to that of de la Fuente et al. presented above, as it takes in data that result from perturbations at the genetic level around a reference steady state. One of the method's main assumptions is that a genetic regulatory network is essentially sparse (i.e., the number of interactions between genes is relatively low), as proposed by (208; 209). Therefore, an assumption on the maximal number of interactions that a gene is allowed to have with any other gene is introduced into the method. The optimal parameter value for this method is equal to 3 and is presented in Table 3.5. As in the previous case, this is the parameter value that will be used to assess how the method copes with different levels of noise in the data. Since this parameter determines how many incoming edges a given gene has, the total number of interactions found will always be K × N, where K is the parameter and N the number of variables in the system. Therefore, irrespective of the presence or absence of noise in the data, the number of connections found will, in this particular case, always be 30. When no noise is present, the performance of this method is comparable to the RSA method, with an overall accuracy of 74.56%. As noise is added to the data, the accuracy of the method does not fluctuate much (see Figure 3.4B). However, since the number of interactions is constant, the number of true positives must oscillate accordingly, as can be seen from Table 3.6 and Figure 3.5B. The same holds for the rate of true negatives (see Figure 3.6B) and the rates of false negatives and false positives (results not shown). These results seem to indicate that only for small amounts of noise (less than 5%) will the method be able to provide a reliable estimation of the regulatory network being inferred.

Figure 3.5: True positive rate for the gene regulatory network and its robustness to noise. As with the accuracy (see Figure 3.4), the rate of true positives discovered by any of the reverse engineering methods depends on the presence and amount of noise. None of the methods performs particularly well in identifying true positives when noise is present in the data. The case of the multiple regression algorithm is intriguing, since the number of true positives correctly identified increases when higher amounts of noise are present, which is unexpected. The partial correlations method never performs better than the other methods analyzed, with an average true positive rate of 19.246%. (A) Regulatory strengths analysis: threshold = 0.1; (B) Multiple regression: K = 3; (C) Partial correlations: p-value = 0.1.

Figure 3.6: True negative rate for the gene regulatory network and its dependence on noise. As before (see Figures 3.4 and 3.5), the rate of true negatives discovered is also dependent on the amount of noise present in the data. The method most sensitive to noise when identifying true negatives is the RSA, with a considerable decrease from 97.24% with no noise to 15.75% at 20% noise. The other methods do not show such a decrease, which does not imply that their overall performance is better (as was seen previously and is presented in Table 3.6). (A) Regulatory strengths analysis: threshold = 0.1; (B) Multiple regression: K = 3; (C) Partial correlations: p-value = 0.1.

Figure 3.7: Rate of true positives for the synthetic biochemical network and its dependence on the noise level in the data. (A) Regulatory strengths analysis: threshold = 0.1; (B) Multiple regression: K = 2; (C) Partial correlations: p-value = 0.1.
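The core selection step of such a sparse regression can be sketched as follows (a schematic reimplementation under my own assumptions, not Gardner et al.'s code): for each gene, every combination of K candidate regulators is fitted by least squares to the steady-state perturbation data, and the best-fitting combination is kept.

```python
import numpy as np
from itertools import combinations

def best_k_regulators(X, u, K):
    # X: (N, M) deviations of the N transcripts from the reference
    #    steady state across M perturbation experiments.
    # u: (M,) perturbation applied to this gene in each experiment.
    # Near steady state, a_i . x + u_i = 0; search for the K-subset of
    # regulators whose least-squares fit leaves the smallest residual.
    N, M = X.shape
    best_row, best_err = None, np.inf
    for idx in combinations(range(N), K):
        sub = X[list(idx)].T                         # (M, K)
        coef, *_ = np.linalg.lstsq(sub, -u, rcond=None)
        err = float(np.sum((sub @ coef + u) ** 2))
        if err < best_err:
            row = np.zeros(N)
            row[list(idx)] = coef
            best_row, best_err = row, err
    return best_row

# Toy check on made-up data with a known sparse row of interactions
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))
true_row = np.array([0.0, 0.8, 0.0, -0.5, 0.0])
u = -(true_row @ X)                                  # consistent perturbations
row = best_k_regulators(X, u, K=2)
```

With noise-free, consistent data the true support is recovered exactly; the exhaustive search over combinations is exponential in K, which is why the sparseness assumption (small K) is essential.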

Partial correlations

The method of partial correlations (147) aims at discovering interactions between variables in a system by calculating the correlation between all of the variables and then removing correlations that may appear due to a common factor. The removal of edges (or interactions) is based on statistical tests performed when the correlations are calculated by conditioning on a given variable. As a consequence, this method requires a large amount of data in order to accurately estimate the correlations between the variables. For the study presented here, the data used were a set of time series corresponding to 8 different perturbation experiments, as presented in Table 3.1. A total of 8 time points per experiment were collected, for a total of 64 data points available to this algorithm. As can be seen from Figure 3.4C, the accuracy of the method with and without the presence of noise does not vary much (between 67% and 75%). However, not only is the number of interactions discovered very low at all times (the highest number of interactions found being 16), but the percentage of true positives discovered is also fairly low (between 0% and 22% of the interactions discovered as positive were correctly identified). On the other hand, the number of true negatives is relatively high throughout the study (always above 75% of the connections uncovered as negative are correctly identified as such), but this relates closely to the low number of total connections being discovered. Since the method is highly dependent on the number of data points available, it can be argued that the data provided to the algorithm were not sufficient for a good estimation of the correlations between the variables. In the paper by de la Fuente et al. (147) the authors use a network of 5 genes from which they simulate 1000 steady states, which yields a very good inference of the network when the algorithm is applied.
However, in the same paper, the authors show that in the presence of noise (50% noise level) the power of the algorithm, which represents the fraction of interactions that were correctly uncovered, for networks with a scale-free topology (the same topology as the one used in this Chapter) is between approximately 20% and 30% (depending on the threshold used).

3.4.2 Artificial biochemical network

It was shown above how each of these methodologies performs in the inference of gene regulatory networks. In the gene regulatory network used previously, all of the represented interactions are actual interactions between the genes and no others are present. However, when these methods are applied to experimental data for the reverse engineering of biologically meaningful gene networks, the actual gene network is seldom known. One approach to assess the actual gene network has been proposed (207), but it implies full knowledge of the Jacobian matrix in order to abstract to the gene space, a condition that is not easily met. The synthetic biochemical network presented here comprises all three levels of biological organization: gene transcription (the gene space), protein translation (the protein space) and metabolite usage (the metabolite space). As in an experimental setting, it will be assumed that the Jacobian for this system is not known. Therefore, interactions in the gene space that are caused by entities in the protein and metabolite spaces cannot be accounted for. However, some gene-gene interactions are known (presented in the gene regulatory network – Figure 3.2B) and the methods will be judged on their ability to uncover these interactions alone (therefore, only the true positive rate will be compared across methods). The parameter values for each of the methods were determined as before.

Regulatory strengths

The first step to reverse engineer the gene regulatory network of the synthetic biochemical system was to assess the best threshold value to be used. This was done in a similar fashion to that described previously. The results showed that, for this case, when the threshold value was increased from 0.01 to 0.1 the percentage of true positives increased from 56.14% to 73.17%, while the false positive rate decreased from 43.85% to 26.83%. Therefore, the threshold value was set to 0.1 for the remainder of the study on the synthetic biochemical network. The results obtained on the data without noise were in accordance with those observed for the gene regulatory network presented earlier, with a rate of true positives discovered of 73.17% (Figure 3.7A). With the addition of noise, the results are also similar to those obtained for the gene regulatory network, with a decay in the number of true positives identified from 73.17% to 3.64% at 50% noise (see Figure 3.7A). Using a published data set, de la Fuente et al. (134) were able to obtain interesting results for the gene network of flower morphogenesis in Arabidopsis thaliana (216). As alluded to previously, the choice of this threshold parameter value dictates the efficiency of the method. It also seems relevant to mention that the amount of noise present in the data may influence the ability of the method to correctly infer networks. If the data do not contain much noise (less than 5% error in the measurements), this method performs relatively well and can be used for the inference of biological networks, acting as a first pass in the process of building dynamical models of the biological system (see Chapter 6).

Multiple regression

Performing an extensive search for the K parameter, as done previously for the gene regulatory network, I found that this parameter is optimal, for this network, when it is equal to 2. The true positive rate for this method, in the absence of noise, is comparable to that of the RSA method (72.50% versus 73.17%), as would be expected since the methods operate in similar fashion. It is when noise is added that the major differences show. In the presence of noise, as was the case for the gene regulatory network, this method does not change its rate of true positives considerably (from 72.50% at the noise-free level to 52.50% at the 50% level). As discussed before, since the number of incoming interactions per gene is forced to be K, the number of true positives fluctuates in the same way the other measures of correctness do (results not shown). This indicates that there is a threshold noise level above which the assignment of interactions between genes appears to be somewhat random.

Partial correlations

For the partial correlations method the data used were the time series of the response of the system to environmental perturbations, as shown in Table 3.2. For each perturbation 50 time points were collected, and the time series for different mutants were measured, for a total of 1221 time points. Each time point in the time courses is taken to be an independent measurement, even if some correlation may be present due to the dynamics of the system (consecutive time points are correlated). Even though the description of the method by de la Fuente et al. (147) focuses on simulation studies representing a large population of individuals with small differences among them (illustrated by sampling the degradation rates of the mRNAs from a normal distribution with a standard deviation of 1%), it is not easy to collect that type of data experimentally. Therefore, the data used in this study are more representative of data collection in a realistic experimental setup. The p-value was searched in the same way as in the gene regulatory network case and found to be 0.1. When applied to the aforementioned data set, the partial correlations method revealed a relatively low number of true positives (only 4.94% on the data with no noise). In comparison with the gene regulatory network, these results seem to indicate that either (i) not enough data points were provided for an accurate estimate of the correlations between the variables or (ii) the dynamical complexity of the system affects the method immensely. As noise is added there is a slight increase in the number of true positives uncovered (with a maximum of 7.95% at 5% noise), but overall the method's performance is very weak (based solely on the true positive rate).

3.5 Discussion

In this chapter I looked at how some of the methods available to reverse engineer biological networks compare, given the same network and data that satisfy each method's requirements. This was by no means a comprehensive comparison of all the reverse engineering methods available to infer genetic or biochemical networks. The premise for the comparison was set on a few requirements: (i) the method had to have been published, (ii) the implementation of the algorithm needed to be easily available, either from the authors or through a website, and (iii) the method would be tried as is, even if the author(s) had improved versions. The methods tested met all of these requirements. Other work has been done, though it is not included in this chapter (217). Other methodologies are also available for solving the inverse problem in the context of systems biology which have not been addressed here. Methods that rely on the identification of regulatory networks based on machine learning algorithms, such as static (e.g. 218) or dynamic (219; 220) Bayesian networks, as well as methods based on mutual information (202; 221), have been shown to have some success. A recent publication (222) compares some of these methods (see also 217). Other types of methods available include those developed in the context of physical chemistry for the reconstruction of reaction mechanisms (see 141; 223). The application of some of these methods (130; 131) to reverse engineer gene regulatory networks should be straightforward, even though the data constraints – both the amount and the type of data needed – are very high (Adam Arkin, personal communication). The results shown in this chapter highlight what may be some of the pitfalls of each of the methods analyzed.
It should be emphasized that the validity and applicability of the methods is not being questioned; rather, their relative performance is being assessed in conditions likely to be seen in concrete experimental settings. The results shown indicate that, when the data provided have very low amounts of noise, RSA and NIR perform rather well, achieving high levels of accuracy and true positives discovered. The partial correlations method performed less well, even though it could be argued that the data provided were not suitable for the optimal performance of the method. Nevertheless, it should be mentioned that the amount of data provided for the partial correlations method far exceeds the amount that would be available in realistic experimental settings, where only a few data points can be collected (due to technical and financial constraints). The fact that the method underperformed in these conditions hints at its difficult applicability to a single research project, though the fact that there are no constraints on data type may help the method if data from similar projects are collected and used in the analysis. The development of reverse engineering methodologies that allow the reconstruction of biological networks from experimental data is a topic of intense research in systems biology. However, the application of such methods is dependent on their effectiveness in correctly identifying these networks. Method development requires, therefore, an in silico model whose interactions are known and which can be used to assess the method's efficiency, accuracy and robustness. It becomes important to develop benchmarks which can provide a fair comparison of the emergent methodologies being developed. In this context, the networks I showed, or similar ones, could be constructed in order to provide a valid and realistic benchmark with which to assess the performance of new reverse engineering methods.
These tests would allow researchers to focus on the problems that their method may have, thereby resulting in the proposition of solid methods that can reverse engineer biological networks.

Chapter 4

Reverse engineering biological networks by least-squares fitting

“Discovery consists of seeing what everybody has seen and thinking what nobody has thought.” - A. Szent-Györgyi

4.1 Abstract

Can the dynamics of a biological system be recovered by a direct fit of a model to a data set? The inference of a Jacobian matrix from a biological data set is a complex and hard-to-solve problem. However, the estimation of the Jacobian elements by direct fitting of a first-order Taylor approximation to the data can provide a means to infer, at first, a phenomenological representation of the biological system, one step closer to the full understanding, at a mechanistic level, of the system at hand. In this chapter this method will be explored in detail, with careful reasoning about the pitfalls and the optimal results that can be obtained by the approach presented here.

4.2 Introduction

One of the goals of systems biology research is the inference of biological networks from functional genomics data, a problem that has been touted as the holy grail of the field (67). For this purpose, several methodologies have been proposed (141; 142; 131; 224; 139; 138; 225; 226), and the efficiency of some of these methods in the inference of different kinds of networks was illustrated in Chapter 3. As described in Section 1.5, the dynamics of a biological system can be described by the Jacobian matrix, which provides information on how each variable in the system depends on the others. Recalling Equation 1.3 we have

 ˙ ˙ ˙  ∂X1/∂X1 ∂X1/∂X2 ··· ∂X1/∂Xn ˙ ˙ ˙  ∂X2/∂X1 ∂X2/∂X2 ··· ∂X2/∂Xn  J =   (4.1)  . . .. .   . . . .  ˙ ˙ ˙ ∂Xn/∂X1 ∂Xn/∂X2 ··· ∂Xn/∂Xn

where \partial \dot{X}_i / \partial X_j shows how variable X_i depends on variable X_j. In biological systems this matrix is expected to be highly sparse (52; 67; 139; 138), which means that the majority of the entries of the matrix will be zero. This means that, on average, the number of interactions a certain variable has with all of the other variables is low. As an example, if we consider a metabolic network, the majority of metabolites has a relatively low number

of reactions in which it participates, as substrate or product, and only a few metabolites, such as ATP or NADH, have a larger ‘interaction’ network. The analysis of this matrix not only provides the interactions between the variables (in a phenomenological way) but can also provide information on the stability of the dynamical system under study (see, e.g., 128). In classical examples of system identification (see 129; 141; 223; 142), one can recover the entire Jacobian matrix by performing small independent perturbations in all of the variables of the system around a reference steady state and following the relaxation of the system back to that steady state. However, it can be argued that such an approach is limited when applied to biological systems, not only due to the large number of variables present in the system, but also due to the potential inability to perform such experiments (as discussed in Chapter 3). Other methods that successfully reconstruct chemical reaction mechanisms (130; 131) require amounts of data that are impractical in most situations. One common kind of experimental design found in functional genomics projects deals with studies relating the time response of the biological system to some environmental perturbation known or expected to have some influence on the behavior of the system (works such as those in 171; 46; 140; 3 constitute examples). With the same goal of reconstructing biological networks from this kind of data, a few methodologies exist (87; 220; 139; 144; 226). The work presented in this Chapter suggests a new methodology for the inference of gene regulatory networks, in which an iterative approach is used to improve the network prediction.
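The classical perturbation idea can be sketched numerically: for a system with a known right-hand side, each Jacobian entry is estimated by perturbing one variable at a time around a chosen state. The two-variable system below is invented for illustration:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Estimate J[i, j] = d(dx_i/dt)/d(x_j) by central finite differences
    around the state x. f maps a state vector to its time derivative."""
    n = len(x)
    J = np.zeros((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

# hypothetical 2-variable system: x1' = -x1 + 2*x2, x2' = -3*x2
f = lambda x: np.array([-x[0] + 2.0 * x[1], -3.0 * x[1]])
J = numerical_jacobian(f, np.array([1.0, 1.0]))
# the zero entry J[1, 0] reflects the sparsity discussed above
```

In an experimental setting f is not available and only noisy relaxation data can be observed, which is exactly why the direct-fitting approach of this chapter is needed.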

4.2.1 Reverse engineering gene networks by least squares fitting

Let’s start by assuming that the rate of expression of gene xi is given by:

\dot{x}_i = \gamma_i \left( \sum_{k=1}^{p} \beta_{ik} y_k + \sum_{j=1}^{N} \alpha_{ij} x_j + C_i \right) \tag{4.2}

where \sum_{k=1}^{p} \beta_{ik} y_k accounts for the effect of environmental perturbations on the system, with p being the number of perturbations applied and \beta_{ik} y_k representing the effect that a certain perturbation k has on variable i. The term \sum_{j=1}^{N} \alpha_{ij} x_j represents how variable i is affected by all other variables in the system. Finally, C_i is regarded as a basal rate of gene expression, interpreted as the rate at which expression of gene i would take place if all of the elements of its rows of the interaction and perturbation matrices were equal to zero. The perturbations applied to the system are not restricted to small perturbations. As is the case in experimental setups, it is often not possible to perform perturbations that would keep the system in the linear domain around the reference steady state (which would justify the use of more classical system identification methods). Therefore, such an assumption


Figure 4.1: 7-gene network used to demonstrate the network inference method. (A) Network generated according to Mendes et al.(206). (B) Data show that genes G4 and G5 have nearly constant (G4) or constant (G5) dynamics, and therefore cannot be used in the inference method. This implies that indirect interactions of genes G6 and G7 with G1 become true interactions and are treated as such in the analysis of the method's performance.

is not taken into account in this approach, where large perturbations, which cause large deviations from the reference steady state, are considered. The estimation of the parameters in Equation 4.2 is made by least-squares fitting of the model to experimental data using global search algorithms, such as genetic algorithms (see 227, for a review), evolutionary strategies (194) or the particle swarm algorithm (228; 229). The estimation is made by minimizing the square of the difference between the model predictions and the data, according to:

Y = \sum_{i=1}^{N} \left( x_i^{*} - x_i \right)^2 \tag{4.3}

where Y is the sum-of-squares value, x_i^{*} is the value of the variable predicted by the model and x_i is the experimentally measured value. The complexity of parameter estimation problems similar to this one has been addressed elsewhere (230; 231; 232). Since no assumption is made about the number of connections that each variable has, all of the elements of the interaction matrix, together with all of the elements of the perturbation matrix and the basal rates, need to be estimated. Therefore, the number of parameters that need to be estimated is, according to Equation 4.2, N^2 + N \times p + N. A typical functional genomics experiment will be limited in the number of data points that are collected, which will be, undoubtedly, less than the number of parameters that need to be estimated. The problem is, therefore, said to be underdetermined.
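As a concrete illustration of the model in Equation 4.2, the sketch below simulates a hypothetical 3-gene cascade under a single constant perturbation and records 25 evenly spaced time points over t = 0 to 5, matching the sampling scheme used later in this Chapter. All parameter values are invented for illustration:

```python
import numpy as np

# Invented parameters for a 3-gene cascade under one perturbation.
gamma = np.array([1.0, 1.0, 1.0])          # scaling factors
alpha = np.array([[-1.0, 0.0, 0.0],        # interaction matrix
                  [ 0.8, -1.0, 0.0],       # gene 1 affects gene 2
                  [ 0.0, 0.5, -1.0]])      # gene 2 affects gene 3
beta = np.array([[1.0], [0.0], [0.0]])     # only gene 1 senses the perturbation
C = np.array([0.2, 0.2, 0.2])              # basal expression rates
y = np.array([10.0])                       # perturbation level (P set to 10)

def rhs(x):
    """Equation 4.2: x_i' = gamma_i*(sum_k beta_ik*y_k + sum_j alpha_ij*x_j + C_i)."""
    return gamma * (beta @ y + alpha @ x + C)

# forward-Euler integration, recording 25 evenly spaced points over [0, 5]
x = np.array([0.2, 0.36, 0.38])            # state before the perturbation
dt, substeps = 5.0 / (24 * 200), 200
series = [x.copy()]
for _ in range(24):
    for _ in range(substeps):
        x = x + dt * rhs(x)
    series.append(x.copy())
series = np.array(series)                  # shape (25, 3): the "time series"
```

With 3 genes and 1 perturbation this model already has 3^2 + 3 + 3 = 15 parameters, so even these 25 samples of 3 variables leave the fitting problem far from trivially determined once noise is considered.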

Confidence network

In a typical functional genomics experiment there are limitations on the number of data points that can be collected. Therefore, the application of reverse engineering methods to such data needs to take this factor into account. As mentioned previously, one approach is to assume that each variable (i.e., each gene) has a low and predefined number of interactions with any other variable (e.g. 139; 138; 137); however, that assumption is not made here. Therefore, the number of parameters that need to be estimated far exceeds the number of data points that can be provided by experiment. The problem is then said to be unidentifiable (see 233). This results in an infinite number of solutions of the parameter estimation problem (and therefore an infinite number of different parameter sets) giving equally good fits. In mathematical analysis one would attempt to derive the complete space of solutions to the problem. Instead, our numerical approach is to sample that space of solutions by using stochastic methods that arrive at a different solution each time. The set of solutions is then analyzed to uncover properties that appear to be invariant. If a parameter always has the same sign in all of the solutions obtained, then we consider that interaction to be truly of that sign (i.e., activation or inhibition). If one considers these interactions alone, then one can build a confidence network that represents the connections that, given a particular data set, appear in all of the solutions searched. Since not enough data points are available for an accurate estimation of the parameters in Equation 4.2, this confidence network will not have the best overall performance. However, by careful selection of follow-up experiments, the overall performance of the network prediction can be improved considerably.
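The sign-invariance idea can be sketched directly: given several interaction matrices obtained from independent stochastic fits, keep only the entries whose sign agrees across all of them. The three small matrices below are invented:

```python
import numpy as np

def confidence_network(solutions):
    """Given a list of estimated interaction matrices (one per stochastic
    fit), keep only the entries whose sign is identical across all fits:
    +1 (activation), -1 (inhibition), 0 otherwise."""
    signs = np.sign(np.array(solutions))
    same = np.all(signs == signs[0], axis=0)
    return np.where(same, signs[0], 0).astype(int)

# three hypothetical fits that agree only on entries (0, 1) and (1, 0)
fits = [np.array([[ 0.1,  2.0], [-1.0,  0.3]]),
        np.array([[-0.2,  1.5], [-0.8,  0.1]]),
        np.array([[ 0.3,  0.9], [-1.2, -0.4]])]
conf = confidence_network(fits)
# conf == [[0, 1], [-1, 0]]
```

A stricter variant could require the sign to agree in, say, 95% of solutions rather than all of them, which would make the confidence network less sensitive to a single aberrant fit.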

Experiment selection

The confidence network obtained as described above can be used as a first-pass estimation of the true network, and is then used to provide an indication as to what experiment could be devised in order to obtain the most information about the network structure. Despite biological networks being known to be robust to perturbations (55; 54), much information about their interaction structure can be gained by performing extreme perturbations on the system: namely, by knocking out genes and applying the same perturbation experiments to the mutant. Independent of the discussion on network topology – see Section 1.3 – it is common knowledge that a few variables in a biological system (genes or proteins) appear to be highly connected while others seem to have very few connections, hinting at the sparseness of the network's interaction matrix that is assumed by other methods.

It is based on these assumptions that a knockout experiment is selected in order to improve the prediction of the biological network. The gene to be knocked out should be the one that would maximize the effect that the perturbation has on the network. The gene chosen is, therefore, the one that has the least number of incoming edges and the largest number of outgoing edges. That would be the gene that, at that stage of our knowledge, is potentially able to affect the largest number of other genes.
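This selection criterion can be sketched on a signed confidence matrix, using the convention (as in Equation 4.2) that entry (i, j) encodes the effect of gene j on gene i; combining the two criteria as out-degree minus in-degree is my simplification of the tie-breaking, not a rule stated in the text:

```python
import numpy as np

def select_knockout(conf):
    """Pick the gene with the most outgoing and fewest incoming edges in a
    signed adjacency matrix conf, where conf[i, j] != 0 means gene j
    affects gene i. Self-loops are ignored, as in the method."""
    A = (conf != 0).astype(int)
    np.fill_diagonal(A, 0)             # drop self-loops
    out_deg = A.sum(axis=0)            # column j: edges leaving gene j
    in_deg = A.sum(axis=1)             # row i: edges entering gene i
    return int(np.argmax(out_deg - in_deg))

# invented 3-gene confidence network: gene index 1 has 2 outgoing edges
# and 0 incoming edges, so it is the knockout candidate
conf = np.array([[0, 1, 0],
                 [0, 0, 0],
                 [1, 1, 0]])
```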

Inference rules

Knockout experiments also carry important information about the nature of the interactions between the variables. If the initial conditions before the application of the perturbation – most likely a steady state – are maintained for the knockout mutant, then one can deduce a number of facts about the network from the comparison of the two initial states (wild type and mutant) and, therefore, about how some genes are related in the network. A set of inference rules can then be devised as follows.

1. Let Gi be the knocked-out gene. If all Gk, with k = 1, ..., i − 1, are not constant, but all Gl, with l = i + 1, ..., n, are constant, then the Gk do not affect the Gl.

2. If all genes are constant as a result of the experiment, then the perturbation affects only the knocked out gene.

3. Let R be the ratio between wt and ko experiments for a given gene at the initial time step:

R = \frac{G_i^{wt}}{G_i^{ko}} \tag{4.4}

Also, let Gk be the knocked-out gene, with k ≠ i. If:

(a) R = 1 then Gk does not affect Gi.

(b) R > 1.5 (or some other threshold) then Gk inhibits Gi.

(c) R < 0.5 then Gk activates Gi.

This set of rules helps constrain the search space of the fitting algorithms in the subsequent runs, when more data are gathered. This is done by setting some of the interaction terms to zero, and constraining others to either only positive or only negative values, according to the results of applying the rules.
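Rule 3 can be sketched as a small classifier. The thresholds follow the rules exactly as stated above; the tolerance around R = 1 is my addition to accommodate noisy measurements, and the return labels are hypothetical names:

```python
def classify_interaction(g_wt, g_ko, hi=1.5, lo=0.5, tol=0.05):
    """Apply rule 3 to the initial-time-step levels of gene i in the
    wild-type (g_wt) and knockout (g_ko) experiments, where gene k != i
    was knocked out. R = G_i^wt / G_i^ko as in Equation 4.4."""
    R = g_wt / g_ko
    if abs(R - 1.0) <= tol:
        return "no effect"       # rule 3(a): Gk does not affect Gi
    if R > hi:
        return "inhibition"      # rule 3(b): Gk inhibits Gi
    if R < lo:
        return "activation"      # rule 3(c): Gk activates Gi
    return "inconclusive"        # between thresholds: leave unconstrained

# example: the wild-type level is twice the knockout level
label = classify_interaction(2.0, 1.0)
```

The "inconclusive" branch corresponds to interactions that the rules leave unconstrained in the next fitting iteration.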

4.3 Methods

4.3.1 Computational

The method was implemented in Copasi (213) and run under Mac OS X (Apple Co., Cupertino, CA), Linux (Slackware distribution) and Windows 2000 (Microsoft Co., Redmond, WA).

4.3.2 Model

The model used to illustrate the method is that of a gene regulatory network with 7 genes (see Figure 4.1) and was generated as proposed by Mendes et al.(206). The network was generated to possess a scale-free topology, even though such topology distinctions are hard to realize with such a small number of genes. In order to obtain data in a realistic manner, the system was simulated as if perturbed only from the outside (as in the case of adding a chemical to the reaction or growth media). At the steady-state condition, which is obtained without the presence of any external chemical (P_i,ext = 0), the system is perturbed and the response of the system to the perturbation is followed and recorded (t_f = 5 and N = 25, where N is the number of time points collected). Different ‘strains’ of the biological system were simulated by making knock-out mutants of some genes (by setting to zero both the initial gene concentration, [G_i]_0, and the transcription rate, k_i,tr), after which the system is allowed to reach a steady state. The perturbations were performed as above. The rate of gene expression for each gene is as described by Mendes et al.(206) and is given by:

" N  n#" M  n# Y Ak Y Ki,l x˙ i = Vi 1 + − kdeg,ixi (4.5) Ka,k + Ak Ki,l + Il k=1 l=1

where x_i is the expression level of a given gene, V_i its rate of transcription, k_deg,i the rate of mRNA degradation and n the Hill coefficient. The rate of transcription is affected by activators, A_k, and inhibitors, I_l, which respectively increase or decrease the rate at which the gene is transcribed. Note that the dynamics of this model are categorically not like those of the equations that we will later try to fit (Equation 4.2); the challenge here is to demonstrate that, despite this, fitting Equation 4.2 to the data produced by Equation 4.5 still uncovers important aspects of the network.
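A minimal instance of this rate law for a single gene, with one activator and one inhibitor, can be sketched as follows. The placement of the Hill exponent on the whole saturation fraction follows the equation as reconstructed above, and all parameter values are invented:

```python
def xdot(x, A, I, V=1.0, Ka=1.0, Ki=1.0, n=2, kdeg=1.0):
    """One-gene instance of the Mendes-style rate law (Equation 4.5):
    transcription modulated by an activator A and an inhibitor I,
    balanced against first-order mRNA degradation."""
    activation = 1.0 + (A / (Ka + A)) ** n
    inhibition = (Ki / (Ki + I)) ** n
    return V * activation * inhibition - kdeg * x

# steady state with no effectors (A = 0, I = 0) is x* = V/kdeg = 1;
# adding activator A = 10 raises production toward V*(1 + (10/11)^2)
x = 1.0
for _ in range(20000):                 # forward Euler to the new steady state
    x += 5e-4 * xdot(x, A=10.0, I=0.0)
```

This is exactly the kind of simulated "perturbation from the outside" used to generate the time series: the effector concentration is stepped up and the relaxation of x is recorded.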

4.3.3 Fitting the model to the data

The estimation of the parameters in Equation 4.2 is made by performing a least-squares fit of the model to the data. The fit is performed with a global optimization algorithm. In this case I have used the particle swarm algorithm (228; 229), run over 10,000 iterations with a swarm size S = 1.5 × Π, where Π is the total number of parameters, rounded up to the nearest integer.
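The particle swarm idea can be sketched as follows. This is a generic textbook PSO, not Copasi's implementation; the inertia and acceleration constants, and the small floor on swarm size for tiny toy problems, are my choices:

```python
import numpy as np

def particle_swarm(objective, dim, n_iter=300, seed=0):
    """Bare-bones particle swarm optimizer. Swarm size follows the text
    (1.5 x number of parameters, rounded up), with a floor of 10 added
    so tiny toy problems still get a usable swarm (my addition)."""
    rng = np.random.default_rng(seed)
    S = max(10, int(np.ceil(1.5 * dim)))
    pos = rng.uniform(-5.0, 5.0, (S, dim))
    vel = np.zeros((S, dim))
    pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((S, dim)), rng.random((S, dim))
        # velocity update: inertia + pull toward personal and global bests
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest

# toy least-squares problem: recover the slope and intercept of a line
t = np.linspace(0.0, 1.0, 25)
data = 2.0 * t + 0.5                       # true parameters: a = 2.0, b = 0.5
sse = lambda p: float(np.sum((p[0] * t + p[1] - data) ** 2))
best = particle_swarm(sse, dim=2)
```

In the actual method the objective would be the sum of squares of Equation 4.3 evaluated by integrating Equation 4.2, and the dimension would be the full N^2 + N × p + N parameter vector.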

4.4 Results and Discussion

The method described here attempts to reconstruct a regulatory gene network (a representation of the regulatory interactions between the genes in a data set) in an iterative process using time series data, in which the inference is improved by adding data and constraints at each iteration step. To help illustrate the method, I will use the 7-gene network shown before (see Figure 4.1). As described, in a ‘wild-type’ condition the system is allowed to reach a steady state in the absence of the perturbation, after which an environmental change is caused in the system by the addition of an effector of one of the genes. In the network shown, P, an activator of G1, is 0 at the reference steady state and is set to 10 to effect a perturbation. The degradation or consumption of this ‘chemical’ is not considered in the method (there is no differential equation for P), and so it is always present in the system. Translated into an experimental setting, this could be achieved by running the experiment in a chemostat, which allows for accurate control of the experimental conditions throughout the entire experiment. After the perturbation is applied, the response of the system is followed and a time series for that response is collected. To mimic experimental conditions, a limitation on the number of data points collected was imposed. For each time series, 25 time points are obtained, spaced evenly. This is a greater number of time points than is usually available from a single functional genomics experiment, but achievable if needed, and nevertheless still well below the number of parameters to be estimated. As demonstrated previously, the number of parameters that need to be estimated in the first step, based on Equation 4.2, is N^2 + N × p + N. For the 7-gene network, on which 1 perturbation is applied, it can easily be seen that the number of parameters to be estimated is 63.

Therefore, with 25 time points, the system is still underdetermined and it can be used to illustrate how the method works in such conditions. After collecting a time series for the system in the wild-type condition, the model is fit to these data. The resulting network is shown in Figure 4.2A. As can be seen, this network has a large number of mistakes, namely a high level of false positives (73.91%). However, the method's premise is that the overall inference performance will increase as more data are collected in follow-up experiments that are chosen based on the results of the previously fitted model (confidence network). Taking into account the inference rules presented previously


Figure 4.2: Network inference for the 7 gene network. The network inferred with only the wild type data (A) contains a number of errors and the overall performance of the inference method is not good. However, as knockout data is added (knockout G2 – B) the performance improves. The iterative addition of knockouts (knockouts G2, G3 and G6 – C) brings the network inference ever closer to the ideal solution.

(see Section 4.2.1), the gene to be knocked out is gene G2, as it has the highest number of outgoing edges (4) and the lowest number of incoming edges (3). Self-loops (either self-activation or self-inhibition) are not counted as outgoing or incoming edges. After selecting the follow-up mutant, the perturbation experiment is performed on this ‘strain’ and the time series collected as before. Again, the inference rules need to be applied in order to assess, from the data, which interactions can be considered activations, inhibitions or non-existent. The data for gene G2, when compared with the data for the wild-type condition, show that G2 has no effect on G1, inhibits both G3 and G6, and activates G7. All this information can be added to the model before the next fitting iteration. The consequence of adding information in such a way is that the parameter space in which to search for possible solutions becomes constrained and reduced: constrained because the search for a parameter can be forced into only a sub-region of the whole space (for activations only the positive side is searched, and for inhibitions only the negative side), and reduced because non-existent interactions imply that the corresponding α_ij from Equation 4.2 become 0 and can be removed from the fit, reducing the number of parameters that need to be estimated. Figure 4.2B shows the resulting confidence network for the wild-type and ∆G2 data. The simple addition of these data, together with the inference rules, caused a decrease in the number of false positives discovered (57.14%) with an increase in the overall accuracy (from 32% to 56%). Using this iterative approach, more knockout experiments were performed. Figure 4.2C shows the resulting confidence network when ∆G2, ∆G3 and ∆G6 mutant data were added to the wild-type data. The resulting network has an overall accuracy of 76%, with 100% of true negatives identified and 40% of false positives. Figure 4.3 shows how the rates of true positives, true negatives, false positives and false negatives, together with the accuracy, improve as more knockout data are added.

Figure 4.3: Improvement of the reverse engineering method by addition of knockout data. The selective addition of knockout data, suggested from the confidence networks, leads to an improvement of the measures of correctness of the method. A. Wild-type data. B. Wild-type + ∆G2 data. C. Wild-type + ∆G2 + ∆G3 data. D. Wild-type + ∆G2 + ∆G3 + ∆G6 data. See Table 4.1 for a detailed description of the rates of true positives, true negatives, false positives and false negatives, together with accuracy.

These results seem to indicate that the methodology presented here is useful in correctly identifying the connections that should not be present in the network (illustrated by the true negative rate). It should be noted that the measures of correctness of the network were made by direct comparison with the original network and not with the reference steady-state Jacobian matrix. Over the response of the system to the perturbation the Jacobian changes across the time series, since it represents a linearization of the system at a specific point. Therefore, the estimation of the parameters of Equation 4.2, because it is made across the entire time series, reflects an average Jacobian of the dynamical system. This Jacobian will not be the same as the Jacobian at the reference steady state and, therefore, it is not useful to compare them.
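The measures of correctness can be computed by comparing edge presence between the inferred and true networks, ignoring self-loops as the method does. These are the standard confusion-matrix definitions, which may be normalized differently from the rates reported in Table 4.1, and the small matrices below are invented:

```python
import numpy as np

def network_metrics(inferred, true):
    """Accuracy and TP/TN/FP/FN rates for off-diagonal edge presence,
    comparing a signed inferred matrix against the true network."""
    mask = ~np.eye(true.shape[0], dtype=bool)     # ignore self-loops
    inf_e = (inferred != 0)[mask]
    true_e = (true != 0)[mask]
    tp = np.sum(inf_e & true_e)
    tn = np.sum(~inf_e & ~true_e)
    fp = np.sum(inf_e & ~true_e)
    fn = np.sum(~inf_e & true_e)
    return {"accuracy": (tp + tn) / mask.sum(),
            "TPR": tp / max(tp + fn, 1), "TNR": tn / max(tn + fp, 1),
            "FPR": fp / max(fp + tn, 1), "FNR": fn / max(fn + tp, 1)}

true = np.array([[0, 1, 0], [1, 0, 0], [0, 1, 0]])       # 3 true edges
inferred = np.array([[0, 1, 1], [1, 0, 0], [0, 0, 0]])   # 2 right, 1 FP, 1 FN
m = network_metrics(inferred, true)
```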

4.5 Conclusions

I showed in this Chapter a methodology aimed at inferring gene regulatory networks from time series data originating from functional genomics experiments. This methodology differs

Table 4.1: Improvement in the measures of correctness as knockout data is added. As shown in Figure 4.3, the overall performance of the method improves as knockout data and data-based constraints are added to the model. All measures are fractions (×100% in Figure 4.3).

Data                    Accuracy   TPR      TNR      FPR      FNR
wt                      0.3200     0.3044   0.5000   0.6956   0.5000
wt + ∆G2                0.6800     0.5000   0.9091   0.5000   0.0909
wt + ∆G2 + ∆G3          0.7600     0.6667   0.8125   0.3333   0.1875
wt + ∆G2 + ∆G3 + ∆G6    0.7600     0.6000   1.0000   0.4000   0.0000

from other available methods in that it does not make any assumptions with respect to the interactions between the variables (i.e., the genes). Also, no prerequisites on the number of data points are imposed. It was shown how the method works when the number of data points is smaller than the number of parameters to be estimated, hence when the problem is underdetermined. By careful selection of follow-up experiments, together with inference rules that reduce the dimensionality of the search space, one can gather more data and, therefore, improve the inference of the network. In the estimation of the regulatory interaction network parameters, global optimization algorithms were used, namely the particle swarm algorithm. This class of algorithms is designed to find the global minimum in a parameter space, thereby allowing for the discovery of the best least-squares solution by avoiding local minima. These algorithms are stochastic in nature and, therefore, several runs should be performed in order to confidently make assumptions on the values of the parameters estimated. Even when the problem at hand is fully determined or overdetermined (i.e., there are as many or more data points than parameters to estimate), global optimizers need to be run several times in order to get a good (statistical) assessment of the parameter values determined. Results not shown illustrate that, for this particular method with 3 times more data points than parameters to estimate, the global optimization algorithms recover the majority of the interactions correctly, but a few connections are wrongly identified. Nevertheless, it should be kept in mind that the inference method is attempting to recover the regulatory interaction network of a non-linear dynamical system (Equation 4.5) by integrating a linear differential equation (Equation 4.2).
Therefore, the network that is inferred should only be viewed as a first approximation to the actual biological network, to which non-linear methods for reaction parameter estimation, such as the one presented by Moles et al.(194), can then be applied to improve the knowledge of the biological system.

Chapter 5

Beyond reverse engineering: applications to experimental data

“I hope we’ll be able to solve these problems before we leave.” - Paul Erdős

5.1 Abstract

In the previous Chapter I showed a new iterative approach to infer gene regulatory networks from time series data. In the scope of systems biology, the cooperation between experimental and theoretical approaches will be fundamental to the advancement of knowledge in the life sciences (as was seen in Section 1.3). The methodology introduced tries to put the concepts of systems biology into practice, resulting in a method that gains from such interplay. In this Chapter I will show how the method works with experimental data by generating a gene regulatory network for a set of 13 genes from Saccharomyces cerevisiae that are known to be involved in the response to oxidative stress. I will also demonstrate how the method can be used in cases where the aforementioned interplay may not be possible (for a wide range of reasons) and build on the results to illustrate the method's utility.

5.2 Introduction

The reverse engineering of biological networks is a complex problem and its application to experiments coming out of functional genomics projects has been limited. Due to constraints related to the kind of data that is needed for a particular method, or to difficulties in the application of the method by experimentalists, reverse engineering relies heavily on the development of mathematical models in order to show the potential of each methodology being proposed (see Chapter 3 for a discussion of the data requirements of different reverse engineering methods and a comparison of their performance using in silico networks). In the age of systems biology, where interdisciplinary teams come together to solve complex biology problems (see Section 1.3), the development of mathematical frameworks and conceptual bases to solve such problems is highly dependent on the ability to generate and obtain data from the experimental facilities. It is this interdependence that is often overlooked in the development of reverse engineering methods, even though some research is indeed moving in that direction (see 202; 234, as examples). A focus of functional genomics and systems biology projects has been the study of the time response of a system to environmental perturbations (140; 46; 171). The data collected consist of a number of time series recordings, with points sampled according to thoughtful experimental design in order to capture much of the system's response. Nevertheless, the number of data (time) points collected is, more often than not, driven by the availability of funds for the project and, therefore, compromises between the number of time points that will be collected and the number of experiments that can be performed must be made. It

is this data constraint that some of the available methods that can handle time series data have difficulty coping with (see 147; 220), as the data requirements of such methods are usually much higher (see Chapter 3). I presented previously (see Chapter 4) a method to reverse engineer regulatory gene networks from time series data in which the number of data points available is smaller than the number of model parameters to be estimated. Through an iterative process, and with the use of carefully designed inference rules, the performance of this method has been shown to improve, allowing the resulting network to be used as a valid starting point for subsequent modeling approaches and experimental designs aimed at accurately reconstructing the regulatory network. The lack of constraints on the method, specifically in the amount of data required to perform the analysis, makes it suitable for application to time series data from functional genomics experiments. I will explore this point in this Chapter.

5.2.1 The yeast response to oxidative stress

Reactive oxygen species (ROS) are responsible for several deleterious effects on biological systems, including genetic degeneration and apoptosis (235; 236). The action of antioxidant enzymes, such as catalase, and of free radical scavengers, such as glutathione and ascorbate, is among the innate responses that cells possess to fight the effects of stress induced by ROS (237). The yeast Saccharomyces cerevisiae, the model organism for eukaryotes, has been at the forefront of research on response mechanisms to oxidative stress (see review in 236). The availability of deletion mutant libraries for Saccharomyces cerevisiae allows for a detailed study, at the genetic level, of the effect that oxidative stress may have on the organism. In a systems biology framework, experiments can be planned and carried out in such a way that the data collected can be used to predict and propose gene regulatory networks associated with the response to the stress. Such networks could then be taken as a basis for further experimental designs to explore and increase the understanding of how an organism reacts to oxidative stress.

5.3 Methods

5.3.1 Experimental setup

The experiments were carried out as described by Martins et al.(238). The Saccharomyces cerevisiae strain BY4743 (MAT a/MAT α his3∆1/his3∆1 leu2∆0/leu2∆0 lys2∆0/+ met15∆0/+ ura3∆0/ura3∆0) and knockout mutants for the Yap1 transcription factor, glutathione peroxidase (1 and 2), glutaredoxin (1 and 2) and glutathione oxido-reductase proteins (yap1∆, gpx1∆, gpx2∆, grx1∆, grx2∆, glr1∆, respectively) were obtained from the American Type Culture Collection. The strains were constructed by the yeast deletion consortium (239; 240) and are derived from the S288C strain. Six replicate cultures were grown in batch-controlled conditions, in New Brunswick BioFlo fermentors, in minimal medium with 4% (w/v) sucrose, supplemented with 40 mg/L uracil, 120 mg/L L-leucine and 40 mg/L L-histidine. Cultures were grown at 30°C, 500 rpm and pH 6.0, and dO2 was kept higher than 80%. In mid-exponential phase (OD600 1.5), a solution of cumene hydroperoxide (CHP) was added to 3 of the cultures in order to obtain a final concentration of 190 µM. Three untreated cultures were used as controls, where CHP was replaced by the addition of ethanol. Samples were collected from the fermentors (both controls and CHP-treated) at defined time points: right before the addition of CHP (t = 0) and at 3, 6, 12, 20, 40, 70 and 120 min. All collected samples were processed for RNA extraction and profiled using Affymetrix Yeast Genome S98 arrays (Affymetrix, Inc., Santa Clara, CA).

5.3.2 Data preparation

The wild-type (wt) and mutant strains were analyzed together in order to compare the stress response across strains. Robust Multichip Average (RMA) (241; 242) was used for microarray data summarization and normalization. To assess the significance of differences between transcripts from the same genotype across two experimental conditions, or between transcripts under the same experimental condition across two genotypes, a 3-way ANOVA model was built gene by gene and run in SAS (SAS Institute Inc., Cary, NC):

$$y_{ijkl} = \mu + T_i + V_j + M_k + (TV)_{ij} + (TM)_{ik} + (VM)_{jk} + (TVM)_{ijk} + \epsilon_{ijkl} \quad (5.1)$$

where $y_{ijkl}$ is the intensity measured on the array for time $i$, treatment $j$ (control or CHP conditions), genotype $k$ (wt or mutant) and replicate $l$; $\mu$ is the overall mean intensity of the gene; $T_i$ is the effect of the $i$th time point; $V_j$ is the effect of the $j$th treatment; $M_k$ is the effect of the $k$th genotype; $(TV)_{ij}$ is the interaction effect between time point $i$ and treatment $j$; $(TM)_{ik}$ is the interaction effect between time point $i$ and genotype $k$; $(VM)_{jk}$ is the interaction effect between treatment $j$ and genotype $k$; $(TVM)_{ijk}$ is the interaction effect between time point $i$, treatment $j$ and genotype $k$; $\epsilon_{ijkl}$ is the residual for time point $i$, treatment $j$, genotype $k$ and replicate $l$. The positive False Discovery Rate (FDR, cut-off 0.05) multiple-testing adjustment (243) was applied to obtain corrected q values (qFDR). The data were not log- or otherwise transformed.
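The adjustment cited above is Storey's positive FDR; as a minimal sketch of the general idea, the closely related Benjamini-Hochberg step-up adjustment (an assumption of this illustration, not the exact qFDR procedure run in SAS) converts a list of per-gene p-values into corrected q values:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg step-up adjustment: returns corrected p-values
    (q-value-like) in the original order of the input list."""
    m = len(pvalues)
    # indices sorted by p-value, ascending
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    # walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvalues[i] * m / rank)
        adjusted[i] = q
        prev = q
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
qvals = bh_adjust(pvals)
# genes kept at the 0.05 cut-off used in the text
significant = [p for p, q in zip(pvals, qvals) if q < 0.05]
```

Note how two p-values below 0.05 (0.041 and 0.042) are no longer called significant after correction, which is precisely the point of controlling the FDR over thousands of genes.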

5.3.3 Computational approach

The method was applied as presented in Chapter 4. All computations were carried out in Copasi (213) under Linux (Slackware distribution), Windows 2000 (Microsoft Co., Redmond, WA) and OS X (Apple Co., Cupertino, CA).

5.3.4 Yeast regulatory network

A collection of 13 genes known to be related to oxidative stress in Saccharomyces cerevisiae was selected to infer a gene regulatory network of the response of the organism to CHP. These genes include the 6 available mutants plus γ-glutamylcysteine synthetase and glutathione synthetase (GSH1 and GSH2), thioredoxin-related genes (TSA2 and TRX2), transcription factor associated proteins (SKN7 and YBP1) and an additional glutathione peroxidase (GPX3). Note that in this example the method was applied after all of the experiments had been carried out. Since mutants were not available for all of the genes selected for analysis, the inference rules needed to be adjusted with respect to the choice of the next experiment: the selection is made only among the available knockout experiments, and even if some other knockout would fit the requirements better than any of the existing ones, it cannot be used (since no new experiments were possible at this stage).

5.4 Results and Discussion

5.4.1 Revamping the inference rules

Recalling Section 4.2.1, the inference method proposed relies on inference rules to make predictions about the nature and sign of the interactions between the variables (i.e., the genes) in the system. These inference rules also allow for the careful selection of genes to be knocked out in order to obtain the maximal possible information concerning false positives and/or false negatives in the confidence network generated from wild-type data. Ideally, this approach should be used iteratively with the experimental approach, in order to reduce the number of experiments to be performed and their cost. However, the experimental design is usually thought of prior to the application of theoretical approaches to the data collected, so the choice of potential knockouts is made based on prior information gathered from the literature. This implies that the gene that, according to the inference rules, should be knocked out to yield the most information may not be one of the available, previously selected, knockouts. Therefore, the inference rules need to be adjusted to cope with the availability of knockouts. Figure 5.1 illustrates this problem with an example. Let us assume that this toy confidence network reflects the network obtained by fitting the wild-type data for a system with 9 genes, labeled G1 through G9, and that only 3 knockout mutants are available, illustrated by the grey color (G2, G4 and G6). If one takes the inference rules as presented earlier, the best choice for a knockout to be performed


Figure 5.1: Revamping the inference rules. This toy confidence network is assumed to be obtained from the wild-type data for a mock 9-gene network. However, only a few of the genes in this network have available knockouts, which implies that the selection of which gene to knock out must be made among these genes, even if some of the genes without available knockouts would be better suited for such an experiment according to the inference rules presented earlier (see Section 4.2.1).

would be G1, as it has only one input (from P) and 4 outgoing edges (to G2, G3, G4 and G5). However, this knockout is not available. The second best choice would be G5, with 2 incoming edges (from G1 and G2) and 4 outgoing edges (to G2, G3, G4 and G6), but this knockout is also not available. Therefore, one must resort to applying the inference rules to the genes that have available knockouts. This leads to the selection, in this toy example, of G2 as the best knockout to be performed, with 3 incoming edges (from G1, G3 and G5) and 4 outgoing edges (to G4, G5, G6 and G7). This approach will prove important in the application of the method to the yeast regulatory network for the response to oxidative stress. As the experiments were performed prior to the application of the reverse engineering methods, the selection of knockouts was based on literature information. Therefore, since a number of genes selected to infer a stress response-related regulatory network do not have available knockouts, the procedure highlighted above will prove its worth.
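The constrained selection of the worked example can be sketched as a few lines of code. The scoring rule used here (out-degree minus in-degree) is an assumption consistent with the G1 > G5 > G2 ranking of the example, not the exact rule of Section 4.2.1:

```python
def degrees(edges):
    """Compute in- and out-degree for every node of a directed edge list."""
    indeg, outdeg = {}, {}
    for src, dst in edges:
        outdeg[src] = outdeg.get(src, 0) + 1
        indeg[dst] = indeg.get(dst, 0) + 1
    return indeg, outdeg

def best_knockout(edges, candidates):
    """Pick the candidate with the most outgoing and fewest incoming edges,
    scored as out-degree minus in-degree (an assumption, see lead-in)."""
    indeg, outdeg = degrees(edges)
    return max(candidates, key=lambda g: outdeg.get(g, 0) - indeg.get(g, 0))

# Toy confidence network of Figure 5.1 (edges as described in the text)
edges = [("P", "G1"),
         ("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"),
         ("G2", "G4"), ("G2", "G5"), ("G2", "G6"), ("G2", "G7"),
         ("G3", "G2"),
         ("G5", "G2"), ("G5", "G3"), ("G5", "G4"), ("G5", "G6")]

best_knockout(edges, ["G1", "G5", "G2"])  # unconstrained ranking favors G1
best_knockout(edges, ["G2", "G4", "G6"])  # only the available mutants: G2 wins
```

Restricting `candidates` to the available mutants is exactly the adjustment described above: the rules are unchanged, only the pool of genes they rank is reduced.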

5.4.2 Yeast response to oxidative stress

The application of reverse engineering methods to experimental data is an important step in method development and validation, where the applicability of a method can be assessed under non-ideal conditions. In the remainder of this Chapter I will demonstrate an application of the method proposed in Chapter 4 to reverse engineer a gene regulatory network for the response


Figure 5.2: Application of the reverse engineering method to the yeast response to CHP data. A: Application of the reverse engineering method presented in Chapter 4 to the wild-type data of the Saccharomyces cerevisiae response to stress induced by CHP. The confidence network obtained indicates that the GRX1 knockout should provide the largest amount of information. B: The addition of the data from the GRX1 knockout mutant brought about a large increase in the number of interactions discovered, not necessarily true positives. This result may indicate that the chosen knockout was not informative as to the effects its deletion would have on the network.

of Saccharomyces cerevisiae to oxidative stress. It should be emphasized that this regulatory network is not the sole regulatory network that reflects the response of the organism to the perturbation (the addition of CHP), since only a small subset of the approximately 6,000 yeast genes is considered. The genes selected have been linked to the response to oxidative stress conditions. A study conducted by Thorpe et al. (244) highlights the effects of deleting genes from the yeast genome when the organism is insulted with different sources of ROS, namely hydrogen peroxide, cumene hydroperoxide and linoleic acid 13-hydroperoxide; these oxidants were chosen based on the different kinds of ROS that they generate. The authors find that the deletion of genes such as GLR1, GPX3, TRX2, YAP1 and SKN7, among others, produces mutants that are sensitive to at least one of the oxidants. These genes are included in the data set that I will analyze and for which I will generate a response network. A very interesting effect observed by Thorpe et al. is that YAP1 seems to play an important role in the overall response of Saccharomyces cerevisiae to cumene hydroperoxide, the chemical used in this study. Of relevance, not only does the deletion of YAP1 yield a mutant sensitive to the chemical, but the defective regulation of this gene by other genes (in particular, TRX2, GPX3 and YBP1) also leads to very sensitive mutants. These genes are responsible for the repression (TRX2) or activation (GPX3 and YBP1) of YAP1 in the presence of an oxidant. A study performed by Gasch et al. (245) also points out that the response of Saccharomyces cerevisiae to an environmental stress caused by an oxidant (namely, hydrogen peroxide) is mainly regulated by Yap1p, which hints at the YAP1 gene being an important node in the regulatory network for the response to oxidative stress.
Figure 5.2A shows the confidence network obtained by fitting Equation 4.2 to the wild-type data for the response of Saccharomyces cerevisiae to the addition of CHP to the media. From the confidence network it can be perceived, based on the inference rules for the selection of knockout experiments, that the GRX1, TSA2 and GSH2 genes have the same number of incoming (0) and outgoing (2) connections. However, as mentioned above, only one of these has an available knockout (GRX1); therefore, this is the knockout experiment to be added. Figure 5.2B shows the resulting confidence network. The number of connections discovered increases significantly (from 18 in the wild-type case to 60 with the two data sets), without any certainty that the increase in interactions reflects an increase in the true positive rate. These results may indicate that (i) the parameter space was not sampled enough and/or (ii) the knockout chosen was not the most informative. The first possibility (lack of parameter space coverage) can be addressed by running several more simulations, thereby increasing the number of solutions obtained from different initial values of the parameters, which may represent a greater coverage of the parameter space. However, the increase in coverage comes at the time cost of running the simulations. The second possibility (the wrong choice of knockout mutant) is imposed by the data available. As was discussed, the method relies on the ability to pinpoint which mutants would be the most informative to construct. This is done in an iterative manner, with

Table 5.1: Data-driven inferences on regulation of gene expression. Based on the inference rules presented before (see Section 4.2.1), a set of interactions can be inferred for the gene regulatory network by analysis of the changes in the basal level of expression of each of the genes in the network in the wild-type and in the mutant being analyzed.

Gene Activation Inhibition No effect YAP1 GPX1 GRX1 GPX2 TSA2 TRX2 GPX1 YAP1 TSA2 GPX2 GRX2 YBP1 GSH1 GSH2 GPX2 YBP1 GRX1 YAP1 SKN7 YBP1 GRX2 YAP1 TSA2 GPX2 YBP1 TRX2 GSH2 GLR1 TSA2

each iteration of the method suggesting a knockout experiment. In the data being analyzed here, the knockouts were chosen beforehand, based on information found in the literature as to which genes seem most closely related to the response of the organism to oxidative stress. In the confidence network from the wild-type data, as shown, 3 genes tie as the best candidate to be knocked out. Further analysis of the data (not shown) indicates that CHP may have an inhibitory effect on TSA2 (this effect is seen both in the wild-type and in the case where the GRX1 mutant is added to the data set). Therefore, knocking out this gene could provide more information for the discovery of regulatory interactions between the genes in this response network. In fact, it has been shown (246) that TSA2, which encodes a cytosolic form of thioredoxin peroxidase II in Saccharomyces cerevisiae, is an important component of the organism's defense against organic peroxides, the family of compounds to which CHP belongs.

5.4.3 Coping with reality

As I discussed previously, the reverse engineering approach presented in this work relies on the possibility of performing knockout experiments in order to increase our knowledge of the inferred network in an iterative process. However, it is not always the case that an iterative approach can be undertaken; one must then be able to work with the data at hand. For the application of the method to experimental data I chose a data set that corresponds to a time response of Saccharomyces cerevisiae to cumene hydroperoxide, for which a total of 6 knockout mutants are available (see Section 5.3). As was shown above, the iterative approach is hindered by the fact that the knockouts the method would choose may not be the ones that were performed. Instead of the iterative approach, I will show how the method works when a pre-selection of the knockout mutants is made, as is the case here. Table 5.1 summarizes the effects that genes have on each other, based on the data obtained from the microarray experiments. It has been described in the literature (247) that Yap1p acts as the major regulator of oxidative stress in yeast, and this can be seen in the data used: YAP1 appears to be responsible for the activation of 4 other genes (GPX1, GPX2, TSA2 and TRX2). The response to oxidative stress by Yap1p appears to be very general, with the transcription factor being reported to respond to a wide range of oxidants (247; 245). Therefore, the response to other types of oxidants, namely to organic peroxides, may yield differences in genome regulation. As was discussed before, TSA2 has been linked to the response to this class of oxidants (246), but the data used in this work seem to indicate that GPX1 may also play a critical role in the response to CHP, since its deletion affects a large number of genes. The fact that this gene does not affect TSA2 could hint at a possible regulatory link between TSA2 and GPX1, but that can only be confirmed by further data.
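The qualitative calls of Table 5.1 come from comparing expression levels in a knockout against the wild type. A minimal sketch of such a classification follows; the function name and the tolerance band around a ratio of 1 (used to absorb measurement noise) are assumptions of this illustration:

```python
def classify_effect(ko_level, wt_level, noise=0.05):
    """Classify the effect of a deleted gene on a target gene from the
    ratio of knockout to wild-type expression. Ratios within 1 +/- noise
    are treated as non-interactions ("no effect")."""
    ratio = ko_level / wt_level
    if abs(ratio - 1.0) <= noise:
        return "no effect"
    # expression drops when the deleted gene was an activator of the target
    return "activation" if ratio < 1.0 else "inhibition"

classify_effect(0.97, 1.00)  # 'no effect'  (within the tolerance band)
classify_effect(0.40, 1.00)  # 'activation' (target falls without its activator)
classify_effect(1.80, 1.00)  # 'inhibition' (target rises without its inhibitor)
```

Widening `noise` trades spurious interactions for missed true ones, which is exactly the threshold choice discussed below.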
Since data obtained from experimental settings are inherently noisy, the definition of a non-interaction ("No effect" in Table 5.1) can be tuned to account for the presence of noise in the data. Previous results from the same group from which these data were obtained show that the largest coefficient of variation present in the microarrays was between 10% and 20% (46). Therefore, a non-interaction, whose definition is found in Section 4.2.1, is called if the ratio between the knockout and wild-type levels of expression for a given gene is 1 ± 0.05, where 0.05 corresponds to a noise level of 5%, below the coefficient of variation. This threshold is very conservative: using higher threshold values for noise levels, closer to the coefficient of variation in the microarray experiments, one could potentially eliminate a larger number of spurious interactions (at the risk of eliminating true interactions as well). Using this information, as well as the entire data set (wild-type and all mutants), one can infer the best possible gene regulatory network for the 13 genes selected in this study, reflecting the response of these genes to the perturbation of the system with cumene hydroperoxide. The implementation of the method was made as before (see Section 4.3). Figure 5.3 reflects the application of the method to the entire data set. Note that this is the final network that can be obtained, since no more knockouts are available (i.e., all of the available data was used). Follow-up experiments could be performed in order to improve

the inference of the network. Analyzing the confidence network obtained, one can perceive that the available knockouts are not sufficient to obtain a more comprehensive understanding of the interactions between the genes selected. This confidence network also hints at possible new experiments that may yet yield information to further our understanding of those interactions. In particular, it seems that a knockout of the YBP1 gene could hold interesting information that would clarify the interactions present in the network. It must be emphasized, however, that these hypotheses are speculative. Even if a literature search can provide some validation or justification for choosing one gene knockout over another, these choices may not provide the most information towards the overall understanding of the process being studied. This, in fact, was just shown here: the knockouts chosen were based on published results hinting at their importance in the response of the yeast Saccharomyces cerevisiae to oxidative stress, but the results obtained here may indicate that other knockouts could have been more informative. The method that I presented in Chapter 4, and which I applied to experimental data in this Chapter, reinforces the ideas and concepts of systems biology in the sense that there needs to be a constant and iterative exchange between experimental and computational/theoretical disciplines (34; 32). Nevertheless, the method that I propose here seems to be effective with either approach: (i) an iterative approach, under the scope of systems biology, where experimental and theoretical approaches are used in combination in order to achieve a maximal amount of information, or (ii) application to existing time series data for which gene knockout data exists.
This flexibility in the method's application can prove valuable for existing or future research projects that embrace systems biology concepts.

5.5 Conclusions

It was shown in this Chapter how the reverse engineering method proposed earlier (see Chapter 4), based on estimating the interactions between variables in a biological system by fitting a linearized model to data, can be applied in the context of systems biology research. The results obtained from reverse engineering a gene regulatory network for the response of Saccharomyces cerevisiae to oxidative stress induced by cumene hydroperoxide indicate that the method can be used as a first approach towards building a more comprehensive model of the system under study, where it can suggest experiments that confirm or deny potential regulatory interactions between the genes. As was discussed, the inference rules presented before (see Section 4.2.1) had to be readjusted to cope with the fact that most available systems biology experimental designs have a preset number of conditions based on previous knowledge found in the literature, such as which genes could potentially be involved in some biological process, like the response of


Figure 5.3: The yeast regulatory network for response to CHP insults when all available data is used. Comparing this network with the ones shown in Figure 5.2, one can see that the addition of all of the data, without the ability to select the optimal experiment, does not yield a significant improvement in our knowledge of the network. Careful analysis indicates that, beyond the already available knockouts, a knockout of YBP1 could provide important information for our understanding of the network.

the organism to oxidative stress. The method presented does not make any assumptions as to which variables could be important in the system, this knowledge being driven by the data and introduced into the model accordingly. The application of the method as performed in this Chapter demonstrates that, even in cases in which the selection of knockouts was made a priori, one can obtain a reasonably satisfying result concerning a gene regulatory network that relates to the response of the organism to an environmental perturbation. It should be emphasized, nevertheless, that the method would be expected to provide more satisfying results if the iterative process suggested were pursued. As was shown for the selection of the first mutant, the choice fell on the only available knockout (GRX1) among 3 possible candidates; a literature search indicates that TSA2, one of the other candidates, could potentially yield more interesting data and more information as to how the system reacts to perturbations with organic peroxides (246). The issue of the coverage of the parameter space, while searching for solutions to the least-squares problem, was briefly mentioned. The parameter space is multi-dimensional (n-dimensional, where n is the number of parameters to be estimated), and the objective function defines a very complex landscape over it, composed of several "hills" and "valleys". Global optimization algorithms, such as genetic algorithms (248; 249), evolutionary strategies (194) or the particle swarm algorithm (228; 229), attempt to find the global minimum of this landscape (i.e., the point in the entire space with the lowest value of the objective function being optimized). Since these algorithms are stochastic in nature, randomizing the initial conditions (i.e., randomizing the location in the space where the optimization begins) will yield different solutions.
If the landscape is simple, such as a parabola, the end solution of the optimization will be very close to the global minimum. In complex landscapes, however, there may be several "valleys" with very steep walls, thereby "trapping" the algorithm in a local minimum. Randomization of the initial starting point should, in principle, span the parameter space sufficiently; in practice, computational constraints, such as the time needed to run each optimization, do not allow for a wide enough coverage of the parameter space. Advances in the field of parameter optimization and optimization algorithms could provide new tools to allow for a comprehensive coverage of the parameter space, which may not have been achieved here. For the yeast regulatory network presented, the parameter space has a total of 190 dimensions, and the solution that I presented corresponds to the confidence network generated from 5 independent solutions. It may be argued, therefore, that not enough coverage of the space can be achieved with such a low number of solutions given the number of dimensions to be explored. Under the vision of systems biology proposed by Kitano (34; 32; 33), experimental discoveries walk hand-in-hand with theoretical and computational approaches. The development of iterative theoretical and computational methods has been tried before (234) with some success. The demand for careful experimental design based on theoretical and experimental results is bound to grow, not only to keep the costs of research to the essential but also to gain the most information possible from each iteration of the experimental process. The method that I presented in Chapter 4 and applied here can be used in such an iterative process.
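The effect of randomized initial conditions can be sketched with a toy multistart search: repeated local descents from random starting points on a one-dimensional function with two minima. Plain fixed-step gradient descent stands in here for the global optimizers cited above; the objective and all settings are illustrative:

```python
import random

def f(x):
    # toy objective with a local minimum near x = 1 and the
    # global minimum near x = -1 (the 0.3*x term breaks the symmetry)
    return (x * x - 1.0) ** 2 + 0.3 * x

def local_descent(x, step=1e-3, iters=20000):
    """Fixed-step gradient descent with a numerical derivative;
    it converges to the minimum of whichever basin x starts in."""
    h = 1e-6
    for _ in range(iters):
        grad = (f(x + h) - f(x - h)) / (2.0 * h)
        x -= step * grad
    return x

def multistart(n_starts, seed=0):
    """Run several descents from random starting points and keep the best."""
    rng = random.Random(seed)
    best_x = local_descent(rng.uniform(-3.0, 3.0))
    for _ in range(n_starts - 1):
        x = local_descent(rng.uniform(-3.0, 3.0))
        if f(x) < f(best_x):
            best_x = x
    return best_x
```

A single descent started in the right-hand basin stops at the local minimum; enough random restarts recover the global one. In 190 dimensions the number of restarts needed for comparable coverage grows rapidly, which is the computational cost discussed above.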
The application of inference rules at each iteration allows one to gather the largest amount of information, thereby potentially reducing the number of experiments needed to achieve a broad overview of the gene regulatory network of the system at hand.

Chapter 6

Looking ahead: Future research

“If I have seen further it is by standing on the shoulders of giants.” - Isaac Newton

6.1 Abstract

Systems biology brought a change to the face of research in the life sciences. This change presented different needs in terms of data collection but also in data analysis, and opened the door to make biology the ultimate interdisciplinary field. A systems biologist will be expected to have a breadth of knowledge that spans multiple fields, from mathematics to computer science, statistics and biology (250). But what is in the cards for systems biology in the long run? I will approach some of the issues that systems biology may face, as well as provide my own view of the development of the field, from biology to mathematics.

6.2 Back to the future: Part I

The face of research in the life sciences has changed considerably in the past few years. The outlook has shifted from reductionist approaches, in which complex systems are dissected into isolated components, to a holistic one, in which the system under study is looked at as a whole rather than as a collection of parts. As introduced back in Chapter 1, systems biology (34; 32) is the new hot buzz, the latest installment in a series of revolutionary techniques and ideas introduced into life sciences research that have changed the field entirely. From studies of single genes to the microarrays introduced by Schena et al. (21), from enzyme kinetic studies to studies of the entire protein makeup of cells (1), or from chemical analysis of biological compounds to mass quantification of thousands of compounds in a cell at a given point in time (2), the field of life sciences research is picking up an incredible amount of pace. A recent publication in Science is one of several that highlight how a comprehensive study of a given organism, under the scope or definition of systems biology, can be exciting even if overwhelming (251; 252). Exciting, since it opens up the wealth of knowledge that new data collection technologies and techniques allow; overwhelming, since this wealth of knowledge comes with unthinkable amounts of data that need to be analyzed and made sense of. Systems biology needs to rise to the challenge of managing, integrating and analyzing such complex data sets in order for our knowledge of a given organism to increase. Not only has systems biology changed the face of research in biology, it is envisioned that it will also have a big impact on medical practice (44; 253; 34). One of the major contributions that systems biology can bring to medicine comes from the predictive power that can be obtained through the construction of mathematical models of the system. The use of systems
The use of systems 110 Looking ahead

biology in drug development, with hypothesis-driven and discovery-driven science intertwined and working together (254), may help overcome flawed decisions made on insufficient data and knowledge of the organism (44), enabling faster decisions on candidate drug targets and on the physiological response to drug treatment through computer simulations. Even though the number of publications in the field increases tremendously every month, the field still lacks a "breaking news" type of discovery to solidify its importance to the community (254). If the discovery of new planets that could harbor life has the astro-sciences on their toes (as reported in Nature concerning the possibility that the planet Gliese 581c, which orbits the 'habitable zone' of the Gliese 581 dwarf star, has conditions similar to those found on Earth – see 255), systems biology has yet to achieve the same effect in the life sciences. Nevertheless, the prospects are good.

6.3 Reverse engineering life

As discussed throughout this dissertation, one of the foci of systems biology is the uncovering of network structures with biological relevance from data that comes out of experimental setups. A multitude of methods that meet these goals already exists, and examples were given accordingly. These span methods that use perturbation data applied in the vicinity of a steady state (141; 134; 135; 136; 137), methods that use Bayesian theory or modifications thereof (see 220), algebraic approaches (87), and several different statistical approaches (e.g., 130; 147). The method introduced here (see Chapter 4) bases the inference of a gene regulatory network on the ability to fit a modified zeroth-order Taylor approximation to time series data, making no assumptions on the number or type of interactions that each variable (i.e., each gene) has with all of the other variables (i.e., the other genes). These approaches present themselves as a first pass towards the overall goal of devising a comprehensive mathematical model of the biological system at hand, and their rates of success under different conditions have been demonstrated (see Chapter 3 and 217; 222; 205; 256).

6.3.1 Linear vs. non-linear worlds

Even if linear approaches to biology are found throughout the literature and are able to explain highly complex systems (see, e.g., the case of the Lotka-Volterra models – Equations 1.1a and 1.1b), biology does not present itself linearly. Inhibitions, activations and species interactions are examples of cases that render linear approaches unsound. The classical example of the Lineweaver-Burk equation to approximate enzyme kinetics parameters shows that linear approaches tend to be flawed. Nevertheless, such approaches can be shown to replicate, to some extent, complex behaviors such as oscillations in predator-prey populations.
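The Lineweaver-Burk example just cited can be made concrete: the Michaelis-Menten law $v = V_{max}s/(K_m + s)$ is rewritten as a straight line in the reciprocals, $1/v = (K_m/V_{max})(1/s) + 1/V_{max}$, and fitted by ordinary least squares. All numerical values below are illustrative:

```python
def michaelis_menten(s, vmax, km):
    """Michaelis-Menten rate law: v = vmax * s / (km + s)."""
    return vmax * s / (km + s)

def lineweaver_burk_fit(s_values, v_values):
    """Estimate Vmax and Km from the double-reciprocal plot by
    ordinary least squares: 1/v = (Km/Vmax)*(1/s) + 1/Vmax."""
    xs = [1.0 / s for s in s_values]
    ys = [1.0 / v for v in v_values]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    vmax = 1.0 / intercept
    km = slope * vmax
    return vmax, km

# noise-free rates generated with Vmax = 10, Km = 2 are recovered exactly
s_vals = [0.5, 1.0, 2.0, 5.0, 10.0]
v_vals = [michaelis_menten(s, 10.0, 2.0) for s in s_vals]
vmax_est, km_est = lineweaver_burk_fit(s_vals, v_vals)
```

With exact data the transform is harmless; with noisy rates, however, taking reciprocals weights the small velocities measured at low substrate concentrations most heavily, which is precisely the flaw of the linearization alluded to above.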

It was shown in this work that a linear approach to reverse engineering gene regulatory networks from time series data yields satisfactory results, represented in an approximation of the original network as the confidence network. In fact, the methods that I highlighted above also operate within the limits of linearity of the dynamical system (especially those that use perturbations around the steady state, since these perturbations are usually small enough that the system remains close to the steady state and a linear approximation is therefore still valid). But could a non-linear approach, in which the interactions between the variables are defined explicitly as activations or inhibitions, yield better results? The rate of expression of a gene, as given by Equation 4.2, is defined as:

$$\dot{x}_i = \gamma_i \left( \sum_{k=1}^{p} \beta_{ik} y_k + \sum_{j=1}^{N} \alpha_{ij} x_j + C_i \right) \quad (6.1)$$

where the terms in the equation were defined previously. In biology, each species depends on the rates at which it is produced and degraded, and these account for the overall rate of change of this species:

$$\dot{a}_i = v_1 - v_2 \quad (6.2)$$

where $v_1$ and $v_2$ are the total rates of synthesis and degradation of $a$ (i.e., they represent the sum of all the terms that contribute to synthesis and of all the terms that contribute to degradation, respectively). In terms of gene expression, $v_1$ includes terms that affect the rate of expression by inhibiting, activating, or both, and $v_2$ depends only on the concentration of the transcript. Therefore, one can write these as (206):

$$v_1 = V_x \prod_j \frac{K_{i_j}^{n_j}}{I_j^{n_j} + K_{i_j}^{n_j}} \prod_k \frac{A_k^{n_k}}{A_k^{n_k} + K_{a_k}^{n_k}} \quad (6.3a)$$

$$v_2 = k_{deg,x} X \quad (6.3b)$$

where $x$ is a gene, $K_{i_j}$ is the inhibition constant for inhibitor $I_j$, $K_{a_k}$ is the activation constant for activator $A_k$, and $n$ is the Hill coefficient that accounts for cooperativity of binding of the effectors to the transcriptional machinery; $k_{deg,x}$ is the degradation rate of the transcript of gene $x$. One can rewrite Equation 6.1 and generate a generalized equation for the transcription of a gene $X$ that accounts for non-linearities in the system and that can be used for fitting the model to the time series data. This equation can be written as:

\dot{x}_i = \gamma_i V_i \prod_{j=1}^{n} \frac{K_j + x_j}{K_j + \alpha_j x_j} \prod_{k=1}^{p} \frac{K_k + y_k}{K_k + \beta_k y_k} \qquad (6.4)

where V_i is the maximal rate of transcription, \prod_{j=1}^{n} (K_j + x_j)/(K_j + \alpha_j x_j) accounts for the effect that the other variables of the system have on the expression of x_i, and \prod_{k=1}^{p} (K_k + y_k)/(K_k + \beta_k y_k) accounts for the effect that the environmental perturbations applied to the system (y_k) have on the rate of expression of x_i. Built into the equation is the possibility that each variable (be it an internal variable of the system or an environmental perturbation) can have either an inhibitory or an activating effect on x_i. This duality is given by the parameters \alpha_j and \beta_k: if the estimated value of one of these parameters is larger than 1, the corresponding variable is an inhibitor; if it is smaller than 1, the variable has an activating effect; and if it turns out to be equal to 1, the variable has no effect on x_i. Such an approach has already been implemented (Adaoha Ihekwaba, unpublished results), with somewhat satisfactory results. Where the method presented in this work may seem too abstract, familiarity with equations and parameters such as those in Equation 6.4 may draw closer attention to this kind of approach from the experimentalist community.
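To make the duality of the α and β parameters concrete, the rate law of Equation 6.4 can be evaluated numerically. The sketch below is illustrative only; the function and variable names are my own and do not come from any published implementation:

```python
def transcription_rate(x, y, gamma, V, K_x, alpha, K_y, beta):
    """Evaluate the rate law of Equation 6.4.

    Each factor (K + u) / (K + c*u) equals 1 when c == 1 (no effect),
    is < 1 when c > 1 (inhibition), and > 1 when c < 1 (activation).
    """
    rate = gamma * V
    for xj, Kj, aj in zip(x, K_x, alpha):   # internal variables of the system
        rate *= (Kj + xj) / (Kj + aj * xj)
    for yk, Kk, bk in zip(y, K_y, beta):    # environmental perturbations
        rate *= (Kk + yk) / (Kk + bk * yk)
    return rate

# one internal variable at concentration 2.0, K = 1.0, basal rate gamma*V = 1.0
base  = transcription_rate([2.0], [], 1.0, 1.0, [1.0], [1.0], [], [])  # alpha = 1
inhib = transcription_rate([2.0], [], 1.0, 1.0, [1.0], [5.0], [], [])  # alpha > 1
activ = transcription_rate([2.0], [], 1.0, 1.0, [1.0], [0.2], [], [])  # alpha < 1
```

With α = 1 the factor cancels and the basal rate is recovered; α = 5 lowers the rate and α = 0.2 raises it, matching the interpretation of the estimated parameters given in the text.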

6.3.2 Speedy delivery

It was discussed (Chapter 4) that the number of parameters that need to be estimated in order to obtain one fitting solution is fairly large (on the order of N^2 + N×p + N). For the non-linear modification presented in Equation 6.4 this number is, once again, fairly high: N + N^2 + (N×p)^2. There therefore seems to be a need for parallel computing to expedite the solution of the reverse engineering problem in systems biology. A commentary by Sui Huang in Nature in 2000 (257) highlights the need to increase research in parallel computing in order to aid the solution of life sciences research problems, such as genome-wide interaction maps or mathematical modeling. The author argues that biology will soon join the physical sciences in the demand for extra computing power, given the high complexity of biological systems. An interesting view of the search for those extra bits of computing capacity is presented by Butler (258), who goes through some projects aimed at optimizing the computing power available for research, namely through the use of supercomputers. Software like Copasi (213) and other packages already display the ability to operate remotely and may soon be able to run in parallel (or already have that capability, as does the E-CELL project – see 259), which would represent a great advance in computing for the solution of biological problems. The use of the Internet as a supercomputer, as done for the SETI@home and Folding@home projects (258), may spur the development of analysis and simulation tools for systems biology that make use of the millions of computers that are online every day.

Figure 6.1: The knowledge iceberg. [Figure labels: "What we know" / "What is left".] Our knowledge of biological systems, though extensive, may be limited to only a small fraction of what the system is about. System-level approaches, under the scope of systems biology, will allow the identification of all of the components of the system by taking a holistic approach. It is only when these data are understood that we can improve our knowledge of the system and, using the iceberg analogy, perceive what lies underwater. (Picture of an actual iceberg taken from a rig at Newfoundland. The iceberg is estimated to weigh 300,000,000 lbs. – from http://defiant.corban.edu/gtipton/net-fun/iceberg.html)
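To give a sense of how quickly the parameter counts quoted in Section 6.3.2 grow with system size, a small sketch (N is the number of genes, p the number of perturbing variables; the formulas are the counts from the text, the loop values are illustrative):

```python
def linear_params(N, p):
    # interaction terms (N^2) + perturbation terms (N*p) + per-gene constants (N)
    return N**2 + N * p + N

def nonlinear_params(N, p):
    # count quoted in the text for the non-linear model of Equation 6.4
    return N + N**2 + (N * p)**2

for N in (13, 100, 1000):   # 13 genes as in Chapter 5, then larger networks
    print(N, linear_params(N, 1), nonlinear_params(N, 1))
```

Even for the 13-gene module of Chapter 5 the non-linear model needs 351 parameters against 195 for the linear one, and the (N×p)^2 term makes the gap explode as networks and perturbation sets grow, which is the motivation for the parallel computing argument above.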

6.3.3 Specifying the approximation

With greater or lesser accuracy, existing reverse engineering methods produce only approximations of the actual biological networks they infer (see Chapter 3). These approximations should not be taken at face value, as they represent only the best network that the approximation underlying the method allows. As is the case for the method presented in this work, the confidence network represents only an approximation to the underlying biological network that generated the data used in the fitting process. Even if one applies a concerted and iterative approach as suggested, the network can only be an approximation to the real network. Failure to recognize this ever-present shortcoming of reverse engineering methods may lead to wrong interpretations of the data and incorrect assumptions about the nature of the biological network being analyzed. So what can be done? The top-down and bottom-up modeling approaches share the common goal of generating a mathematical model of the biological system that can have valuable predictive power and that enables studies of the system's behavior under conditions that could be difficult to obtain experimentally. While the bottom-up approach has met some success (see 191 for a success story in model construction using a bottom-up approach for the yeast Saccharomyces cerevisiae), the top-down modeling approach is lagging. Reverse engineering methods have been successful in recovering phenomenological representations of biological networks. The next step is to uncover the kinetic equations that gave rise to the experimental data. This step of reverse engineering and parameter estimation in biological networks has been approached before (143; 194), and benchmark data sets have been proposed for the purpose (212).
If systems biology approaches and reverse engineering methodologies are to succeed in enabling the construction of detailed mathematical models of biological systems, then research needs to move from the phenomenological network toward a detailed kinetic model. This may be the most important challenge in reverse engineering and systems biology for the next few years.
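As a toy illustration of what fitting kinetic equations to data involves, the sketch below recovers the degradation constant of v_2 = k_deg·X from a synthetic time course by least squares. The data are made up and the coarse grid search stands in for a real numerical optimizer:

```python
import math

# synthetic time course of a transcript decaying as X(t) = X0 * exp(-k*t)
X0, k_true = 10.0, 0.5          # "true" parameter used to generate the data
times = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [X0 * math.exp(-k_true * t) for t in times]

def sse(k):
    """Sum of squared errors between the kinetic model and the data."""
    return sum((X0 * math.exp(-k * t) - obs) ** 2
               for t, obs in zip(times, observed))

# coarse grid search over k in (0, 2], standing in for a proper optimizer
candidates = [i / 100 for i in range(1, 201)]
k_fit = min(candidates, key=sse)
```

On noise-free data the search recovers k exactly; with real, noisy time series the same least squares objective would be handed to a numerical optimizer, and each additional kinetic parameter enlarges the search space accordingly.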

6.4 Back to the future: Part II

A biological system, be it a cell, a tissue, an organ or an entire organism, is composed of a large number of variables. As an example, the number of genes found in living organisms spans from a few hundred to a few thousand (see 260), and that does not take into account the number of proteins and metabolites found in the same organism. Therefore, the number of variables that need to be considered can be staggering. On the other hand, some highly complex engineering problems also tackle large numbers of variables: the number of elements found in a microprocessor can be astounding to the unready observer. Yet, while an electrical engineer can understand a microprocessor circuit by looking at the "parts list" of the circuit, the same cannot be achieved by a biologist looking at the "parts list" of a living organism. We are drawn back to the paper by Lazebnik (18). The fact that a biologist would assemble a radio by 'trial and error', rather than by the systematic reconstruction an electrical engineer would use, illustrates the need for a clear definition of what the "parts" of a biological system are. However, it is also important to note that biology is more complex than an electrical circuit: the 'circuitry' of life is not fully understood. Even for complete genomes, how genes interact with each other or how the system responds to an insult is known only superficially. It is possible that, even for the most studied subsystems, such as the glycolytic pathway or the regulation of p53, we know only the very tip of a 'knowledge iceberg' and have no idea what lies underwater (see Figure 6.1).

6.4.1 Modular biology and systems approaches

Methodologies with goals similar to the one I proposed in Chapter 4 are abundant (see Section 1.5 and Chapter 3). However, these methodologies are often developed using synthetic biological networks, for which all of the information about topology and kinetics is known. So what can be expected from the application of these methodologies to solving real problems and increasing our knowledge of biological complexity? The amount of information collected from systems biology-oriented projects is overwhelming. When trying to make sense of these data one must be aware of the limitations of understanding, as discussed previously (see Section 1.4). Therefore, while keeping a holistic framework, one must deconstruct the problem into smaller pieces, in order to gain knowledge in a step-wise manner, rather than generating an analysis or a mathematical model that is too general without being specific. Let us look at the results I presented in Chapter 5. The goal of that study was to generate a gene regulatory network for a set of genes of Saccharomyces cerevisiae known to be involved in the response of this fungus to oxidative stress. The collection of 13 genes was selected on this premise. One can therefore argue that this set of genes constitutes a module in the overall gene regulatory network of the organism: the network constructed is not the gene regulatory network of the organism, but shows how each of these 13 genes may be influenced by the others when cumene hydroperoxide is added to a population of cells in mid-exponential phase. Under this assumption, one can select sets of such modules and generate an analysis or a model that relates the modules to each other. Further experimentation on each of the modules will add greater specificity and knowledge about the overall system.
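One simple way to make the module idea operational is to threshold pairwise correlations among the selected genes and keep only the strongly correlated pairs as within-module links. The sketch below uses made-up expression profiles and a hypothetical cutoff, not the Chapter 5 data:

```python
def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

# hypothetical expression profiles over five time points
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.2, 2.9, 4.1, 5.2],  # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # unrelated profile
}

def module_edges(profiles, threshold=0.9):
    """Return gene pairs whose absolute correlation meets the cutoff."""
    names = sorted(profiles)
    return [(g, h)
            for i, g in enumerate(names) for h in names[i + 1:]
            if abs(pearson(profiles[g], profiles[h])) >= threshold]
```

Here only geneA and geneB end up linked; the same kind of thresholding could be used to delimit a candidate module before any model fitting is attempted, with the caveats about correlation-based inference discussed in Chapter 2.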
In other words, it can prove more effective to "knowledge hunt" than to "knowledge fish": when presented with a large amount of information, it is more fruitful to look for answers in a concerted way ("hunting") than to accept any answer at face value ("fishing"). The idea of modularity in biology has been addressed by Segrè (261), who champions the idea that pathway-centered views of metabolism should be complemented by modulocentric approaches. But should we reduce the analysis of biological systems to colored boxes in a circuit, each with a different function? I do not believe so. As pointed out by Endy (71), biological systems are always evolving. This makes biology interesting but also difficult to standardize. Even the advent of systems biology and comprehensive data collection from functional genomics experiments cannot account for the natural variability of biology. Nevertheless, it may prove valuable to translate some engineering concepts into biological settings, namely standardization, decoupling and abstraction (71). This would imply generating a catalogue of the "parts" of biological organisms, a daunting task in itself. I refer the reader to Endy's interesting paper on setting foundations for biological engineering.

6.5 Back to the future: Part III

So where does biology go from here? Systems biology has left a significant mark on biology and will continue to do so. It should be expected that the approaches brought by systems biology will generate great developments in drug development and in medical practice. Targeted medicine is increasingly becoming a reality. The development of mathematical models of disease states will allow 'big pharma' to develop more efficient drugs and, hopefully, lead to a worldwide improvement of the human condition. But are we any closer to understanding biological systems? Are we any closer to having a complete grasp of the 'bits and pieces' that make life possible? I believe that the changes observed in science over the past few years are leading us to that goal. The road ahead seems promising, and I foresee that the next decade will be the decade of systems biology, when we will begin to see palpable results of comprehensive systems-level analyses of organisms. But until we can overcome the difficulties that prevent us from fully understanding an organism, as highlighted by Endy, we must accept that our knowledge will always be limited and that, in the words of the accomplished sci-fi writer Arthur C. Clarke, "the truth, as always, will be far stranger" (262).

Bibliography

1. C. Godon, et al., J. Biol. Chem. 273, 22480 (1998).

2. W. Weckwerth, M. E. Loureiro, K. Wenzel, O. Fiehn, Proc. Natl. Acad. Sci. USA 101, 7809 (2004).

3. U. Roessner, et al., Plant Cell 13, 11 (2001).

4. J. D. Watson, F. H. Crick, Cold Spring Harb. Symp. Quant. Biol. 18, 123 (1953).

5. J. D. Watson, F. H. Crick, Nature 171, 737 (1953).

6. J. D. Watson, F. H. Crick, Nature 171, 964 (1953).

7. L. Pauling, R. B. Corey, Nature 171, 346 (1953).

8. L. Pauling, R. B. Corey, Proc. Natl. Acad. Sci. USA 39, 84 (1953).

9. F. J. Corpas, L. García-Salguero, J. Peragón, J. A. Lupiáñez, Life Sci. 56, 179 (1995).

10. A. M. Martins, C. Cordeiro, A. Ponces Freire, Arch. Biochem. Biophys. 366, 15 (1999).

11. A. M. Martins, P. Mendes, C. Cordeiro, A. Ponces Freire, Eur. J. Biochem. 268, 3930 (2001).

12. W. W. Cleland, Biochim. Biophys. Acta 67, 104 (1963).

13. E. L. King, C. Altman, J. Phys. Chem. 60, 1375 (1956).

14. B.-S. Kerem, et al., Science 245, 1073 (1989).

15. J. R. Riordan, et al., Science 245, 1066 (1989).

16. J. M. Rommens, et al., Science 245, 1059 (1989).

17. C. Papazoglu, A. A. Mills, J. Path. 211, 124 (2007).


18. Y. Lazebnik, Cancer Cell 2, 179 (2002).

19. P. Hieter, M. Boguski, Science 278, 601 (1997).

20. T. F. Smith, Trends Genet. 14, 291 (1998).

21. M. Schena, D. Shalon, R. W. Davis, P. O. Brown, Science 270, 467 (1995).

22. S. Ghaemmaghami, M. C. Fitzgerald, T. G. Oas, Proc. Natl. Acad. Sci. USA 97, 8296 (2000).

23. M. Scholz, S. Gatzek, A. Sterling, O. Fiehn, J. Selbig, Bioinformatics 20, 2447 (2004).

24. J. L. Griffin, Phil. Trans. R. Soc. Lond. B 359, 857 (2004).

25. C. D. Broeckling, et al., J. Exp. Bot. 56, 323 (2005).

26. O. Fiehn, et al., Nature Biotech. 18, 1157 (2000).

27. J. Sachs, Aristotle’s Metaphysics (Green Lion, 2002), second edn.

28. D. A. Fell, Understanding the control of metabolism (Portland Press, London, 1996).

29. H. Kacser, The organization of cell metabolism, R. Welch, J. Clegg, eds. (Plenum Press, 1986), pp. 327–337.

30. B. O. Palsson, Nature Biotech. 18, 1147 (2000).

31. T. Ideker, et al., Science 292, 929 (2001).

32. H. Kitano, Nature 420, 206 (2002).

33. H. Kitano, Science 295, 1662 (2002).

34. H. Kitano, Curr. Genet. 41, 1 (2002).

35. A. Agrawal, Nature Biotech. 17, 743 (1999).

36. International Conference on Systems Biology.

37. W. Zieglgansberger, T. R. Tolle, Curr. Opin. Neurobiol. 3, 611 (1993).

38. J. M. Beechem, Methods Enz. 210, 37 (1992).

39. M. Straume, M. L. Johnson, Methods Enz. 210, 87 (1992).

40. B. Chance, J. Biol. Chem. 151, 553 (1943).

41. L. Chong, B. Ray, Science 295, 1661 (2002).

42. E. Pennisi, Science 302, 1646 (2003).

43. N. J. Provart, P. McCourt, Curr. Opin. Plant Biol. 7, 605 (2004).

44. E. C. Butcher, E. L. Berg, E. J. Kunkel, Nature Biotech. 22, 1253 (2004).

45. L. J. Sweetlove, R. L. Last, A. Fernie, Plant Phys. 132, 420 (2003).

46. A. M. Martins, et al., Curr. Genomics 5, 649 (2004).

47. V. J. Nikiforova, et al., J. Exp. Bot. 55, 1861 (2004).

48. D. Garfinkel, FEBS Letters 2, S9 (1969).

49. D. Garfinkel, Trends Biochem. Sci. 6, 69 (1981).

50. D. Garfinkel, Am. J. Phys. 239, R1 (1980).

51. D. Garfinkel, Math. Biosci. 72, 131 (1984).

52. H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, A.-L. Barabási, Nature 407, 651 (2000).

53. G. Stephanopoulos, J. J. Vallino, Science 252, 1675 (1991).

54. N. Barkai, S. Leibler, Nature 387, 913 (1997).

55. R. Albert, H. Jeong, A.-L. Barabási, Nature 406, 378 (2000).

56. E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, A.-L. Barabási, Science 297, 1551 (2002).

57. A.-L. Barabási, Linked (Plume, 2003).

58. F. S. Roberts, Applied combinatorics (Prentice Hall, 1984).

59. S. N. Dorogovtsev, J. F. F. Mendes, Adv. Phys. 51, 1079 (2002).

60. P. Erdős, A. Rényi, Publ. Math. Debrecen 6, 290 (1959).

61. D. J. Watts, S. H. Strogatz, Nature 393, 440 (1998).

62. A.-L. Barabási, R. Albert, Science 286, 509 (1999).

63. A.-L. Barabási, R. Albert, H. Jeong, Physica A 281, 69 (2000).

64. A.-L. Barabási, E. Ravasz, T. Vicsek, Physica A 299, 559 (2001).

65. I. Farkas, H. Jeong, T. Vicsek, A.-L. Barabási, Z. N. Oltvai, Physica A 318, 601 (2003).

66. D. A. Fell, A. Wagner, Nature Biotech. 18, 1121 (2000).

67. A. Wagner, D. A. Fell, Proc. R. Soc. Lond. B 268, 1803 (2001).

68. R. Tanaka, M. E. Csete, J. C. Doyle, IEE Proc.-Syst. Biol. 152, 179 (2005).

69. U. Alon, Science 301, 1866 (2003).

70. G. A. Miller, Psychol. Rev. 63, 81 (1956).

71. D. Endy, Nature 438, 449 (2005).

72. J. Hasty, D. McMillen, J. J. Collins, Nature 420, 224 (2002).

73. J. Hasty, D. McMillen, F. Isaacs, J. J. Collins, Nature Reviews 2, 268 (2001).

74. A. J. Lotka, Proc. Natl. Acad. Sci. USA 18, 172 (1932).

75. A. J. Lotka, J. Phys. Chem. 14, 271 (1910).

76. A. J. Lotka, J. Am. Chem. Soc. 42, 1595 (1920).

77. J. Higgins, Proc. Natl. Acad. Sci. USA 52, 989 (1964).

78. B. Hess, A. Boiteux, Annu. Rev. Biochem. 40, 237 (1971).

79. F. Hynne, S. Danø, P. G. Sørensen, Biophys. Chem. 94, 121 (2001).

80. J. Ma, D. J. D. Earn, Bull. Math. Biol. 68, 679 (2006).

81. M. Safan, H. Heesterbeek, K. Dietz, J. Math. Biol. 53, 703 (2006).

82. H. Kacser, J. Burns, Symp. Soc. Exp. Biol. 27, 65 (1973).

83. L. M. McIntyre, D. R. Thorburn, W. A. Bubb, P. W. Kuchel, Eur. J. Biochem. 180, 399 (1989).

84. B. Novak, A. Csikasz-Nagy, B. Gyorffy, K. Nasmyth, J. J. Tyson, Phil. Trans. R. Soc. Lond. B 353, 2063 (1998).

85. M. G. Poolman, D. A. Fell, S. Thomas, J. Exp. Bot. 51, 319 (2000).

86. E. A. Gaffney, N. A. Monk, Bull. Math. Biol. 68, 99 (2006).

87. R. Laubenbacher, B. Stigler, J. Theor. Biol. 229, 523 (2004).

88. S. A. Kauffman, J. Theor. Biol. 22, 437 (1969).

89. L. Glass, J. Theor. Biol. 54, 85 (1975).

90. T. J. Perkins, M. Hallett, L. Glass, Biosystems 84, 115 (2006).

91. L. Raeymaekers, J. Theor. Biol. 218, 331 (2002).

92. D. Garfinkel, R. Sack, Ecology 45, 502 (1964).

93. D. J. D. Earn, P. Rohani, B. M. Bolker, B. T. Grenfell, Science 287, 667 (2000).

94. B. Chance, D. Greenstein, J. Higgins, C. C. Yang, Arch. Biochem. Biophys. 37, 322 (1952).

95. B. Chance, J. Biol. Chem. 235, 2440 (1960).

96. D. Garfinkel, J. Biol. Chem. 241, 3918 (1966).

97. D. Garfinkel, Brain Res. 23, 387 (1970).

98. M. Kohn, M. J. Achs, D. Garfinkel, Am. J. Phys. 232, R158 (1977).

99. D. Garfinkel, K. A. Fegley, Am. J. Phys. 246, R641 (1984).

100. L. Garfinkel, D. M. Cohen, V. W. Soo, D. Garfinkel, C. A. Kulikowski, Biochem. J. 264, 175 (1989).

101. D. Garfinkel, Comput. Biomed. Res. 2, 31 (1968).

102. G.-C. Roman, D. Garfinkel, Comput. Biomed. Res. 11, 3 (1978).

103. D. Garfinkel, Comput. Biomed. Res. 2, i (1968).

104. N. Le Novère, et al., Nature Biotech. 23, 1509 (2005).

105. A. Brazma, et al., Nature Genet. 29, 365 (2001).

106. B. Hess, A. Boiteux, Ber. Bunsenges. Phys. Chem. 84, 346 (1980).

107. A. Goldbeter, J.-L. Martiel, FEBS Letters 191, 149 (1985).

108. K. Goto, D. L. Laval-Martin, L. N. Edmunds Jr., Science 228, 1284 (1985).

109. L. N. M. Duysens, J. Amesz, Biophys. Biochim. Acta 24, 19 (1957).

110. B. Chance, B. Schoener, S. Elsaesser, Proc. Natl. Acad. Sci. USA 52, 337 (1964).

111. B. Hess, J. Theor. Biol. 81, 7 (1979).

112. J. Higgins, Ind. Eng. Chem. 59, 19 (1967).

113. I. Prigogine, R. Lefever, A. Goldbeter, M. Herschkowitz-Kaufman, Nature 223, 913 (1969).

114. K. Pye, B. Chance, Proc. Natl. Acad. Sci. USA 55, 888 (1966).

115. A. Goldbeter, FEBS Letters 43, 327 (1974).

116. O. Decroly, A. Goldbeter, Proc. Natl. Acad. Sci. USA 79, 6917 (1982).

117. M. Markus, B. Hess, Proc. Natl. Acad. Sci. USA 81, 4394 (1984).

118. B. Hess, M. Markus, Trends Biochem. Sci. 12, 45 (1987).

119. M. Markus, D. Kuschmitz, B. Hess, FEBS Letters 172, 235 (1984).

120. E. Di Cera, P. E. Phillipson, J. Wyman, Proc. Natl. Acad. Sci. USA 86, 142 (1989).

121. M. Bier, B. Teusink, B. N. Kholodenko, H. V. Westerhoff, Biophys. Chem. 62, 15 (1996).

122. K. A. Reijenga, et al., J. Theor. Biol. 232, 385 (2005).

123. S. Danø, P. G. Sørensen, F. Hynne, Nature 402, 320 (1999).

124. J. Wolf, et al., Biophys. J. 78, 1145 (2000).

125. F. E. Yates, Am. J. Phys. 235, R201 (1978).

126. P. W. Anderson, Science 177, 393 (1972).

127. R. Heinrich, S. M. Rapoport, T. A. Rapoport, Prog. Biophys. Mol. Biol. 32, 1 (1977).

128. J. J. Tyson, K. Chen, B. Novak, Nature Reviews 2, 908 (2001).

129. K. Bar-Eli, W. Geiseler, J. Phys. Chem. 87, 1352 (1983).

130. A. Arkin, J. Ross, J. Phys. Chem. 99, 970 (1995).

131. M. Samoilov, A. Arkin, J. Ross, Chaos 11, 108 (2001).

132. E. Klipp, B. Nordlander, R. Krüger, P. Gennemark, S. Hohmann, Nature Biotech. 23, 975 (2005).

133. P. Brazhnik, A. de la Fuente, P. Mendes, Trends Biotechnol. 20, 467 (2002).

134. A. de la Fuente, P. Brazhnik, P. Mendes, Trends Genet. 18, 395 (2002).

135. A. de la Fuente, P. Mendes, Mol. Biol. Rep. 29, 73 (2002).

136. B. N. Kholodenko, et al., Proc. Natl. Acad. Sci. USA 99, 12841 (2002).

137. T. S. Gardner, D. di Bernardo, D. Lorenz, J. J. Collins, Science 301, 102 (2003).

138. J. Tegnér, M. K. S. Yeung, J. Hasty, J. J. Collins, Proc. Natl. Acad. Sci. USA 100, 5944 (2003).

139. M. K. S. Yeung, J. Tegnér, J. J. Collins, Proc. Natl. Acad. Sci. USA 99, 6163 (2002).

140. A. Lucau-Danila, et al., Mol. Cell. Biol. 25, 1860 (2005).

141. T. Chevalier, I. Schreiber, J. Ross, J. Phys. Chem. 97, 6776 (1993).

142. R. Díaz-Sierra, J. B. Lozano, V. Fairén, J. Phys. Chem. A 103, 337 (1999).

143. M. Sugimoto, S. Kikuchi, M. Tomita, Biosystems 80, 155 (2005).

144. E. Sontag, A. Kiyatkin, B. N. Kholodenko, Bioinformatics 20, 1877 (2004).

145. M. Wahde, J. Hertz, Biosystems 55, 129 (2000).

146. J. Förster, I. Famili, P. Fu, B. O. Palsson, J. Nielsen, Genome Res. 13, 244 (2003).

147. A. de la Fuente, N. Bing, I. Hoeschele, P. Mendes, Bioinformatics 20, 3565 (2004).

148. M. Bittner, P. Meltzer, J. Trent, Nature Genet. 22, 213 (1999).

149. K. Shedden, S. Cooper, Nucleic Acids Res. 30, 2920 (2002).

150. P. T. Spellman, et al., Mol. Biol. Cell 9, 3273 (1998).

151. O. Fiehn, Phytochemistry 62, 875 (2003).

152. U. Roessner, C. Wagner, J. Kopka, R. N. Trethewey, L. Willmitzer, Plant J. 23, 131 (2000).

153. M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Proc. Natl. Acad. Sci. USA 95, 14863 (1998).

154. F. Kose, W. Weckwerth, T. Linke, O. Fiehn, Bioinformatics 17, 1198 (2001).

155. W. Weckwerth, Annu. Rev. Plant Biol. 54, 669 (2003).

156. R. Goodacre, J. Exp. Bot. pp. 1–10 (2004).

157. D. B. Kell, Curr. Opin. Microbiol. 7, 296 (2004).

158. S. Chapman, P. Schenk, K. Kazan, J. Manners, Bioinformatics 18, 202 (2001).

159. K. R. Gabriel, Biometrika 58, 453 (1971).

160. W. J. Krzanowski, Biometrics 60, 517 (2004).

161. L. E. Peterson, Comput. Meth. Progr. Biomed. 69, 179 (2002).

162. O. Fiehn, Comp. Funct. Genom. 2, 155 (2001).

163. R. Steuer, J. Kurths, O. Fiehn, W. Weckwerth, Bioinformatics 19, 1019 (2003).

164. D. Camacho, A. de la Fuente, P. Mendes, Metabolomics 1, 53 (2005).

165. M. Liang, A. W. Cowley, A. Greene, J. Physiol. 554, 22 (2004).

166. P. G. Righetti, N. Campostrini, J. Pascali, M. Hamdan, H. Astner, Eur. J. Mass Spectrom. 10, 335 (2004).

167. S. Cho, S. Park, D. H. Lee, B. Park, J. Biochem. Mol. Biol. 37, 45 (2004).

168. J. Janin, B. Seraphin, Curr. Opin. Struct. Biol. 13, 383 (2003).

169. L. M. Raamsdonk, et al., Nature Biotech. 19, 45 (2001).

170. N. V. Reo, Drugs Chem. Toxicol. 25, 375 (2002).

171. L. W. Sumner, P. Mendes, R. A. Dixon, Phytochemistry 62, 817 (2003).

172. V. V. Tolstikov, A. Lommen, K. Nakanishi, N. Tanaka, O. Fiehn, Anal. Chem. 75, 6737 (2003).

173. S. G. Oliver, M. K. Winson, D. B. Kell, F. Baganz, Trends Biotechnol. 16, 373 (1998).

174. B. R. Baggett, et al., Electrophoresis 23, 1642 (2002).

175. T. Soga, et al., Anal. Chem. 74, 2233 (2002).

176. C. H. Schilling, S. Schuster, B. O. Palsson, R. Heinrich, Biotechnol. Prog. 15, 296 (1999).

177. J. K. Nicholson, J. C. Lindon, E. Holmes, Xenobiotica 29, 1181 (1999).

178. D. B. Kell, Mol. Biol. Rep. 29, 237 (2002).

179. K. H. Ott, N. Aranibar, B. Singh, G. W. Stockton, Phytochem. 62, 971 (2003).

180. D. E. Atkinson, Cellular energy metabolism and its regulation (Academic Press, New York, 1977).

181. K. Hayashi, N. Sakamoto, Dynamic analysis of enzyme systems. An introduction (Springer-Verlag, 1986).

182. R. Heinrich, S. Schuster, The regulation of cellular systems (Chapman & Hall, New York, 1996).

183. M. A. Savageau, Biochemical systems analysis (Addison-Wesley, Reading, 1976).

184. O. Fiehn, W. Weckwerth, Eur. J. Biochem. 270, 579 (2003).

185. R. Heinrich, T. A. Rapoport, Eur. J. Biochem. 42, 89 (1974).

186. J.-H. S. Hofmeyr, A. Cornish-Bowden, J. M. Rohwer, Eur. J. Biochem. 212, 833 (1993).

187. J.-H. S. Hofmeyr, A. Cornish-Bowden, J. Theor. Biol. 182, 371 (1996).

188. P. Mendes, Comput. Appl. Biosci. 9, 563 (1993).

189. P. Mendes, Trends Biochem. Sci. 22, 361 (1997).

190. P. Mendes, D. B. Kell, Bioinformatics 14, 869 (1998).

191. B. Teusink, et al., Eur. J. Biochem. 267, 5313 (2000).

192. L. Pritchard, D. B. Kell, Eur. J. Biochem. 269, 3894 (2002).

193. G. R. Cronwright, J. M. Rohwer, B. A. Prior, Appl. Env. Microb. 68, 4448 (2002).

194. C. G. Moles, P. Mendes, J. R. Banga, Genome Res. 13, 2467 (2003).

195. L. Neves, F. Lages, C. Lucas, FEBS Letters 565, 160 (2004).

196. T.-H. Toh, et al., FEMS Yeast Res. 1, 205 (2001).

197. W. Weckwerth, O. Fiehn, Curr. Opin. Biotech. 13, 156 (2002).

198. J.-H. S. Hofmeyr, H. Kacser, K. J. van der Merwe, Eur. J. Biochem. 155, 631 (1986).

199. H. V. Westerhoff, Y.-D. Chen, Eur. J. Biochem. 142, 425 (1984).

200. H. Kacser, J. Burns, Genetics 97, 639 (1981).

201. B. Teusink, F. Baganz, H. V. Westerhoff, S. G. Oliver, Meth. Microbiol. 26, 297 (1998).

202. K. Basso, et al., Nature Genet. 37, 382 (2005).

203. I. Nachman, A. Regev, N. Friedman, Bioinformatics 20, I248 (2004).

204. F. J. Bruggeman, B. N. Kholodenko, Mol. Biol. Rep. 29, 57 (2002).

205. H. de Jong, J. Comput. Biol. 9, 67 (2002).

206. P. Mendes, W. Sha, K. Ye, Bioinformatics 19, ii122 (2003).

207. A. de la Fuente, Deciphering living networks: Perturbation strategies for functional genomics, Ph.D. thesis, Vrije Universiteit Amsterdam (2006).

208. D. Thieffry, A. M. Huerta, E. Pérez-Rueda, J. Collado-Vides, BioEssays 20, 433 (1998).

209. H. Jeong, S. P. Mason, A.-L. Barabási, Z. N. Oltvai, Nature 411, 41 (2001).

210. M. G. Kendall, Biometrika 32, 277 (1942).

211. S. Wright, Ann. Math. Stat. 5, 161 (1934).

212. A. Kremling, et al., Genome Res. 14, 1773 (2004).

213. S. Hoops, et al., Bioinformatics 22, 3067 (2006).

214. R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria (2006).

215. D. Baulcombe, Science 315, 199 (2007).

216. L. Mendoza, D. Thieffry, E. R. Alvarez-Buylla, Bioinformatics 15, 593 (1999).

217. D. Camacho, P. Vera-Licona, R. Laubenbacher, P. Mendes, Comparison of reverse engineering methods using an in silico system (2007). Submitted.

218. Z. S. Chan, L. Collins, N. Kasabov, Biosystems 87, 299 (2007).

219. E. D. Jarvis, et al., J. Comp. Physiol. A. Neuroethol. Sens. Neural Behav. Physiol. 188, 961 (2002).

220. J. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, E. D. Jarvis, Bioinformatics 20, 3594 (2004).

221. A. A. Margolin, et al., BMC Bioinformatics 7, S7 (2006).

222. M. Bansal, V. Belcastro, A. Ambesi-Impiombato, D. di Bernardo, Mol. Syst. Biol. 3, 78 (2007).

223. E. Mihaliuk, H. Skødt, F. Hynne, P. G. Sørensen, K. Showalter, J. Phys. Chem. A 103, 8246 (1999).

224. W. Vance, A. Arkin, J. Ross, Proc. Natl. Acad. Sci. USA 99, 5816 (2002).

225. M. O. Vlad, A. Arkin, J. Ross, Proc. Natl. Acad. Sci. USA 101, 7223 (2004).

226. M. Andrec, B. N. Kholodenko, R. M. Levy, E. Sontag, J. Theor. Biol. 232, 427 (2005).

227. C. Peña-Reyes, M. Sipper, Artif. Intell. Med. 19, 1 (2000).

228. M. Clerc, J. Kennedy, IEEE Trans. Evol. Comput. 6, 58 (2002).

229. P. C. Fourie, A. A. Groenwold, Struct. Multidisc. Optim. 23, 259 (2002).

230. J. M. Varah, SIAM J. Sci. Stat. Comput. 6, 30 (1985).

231. K. Holmström, J. Petersson, Appl. Math. Comput. 126, 31 (2002).

232. D. Jukić, R. Scitovski, J. Comput. Appl. Math. 78, 317 (1997).

233. J. A. Jacquez, T. Perry, Am. J. Phys. 258, E727 (1990).

234. D. di Bernardo, et al., Nature Biotech. 23, 377 (2005).

235. L. Cyrne, L. Martins, L. Fernandes, H. S. Marinho, Free Rad. Biol. Med. 34, 385 (2003).

236. G. L. Wheeler, C. M. Grant, Physiol. Plantarum 120, 12 (2004).

237. S. Izawa, Y. Inoue, A. Kimura, Biochem. J. 320, 61 (1996).

238. A. M. Martins, et al., Yeast 24, 181 (2007).

239. G. Giaever, et al., Nature 418, 387 (2002).

240. E. A. Winzeler, B. Lee, J. H. McCusker, R. W. Davis, Parasitology 118 Suppl, S73 (1999).

241. B. M. Bolstad, R. A. Irizarry, M. Astrand, T. Speed, Bioinformatics 19, 185 (2003).

242. R. A. Irizarry, et al., Nucleic Acids Res. 31, e15 (2003).

243. J. D. Storey, R. Tibshirani, Proc. Natl. Acad. Sci. USA 100, 9440 (2003).

244. G. W. Thorpe, C. S. Fong, N. Alic, V. J. Higgins, I. W. Dawes, Proc. Natl. Acad. Sci. USA 101, 6564 (2004).

245. A. P. Gasch, et al., Mol. Biol. Cell 11, 4241 (2000).

246. D. C. Munhoz, L. E. S. Netto, J. Biol. Chem. 279, 35219 (2004).

247. C. Rodrigues-Pousada, T. Nevitt, R. Menezes, FEBS J. 272, 2639 (2005).

248. R. E. Dorsey, W. J. Mayer, J. Bus. Econ. Stat. 13, 53 (1995).

249. J. A. Foster, Nat. Rev. Genet. 2, 428 (2001).

250. T. Ideker, Nature Biotech. 22, 473 (2004).

251. N. Ishii, et al., Science 316, 593 (2007).

252. U. Sauer, M. Heinemann, N. Zamboni, Science 316, 550 (2007).

253. L. Hood, J. R. Heath, M. E. Phelps, B. Lin, Science 306, 640 (2004).

254. T. Ideker, T. Galitski, L. Hood, Annu. Rev. Genomics Hum. Genet. 2, 343 (2001).

255. K. Sanderson, Nature 447, 7 (2007).

256. J. Stark, R. Callard, M. Hubank, Trends Biotechnol. 21, 290 (2003).

257. S. Huang, Nature Biotech. 18, 471 (2000).

258. D. Butler, Nature 402, C67 (1999).

259. M. Tomita, et al., Bioinformatics 15, 72 (1999).

260. R. F. Doolittle, Nature 419, 493 (2002).

261. D. Segrè, Trends Biotechnol. 22, 261 (2004).

262. A. C. Clarke, 2001: A space odyssey (ROC, 2005).

Vita

Diogo M. Camacho was born on November 19, 1975, in Lisboa, Portugal. He received his B.Sc. degree in Biochemistry from the Universidade de Lisboa in 2001 under the supervision of Dr. Ana Ponces Freire. His undergraduate research focused on the study of glycolytic oscillations in Saccharomyces cerevisiae, with a strong component of computer simulations combined with laboratory experiments, work that he was invited to present at the 2nd European Workshop on Glycolytic Oscillations held in Copenhagen, Denmark, in February 2002. Shortly afterwards he began his graduate work at the Virginia Polytechnic Institute and State University, under the supervision of Dr. Pedro Mendes. Working at the Virginia Bioinformatics Institute, Diogo was one of the first students to enroll in the Genetics, Bioinformatics and Computational Biology Ph.D. program at Virginia Tech. During his doctoral work Diogo attended and presented work at several national and international conferences and workshops, for most of which he successfully obtained external funding. His research led to the publication of 3 peer-reviewed papers, with 2 more in preparation. His work on the origin of correlations in metabolomics data has received considerable attention in the community, totaling over 14 citations in a little over 2 years since publication, without being indexed in PubMed. Diogo has recently accepted a Post-Doctoral Fellow position in the Biomedical Engineering department of Boston University, under the supervision of Dr. James Collins, where he will continue his research in the fields of systems biology, computational biology and reverse engineering of biologically relevant networks.