

COMPLEX SYSTEMS BIOLOGY OF

MAMMALIAN CELL CYCLE SIGNALING IN CANCER

by

JAYANT AVVA

Submitted in partial fulfillment of the requirements

For the degree of Doctor of Philosophy

Dissertation Advisor: Dr. Sree N. Sreenath

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

May, 2011

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the thesis/dissertation of

JAYANT AVVA

candidate for the PhD degree *.

(signed) SREE N. SREENATH (chair of the committee)

KENNETH A. LOPARO

VIRA CHANKONG

MIHAJLO D. MESAROVIC

JAMES W. JACOBBERGER

(date) 12/01/2010

*We also certify that written approval has been obtained for any proprietary material contained therein.

Copyright © 2011 by Jayant Avva All rights reserved

Table of Contents

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT
1. INTRODUCTION
1.1. Overview
1.2. Chapter Organization
1.3. Motivation
1.3.1. Cell Cycle
1.4. Computational models
1.4.1. Necessity of dynamics
1.4.2. Paucity of organized time profile data in cell signaling
1.4.3. Extracting dynamics out of statically sampled data
1.5. State of the art
1.6. Thesis contribution
1.7. Thesis organization
2. BACKGROUND
2.1. Overview
2.2. Chapter Organization
2.3. Introduction
2.4.
2.4.1. Different types of Biologies
2.4.2. Complex Systems Biology
2.4.3. Hierarchical and multi-level paradigm - concepts and significance
2.5. Cross-level causality
2.5.1. Multiscale modeling
2.6. Modeling
2.6.1. Importance of modeling
2.6.2. Contextual view of types of models in cancer biology studies
2.6.3. Mathematical and Computational Models
2.6.4. Phenomenological vs. Mechanistic Models
2.6.5. Static vs. Dynamic Models
2.6.6. Deterministic vs. Probabilistic Models
2.6.7. Dominant relationship modeling: Unmodeled dynamics
2.6.8. Modeling Approaches: How does one go about modeling
2.6.9. Modeling errors and rectification
2.6.10. Mathematical formalism
2.6.10.1. Use of mass action modeling: An example
2.6.10.2. General system modeling: Using mass action modeling
2.6.11. Calibration and Validation
2.7. Role of Data in Building Predictive Models
2.7.1. Data measurement introduction
2.7.2. Data measurement processes
2.7.3. Western Blotting
2.7.4. Flow Cytometry
2.8. Data driven systems biology thinking
2.8.1. In vivo data
2.8.2. Ex vivo data
2.8.3. In vitro data
2.8.4. Measurement decisions
3. TIME PROFILE EXTRACTION FROM WET LAB DATA
3.1. Overview
3.2. Chapter Organization
3.3. Introduction
3.4. Importance of cytometry data
3.5. State of the art
3.5.1. Classification/comparison of time profile data generation methods
3.5.2. Need for our method
3.6. Dynamic time profile extraction methodology
3.6.1. Generic methodology
3.6.1.1. Experimental setup
3.6.1.2. Pre-processing
3.6.1.3. Phase-specific processing
3.6.1.4. Postprocessing
3.6.1.5. Replicated filtered data
3.6.1.6. Testing the methodology for repeatability
3.6.2. Application to K562 cells
3.6.2.1. Experimental setup
3.6.2.2. Pre-processing
3.6.2.3. Phase-specific processing
3.6.2.4. G1 and S Phase Time Course
3.6.2.5. G2 Phase Time Course
3.6.2.6. M Phase Time Course
3.6.3. Postprocessing
3.6.3.1. Single color correction
3.6.3.2. Practical issues
3.6.3.3. Testing methodology for reproducibility
3.6.3.4. Testing methodology on MOLT4 cell line data
3.6.3.5. Reproduced filtered data
3.6.3.6. Theoretical formulation of data variation
3.7. CytoSys – a software for time profile extraction
3.7.1. Introduction
3.7.2. Data Input
3.7.3. Data Structure
3.7.4. Processing protocol
3.7.5. File Structure in CytoSys
3.7.6. Salient Features
3.7.6.1. Gaussian fits to data
3.7.6.2. Generic problem formulation
3.7.6.3. Problem formulation: monotonic weight constraints
3.7.6.4. Single color correction
3.8. Future work
3.9. Summary
4. CELL CYCLE MODEL
4.1. Overview
4.2. Chapter Organization
4.3. Cell cycle
4.4. Cell cycle control system
4.5. Modeling of the cell cycle overview
4.6. Modified Tyson model calibration attempts
4.6.1. Replicating Tyson model outputs
4.6.2. Calibrating the model with our cell cycle data
4.7. Cell population model preview
5. UPSTREAM MODELING
5.1. Overview
5.2. Chapter Organization
5.3. Introduction
5.3.1. Cancer stem cells
5.4. Cell signaling
5.4.1. Flt3 receptor
5.4.2. Ras-Raf-MEK-ERK pathway
5.4.3. PI3K-Akt signaling
5.4.4. mTOR pathway
5.5. Model building
5.5.1. Mathematical model of Flt3 signaling pathway
5.5.2. Modularization
5.5.3. Modeling automation
5.5.4. Future work
6. CONCLUSION
6.1. Thesis review
6.2. Future work
6.2.1. Methodology
6.2.2. Hierarchical modeling
6.2.3. Data measurement modeling and error quantification
6.2.4. Future directions
6.3. Additional thoughts
APPENDIX A. Model reactions
APPENDIX B. Modularization methods
APPENDIX C. Single color correction
APPENDIX D. CytoSys
Instructions
Precautions & Notes
Phase definition file
Variable list file
BIBLIOGRAPHY

LIST OF TABLES

TABLE 2.1 Equation 2.1 reaction stoichiometries. Column 3 is obtained through application of equation 2.8.
TABLE 2.2 Flow cytometry measurement error percentages.
TABLE 3.1 Computation of percentage time that a nominal cell spends in each phase.

LIST OF FIGURES

Figure 1.1. Systems biology iterative cycle [1].

Figure 1.2. Acquired Capabilities of Cancer [22].

Figure 1.3. ‘State of the art’ cell cycle control system and time curves of key players (cyclins) [2].

Figure 1.4. Synchronized cell population becoming asynchronous with time [17].

Figure 1.5. The multilevel hierarchical context envisioned in this work.

Figure 2.1. (R) Evolutionary view of systems biology: From data to network to systems to complex systems.

Figure 2.2. Categorization of approaches to biological problems based on research focus.

Figure 2.3. Rationale for formation [71].

Figure 2.4. Modeling architectures [1].

Figure 2.5. Model development process (feedback loops indicative - not limited to those shown).

Figure 2.6. Parameter estimation steps.

Figure 2.7. Western Blot example [90]. Cyclin B1 expression is dependent on amount of Trypsin inhibitor here.

Figure 2.8. Flow cytometry example [90]. Dynamic time profiles of Cyclin B1 obtained through single flow cytometry measurements (for different cell lines).

Figure 2.9. Western Blotting proxies (modified from Wolkenhauer's unpublished textbook [43]).

Figure 2.10. Data collection using flow cytometry (Image adapted from the Invitrogen flow cytometry tutorial). Here BPF refers to Band Pass Filter.

Figure 2.11. Decision making in flow cytometry. The choice of model system, data source, measurement method and biochemical to be measured in our flow cytometry work is mapped to illustrate our thinking.

Figure 3.1. Flow Cytometry data pre-processing steps include (A-C) Fluorescence compensation (done by applying a bias) (D) Removal of doublets (done by gating out cells as shown) (E-H) Minimizing the effects of non-specific binding (done by applying a bias). These steps were done in WinList (from Verity Software), a Flow Cytometry listmode data analysis program.

Figure 3.2. (A) Separation of cells in Interphase (R3) and mitosis (R2) (B) Gating of mitotics into prophase (R19 (oval) - a blue dot marks the mean of the prophase data), pro-metaphase (horizontal span except for R19 and R24), metaphase (R24) and late mitosis (vertical span except for R24) (C) Gaussian fit to prophase of Cyclin A2 histogram demonstrating characteristics of prophase data (D) Gating of cells in Interphase (a blue dot marks the mean value of cells about to exit S and enter G2) (E) Gaussian fits to Cyclin A2 data for G2 cells (the left blue bar corresponds to the blue dot in (d) and the right blue dot corresponds to the blue dot in (b)) (F) Time profile (or Frequency profile) of Cyclin A2 obtained as a result.

Figure 3.3. Generation of the Cyclin B1 time curve. (A) G1 (R5) and G2 (R6) data clusters are shown for Cyclin A2 vs. DNA (B) The G2 only cells Cyclin B1 vs. DNA distribution is shown (C) Gaussian fits to the Cyclin B1 data (log) in G2 (D) G1 only cells Cyclin B1 vs. DNA distribution (E) Gaussian fits to Cyclin B1 data (log) in G1B (the upper 2 gates in (D) represent the second subphase in G1, i.e. G1B) (F) Resultant Cyclin B1 time profile.

Figure 3.4. The gates used and the corresponding section of the Cyclin A2 time curve (highlighted) are shown phase-wise for (A,E) G1, (B,F) S, (C,G) G2 (time curve shown is the result of the Gaussian fits in Figure 3.2(E)) and (D,H) M.

Figure 3.5. Features of scaled cyclin A2 and B1 time profiles in late G2 and M phases.

Figure 3.6. Comparison of the change in late G2/M features in different applications of the time profile extraction methodology to the K562 data (A,B) before single color correction and (C,D) after single color correction.

Figure 3.7. Single color correction for Cyclin A2 ((A) pre-scaling and (C) post-scaling) and Cyclin B1 ((B) pre-scaling and (D) post-scaling). The equations that were used for scaling are also given, and their derivation is shown in Appendix 8.3.

Figure 3.8. All expression profiles (Only cyclins A2 and B1 are truly comparable here. Simple visual scaling was done for the other expression profiles for display purposes.)

Figure 3.9. Comparison of analysis of the same data set done using three independent sets of gates (manual clustering repeated three times). (A,B) show unscaled data, and (C,D) show data after single color scaling.

Figure 3.10. Standard deviation between multiple gating attempts. Note that the zero point value here for cyclin A2 is zero. It would have been flat in G1 had we assigned the median value of G1.

Figure 3.11. Standard deviation time profiles of cyclins A2 and B1. Here we notice a sharp increase in mitosis, however G2 is relatively uneventful. What is interesting, and expected, is that the standard deviation of the data is the lowest in S phase. In fact there is a sharp dip at the G1/S transition. This data appears, based solely on its standard deviation profile, to offer the best tradeoff between measurement error and biological variance. Further investigation is required to confirm this statement.

Figure 3.12. Application of extraction methodology to the MOLT4 cell line data.

Figure 3.13. (A) Original data. (B) Reproduced data (Linearly blended CV).

Figure 3.14. Comparison of real (A,B) and reproduced filtered data (C,D) where the cyclins are in log scale. Random variation added to time profiles shown in Figures 3.2, 3.3. Both original and reproduced data have 100624 points.

Figure 3.15. Data processing diagram.

Figure 3.16. Data stages in CytoSys.

Figure 3.17. CytoSys data processing protocol.

Figure 3.18. Folder structure in CytoSys.

Figure 3.19. Prophase cells histogram with mitotics as inset.

Figure 3.20. Example of Gaussian fit to Cyclin A2 (log scale) histogram for cells in G2.

Figure 3.21. CytoSys integration scheme (Our long-term view of data flow through CytoSys).

Figure 4.1. Cell cycle phases (figure reproduced from [106]).

Figure 4.2. Mitosis [107].

Figure 4.3. CDK regulation can be done either via Cyclin availability, or through phosphorylation or through inhibitors (CKIs).

Figure 4.4. Sequence of cell cycle control system initiated activities that constitute the cell cycle.

Figure 4.5. The Novak04 model components. Black cartoon reproduced from publication. Red names and arrows were added to indicate modeled reactions not originally included in diagram (Weis MC qualifier).

Figure 4.6. Tyson model refit to match published output curves.

Figure 4.7. Plots of the K562 data versus current (refit) Novak04 model output.

Figure 4.8. Two best fit solutions for a K562 cyclin B maximum (and corresponding cyclin A ratio) estimated as an additional scaling parameter. The top solutions with the scaling parameter being optimized to be 1.6 and the bottom ones it is optimized to 1.

Figure 5.1. Hematopoietic [23].

Figure 5.2. Flt3 receptor structure and activation mechanism [23].

Figure 5.3. MAPK core processes.

Figure 5.4. mTOR’s role in cell signaling [154].

Figure 5.5. Pathway computational modeling assumptions.

Figure 5.6. Model conception.

Figure 5.7. Biochemical reaction schemes of internalization, adapted from [35].

Figure 5.8. Schoeberl MAPK model [35] combined with ERKP and ERKPP activation of RSK and downstream (activation of CREB transcription factor, and transcription of the Cyclin A1 gene). The Schoeberl notation has variable numbers in navy blue (with internalized variables shown in brackets), and reaction numbers shown in dark green, with ‘v’ preceding the number.

Figure 5.9. PI3K-Akt-mTOR pathway model. The receptor has to be changed to account for the fact that the Flt3 ligand is a dimer. The notation used for the non-Schoeberl portion of the pathway was light blue for the variable numbers (with internalized variables shown in brackets, and the variable associated with the coated-pit protein shown in orange font) and light green for reaction numbers preceded by a ‘v’. S6* (activated S6 transcription activity) is unmodeled. eIF4E* role in translation initiation of Cyclin D1 is shown in 1 step.

Figure 5.10. MAPK modules. Biochemical contiguity, functionality and the specific pathway context are combined to form modules. A color is associated with each module, and this is reflected in the block diagram shown in Figure 5.12.

Figure 5.11. PI3K-Akt-mTOR modules.

Figure 5.12. Block diagram of entire Flt3 signaling pathway. Flt3 ligand dimer (L2) is the system input. Cyclin A1 and Cyclin D1 are the system outputs.

Figure 5.13. Hierarchical representation of Flt3 signaling pathway modules.

Figure 5.14. Model calibration logic.

Figure 5.15. (A) Flt3 model MAPK simulations (B) Simulations from Schoeberl MAPK paper.

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to Prof. Sree N. Sreenath, Prof.

James W. Jacobberger and Prof. Mihajlo D. Mesarovic for their invaluable

advice, criticism, support and guidance throughout my study. One page of

acknowledgements is too small a space to discuss all that I owe them. Prof.

Sreenath has been far more than an academic advisor to me, and I would like to acknowledge all the mentoring on multiple fronts. I would like to thank Prof.

Mesarovic for training me in systems thinking; if I haven’t mastered the same the

fault is my own. I would like to also point out my appreciation for the countless

sessions with Prof. Jacobberger that intended my edification in biology/cytometry

and general research attitude. These no doubt took up valuable time for both of

us, and while he wouldn’t say it, perhaps tested his patience, but their effect is

more far reaching than the work in this thesis may reflect. My words of

appreciation to my dissertation committee, Prof. Ken Loparo and Prof. Vira

Chankong, for their time and advice.

I would also like to take this opportunity to thank members of the Sreenath

Lab:

Dr. Radina P. Soebiyanto and Michael C. Weis who helped me in too many ways

to enumerate. A large part of my learning was through my interactions with them.

My appreciation to Dr. Evren Gurkan-Cavusoglu and Dr. Reza Jamasebi for

insightful discussions. I would like to thank Akshay Sridhar and Abhijit Kaushik for working with me on gathering the K562 cell line flow cytometry data. I would

also like to thank Mike Sramkoski for supervising us during those data gathering


sessions, and for lots of help and discussions over the last three years. I would also like to acknowledge Tammy Stefan for the MOLT4 cell line data. I would also like to thank my dear friend, the late Marla Radvansky for so many different little kindnesses. I would like to thank Tracy Rehl for her scheduling help.

I would also like to acknowledge my brother for encouraging me to join a

PhD in the first place, and ‘setting me straight’. In addition I would like to thank a friend who has been a source of enormous support. Finally I would like to thank my parents for being who they are. They have consistently taught me through example that the purpose of education is always building one’s character, no matter the subject.


Complex Systems Biology of Mammalian Cell Cycle Signaling in Cancer

Abstract

by

JAYANT AVVA

We present here a complex systems biology approach towards elucidating the role of mammalian cell cycle signaling in cancer. In this context, availability of copious amounts of biological data has done little to alter the paucity of contextually consistent dynamic time profile data, necessary for the calibration and validation of dynamic models of the biological systems. Such computational models form the heart of the complex systems biology approach, and paucity of appropriate data is an immediate impediment. To address this problem, we developed a novel methodology to filter measurement noise and extract time profile variation of cell cycle biochemicals from statically sampled flow cytometry data. Taking a hierarchical viewpoint, a mathematical model of the upstream signaling from the receptor through to the nucleus was developed. A computational model of the downstream cell cycle control system based on a modified Tyson’s mathematical model that uses our time profile extraction methodology for calibration was also developed. The approach was demonstrated separately using K562 and MOLT4 cell line experimental data from the wetlab. We built custom software, CytoSys, to facilitate the application of our methodology.


1. INTRODUCTION

1.1. Overview

This thesis presents a complex systems biology approach to understanding

the biochemical signaling network that regulates the mammalian cell cycle. In this

context, we elaborate on the role of data in achieving such understanding, and

present a methodology to help us harness embedded time profile data from

statically sampled multi-variable flow cytometry data. We demonstrate the use of such data for calibrating mathematical models.

1.2. Chapter Organization

Motivation for our approach is presented in Section 1.3. We introduce the different hallmarks of cancer here, and make a case for how the cell cycle plays a pivotal role with respect to one of these key hallmarks. Section 1.4 introduces the importance of dynamics in systems biology models, and specifically the importance of time profile data. Section 1.5 bolsters this argument by reviewing the state of the art. Section 1.6 introduces the contributions of this thesis, and Section 1.7 presents the thesis organization.

1.3. Motivation

This thesis contributes to an understanding of the mammalian cell cycle in

health and disease using a complex systems biology approach. In the Complex

Systems Biology approach, viewing biological systems as multilevel hierarchical

systems, one generates an image of the biological system via a computational

modeling approach founded on , and uses this image as a proxy for

the biological system to experiment with (systems biology iterative cycle [1],

shown in Figure 1.1). The data used here come from leukemic cancer cells and were obtained through flow cytometry measurements. More broadly, this approach and methodology can be applied to other diseases.

The systems biology iterative cycle (Figure 1.1) is an intimate combination

of computational analysis and wet-lab experimentation. The cycle is initiated by a biological hypothesis that prompts investigators to generate experimental data. A mathematical model of the biological process under investigation is then developed and made computational using the experimental data. A detailed explanation of this process, and of the differences between biological, mathematical and computational models, is provided in Section 2.4. Principles of

theory are used in computationally analyzing the system,

and this drives further wet-lab experiments. The fresh data thus generated is

then used to update the mathematical model. The initial hypothesis can be

modified or completely altered if there is a large discrepancy between the wet-lab

results and the model simulations. Thus, we start with hypothesis generation,

which drives wet lab experimentation, which in turn generates the data used in

calibrating/validating a mathematical model of the biological process under study,

and simulating this model brings us a full circle back to hypothesis acceptance or

rejection and further research.


Figure 1.1 Systems biology iterative cycle [1].

While copious literature on the cell cycle exists [2-11], a fully functional computational model reflective of actual biological data has been elusive [4,12-20]. Here, we attempt to use biological data gathered using flow cytometry to build precisely such a computational model as a fundamental step toward complex systems biology analysis.

This motivation is rooted in the thoroughly practical aim of understanding

disease mechanisms in human systems with the hope that this knowledge would

be useful in designing therapy that is personalized [21]. We chose to focus on

cancer due to its high relevance, ubiquity and the fact that it is a truly systemic

disease, and would therefore challenge us to apply nothing short of a systems

approach [22]. Additionally, we chose to focus on leukemic cancer, since:


a. Leukemic cells are sampled easily (by drawing the patient's blood), in contrast with solid tissue, which requires a more invasive procedure (cutting out tissue).

b. Leukemia is characterized by altered cell sub-population percentages in different branches of the cell differentiation hierarchy [23]. A measurement method such as flow cytometry is the best available choice to give us such cell sub-population data in a relatively easy manner [24-26].

Hanahan and Weinberg’s seminal paper [22] on the hallmarks of cancer

provides insight into why cancer requires systemic aberration. Six hallmarks

(functional aberrations) are listed as working in concert to give cancer its system-

wide quality (Figure 1.2):

1. Sustained angiogenesis

2. Limitless replicative potential

3. Evasion of apoptosis

4. Insensitivity to anti-growth signals

5. Self-sufficiency in growth signals

6. Tissue invasion and metastasis


Figure 1.2. Acquired Capabilities of Cancer [22]

These hallmarks offer us different vantage points from which to understand cancer, and perhaps different avenues to approach a solution. In this thesis, we chose to focus on limitless replicative potential of cancer cells. We identify certain basic open problems in the systems biology of the cell cycle process and provide approaches and methodologies to address this challenging and vast area.

1.3.1. Cell Cycle

The cell cycle is a highly regulated process through which cells replicate [2].

It is our task here to model this natural process in mammalian systems in order to understand limitless replicative potential of cancer. Knowledge of the key cell cycle molecules mutated in cancers [7] serves as a basic guide. Specifically we wanted to model the processes that make the cell cycle function the way it does.

The literature provides extensive information about the protein interaction kinetics that form the basis for signal transduction (upstream) [1] and the cell cycle control system (downstream) [19]. These can be translated into a mechanistic model (a model of the actual molecular mechanisms that replicates the biological process), essentially defined by a set of equations based on mass action (or its approximations, such as Michaelis-Menten kinetics) [27,28].
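To make the distinction between mass action and its Michaelis-Menten approximation concrete, the following minimal Python sketch simulates a single enzymatic reaction both ways. It is an illustration only and is not taken from the models in this thesis: the reaction scheme E + S <-> ES -> E + P, the rate constants, the concentrations and all function names are hypothetical choices made for the example.

```python
# Hypothetical rate constants and concentrations, chosen only for illustration.
K_ON, K_OFF, K_CAT = 10.0, 1.0, 1.0   # binding, unbinding, catalysis
E_TOTAL, S0 = 0.1, 1.0                # total enzyme, initial substrate


def mass_action(state):
    """Right-hand side for E + S <-> ES -> E + P under the law of mass action."""
    s, es, p = state
    e = E_TOTAL - es                  # free enzyme from conservation of enzyme
    bind = K_ON * e * s
    unbind = K_OFF * es
    cat = K_CAT * es
    return (unbind - bind, bind - unbind - cat, cat)


def michaelis_menten(state):
    """Reduced model: the complex ES is assumed to be at quasi-steady state."""
    s, p = state
    km = (K_OFF + K_CAT) / K_ON
    rate = K_CAT * E_TOTAL * s / (km + s)
    return (-rate, rate)


def rk4(rhs, state, dt, steps):
    """Integrate d(state)/dt = rhs(state) with the classical Runge-Kutta scheme."""
    for _ in range(steps):
        k1 = rhs(state)
        k2 = rhs(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)))
        k3 = rhs(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)))
        k4 = rhs(tuple(x + dt * k for x, k in zip(state, k3)))
        state = tuple(
            x + dt * (a + 2 * b + 2 * c + d) / 6
            for x, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
    return state


# Simulate both models up to t = 10 (arbitrary time units).
full = rk4(mass_action, (S0, 0.0, 0.0), dt=0.001, steps=10000)
reduced = rk4(michaelis_menten, (S0, 0.0), dt=0.001, steps=10000)
print("product, full mass action :", round(full[2], 3))
print("product, Michaelis-Menten :", round(reduced[1], 3))
```

With these assumed values the total enzyme is small relative to the substrate, which is the regime in which the reduced model is expected to track the full mass action model closely. The reduced model also has one fewer state variable, which is why such approximations are attractive when data are scarce.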

1.4. Computational models

1.4.1. Necessity of dynamics

Mathematical models of signaling in biology are largely deterministic models composed of Ordinary Differential Equations (ODEs; see footnote 1), which are appropriate at this level of system [1,29,30]. These models are

primarily based on the law of mass action [31] and are intended to be able to

accurately computationally simulate different system variables (primarily protein

concentrations) as functions of time. However, this intention is often challenging to realize in practice. Realizing it involves two steps: calibration and validation.

Calibration refers to fitting a mathematical model of a process to real data.

In other words the model coefficients (rates, initial conditions of biochemicals

etc.) are adjusted so that the model outputs match real data. The process of

validation involves ensuring that the computationally simulated model and the system under consideration give the “same” results when subjected to a new set of experimental conditions that differ from the ones used for calibration.

1 In systems biology, ODEs are used when the system under investigation is well stirred (uniform spatial distribution of biochemical species). This can be extended to model compartments when the compartments themselves are well-stirred, and the rates of transport between the compartments are observable [254].

We refer to the calibrated and validated mathematical model as a computational

model to distinguish it from the mathematical model. Computational models are

useful in accurately predicting . However the calibration and

validation steps are possible only if we have appropriate biological data. Data

that is appropriate for the calibration and validation of deterministic ODE models

is dynamic in nature. Dynamic data refers to the variation of a biochemical

concentration with time in this thesis.
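The two steps can be sketched on a deliberately small example. The Python fragment below is not the calibration procedure used in this thesis. It assumes a hypothetical one-coefficient model (first-order decay, dx/dt = -k_deg * x), synthetic "measurements" generated inside the script, and a simple grid search in place of a real parameter estimation method. All names and numbers are assumptions made for illustration.

```python
import math


def simulate(k_deg, x0, times):
    """Closed-form solution of dx/dt = -k_deg * x, a one-coefficient toy model."""
    return [x0 * math.exp(-k_deg * t) for t in times]


def sum_squared_error(predicted, observed):
    """Misfit between a simulated time profile and a measured one."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def calibrate(times, observed, x0, candidates):
    """Pick the rate constant whose simulated profile best matches the data."""
    return min(
        candidates,
        key=lambda k: sum_squared_error(simulate(k, x0, times), observed),
    )


# Synthetic "measurements" generated here for illustration; not thesis data.
TRUE_K = 0.30
times = [0.0, 1.0, 2.0, 4.0, 8.0]
noise = [0.01, -0.02, 0.015, -0.01, 0.005]
calibration_data = [v + n for v, n in zip(simulate(TRUE_K, 1.0, times), noise)]
validation_data = [
    v + n for v, n in zip(simulate(TRUE_K, 2.5, times), reversed(noise))
]

# Calibration: adjust the coefficient so the model output matches the data.
grid = [i / 1000 for i in range(1, 1001)]   # candidate rate constants 0.001..1.000
k_fit = calibrate(times, calibration_data, 1.0, grid)

# Validation: same coefficient, new experimental condition (different initial amount).
validation_error = sum_squared_error(simulate(k_fit, 2.5, times), validation_data)
print("calibrated k_deg:", k_fit)
print("validation sum of squared errors:", round(validation_error, 5))
```

The essential point is that the validation data are never seen by the calibration step. A small validation error therefore says something about the model, not just about the fit.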

Remark 1.1: The terms kinetic data, dynamic data, time curve data, time profile

data and expression profile data are generally used interchangeably. We however will use ‘time profile data’ throughout this work.

1.4.2. Paucity of organized time profile data in cell signaling

Organized, consistent attempts at generating time profile data in cell signaling studies are scarce compared to the number and sheer extent of the modeling attempts that have been made [1,32-34]. This does not imply that time profiles for several of the biochemicals of

interest in different signaling pathways do not exist. However they are not

available in the same context (i.e. for the same cell line, under the same

experimental conditions). Putting together time curves from diverse experiments

cannot be justified ultimately and is unsound scientific practice.

Due to the large amount of effort involved in generating the time variations


(or dynamics) of protein concentrations, models which contain tens or even

hundreds of variables have contextually consistent time profile data available

only for a few of these variables. For example, one of the most respected, and

well-established models is the MAPK cascade signaling model from the

Schoeberl group [35]. It is a valiant attempt at modeling this complex pathway

from first principles and an important landmark in signaling models. The model

has 94 state variables, 125 parameters, 6 outputs and 1 input.

Remark 1.2. In the context of mathematical models, the term parameter usually

refers to values in the model that are not state variables. For example, in ODE

models, parameters would refer to rate constants. Some definitions of this term

also group initial conditions of state variables with rate constants under the term

parameters. Research in flow cytometry, however, refers to a state variable being measured as a parameter. Keeping in view that this is an engineering thesis, the

term ‘parameters’ refers to rate constants only throughout this work. We instead

use the term coefficients to denote both parameters and initial conditions.

Our work here is based on the implicit recognition of the crucial importance

of data in models, particularly signaling models. This importance must be explained adequately before we proceed further. A mathematical model with many variables but very limited data leads to inaccurate computational models.

For the mathematical model to be computational, we must have numerical values

of all the coefficients in the model. The best case scenario would be to have a

mathematical model, and copious amounts of time profile data for all model state

variables under heterogeneous experimental conditions. This scenario is never


available in computational biology practice. The most common scenario, in stark

contrast, is a mathematical model, and time profile data for a few of the state

variables. The number of state variable dynamics available is usually too small to reliably estimate the coefficient values of the model in a meaningful way. In such

an event, modelers usually start out by trying to minimize the number of unknown

coefficients by adopting values culled from the literature. These values are from

disparate sources. The better amongst these are time profile data from similar organisms or cell lines, obtained under similar experimental conditions and used with a tacit recognition of the numerous practical disparities between what appear to be identical experiments [36]. These are in the minority, however.

Current models obtain their coefficients from heterogeneous systems and diverse sources [32,35-38]. This is the current state of computational modeling data availability, and naturally this lack of precision/accuracy breeds errors in the building and simulation of such models. This host of problems has its roots in one problem alone – paucity of contextually consistent data, in our specific case, time profile data.

The ubiquitous paucity of contextually consistent time profile data limits the usefulness of the Schoeberl model. However, the authors are able to replicate measurements of the model outputs, and they do so by combining coefficients derived from different species with state-of-the-art parameter estimation methods.

This is very often how signaling studies tend to solve the problem of replicating model outputs. While it is understood that such solutions are not identical with biological reality, they do sharpen our understanding of that reality by raising relevant research questions. For instance, in [39], the authors show that using the systems concept of a coordinator helped them reduce the number of coefficients that must be subjected to further biological investigation by up to one order of magnitude. Another example is given in [40], where a computational model of ErbB1 receptor trafficking and signaling is used in conjunction with experimental results to conjecture that sensitivity to the drug gefitinib is a marker of the reliance on protein kinase Akt signaling for cell survival that may be brought about by impaired receptor internalization.

In complex biological systems modeling, it is our experience that the addition of even a single unknown parameter could increase the system complexity exponentially. This may be why parameter estimation solutions to fitting time profile data are treated with healthy skepticism. In fact, when a sensitivity analysis is performed to show output dependence on parameters, it is often possible to get the same results for different parameter sets [38,41]. This is, of course, not a criticism of the method, which is the state of the art and appears to be the main way forward in the short term. It is rather an appreciation of the true severity of the problem of time profile data paucity. Although the modeling approach we have chosen is advantageous in that it largely uses physical constants such as rates, it would be very difficult to verify that the parameter values actually occur in biology.
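The observation that different parameter sets can reproduce the same output can be illustrated with a deliberately small example. The sketch below assumes a hypothetical one-state model whose observed signal is y(t) = scale * x0 * exp(-k * t); the names scale, x0 and k are illustrative and are not taken from any model discussed here. Only the product scale * x0 affects the output, so two distinct parameter sets cannot be told apart from measurements of y alone.

```python
import math

def observed_signal(scale, x0, k, times):
    """Output y(t) = scale * x0 * exp(-k * t) of a hypothetical one-state decay model."""
    return [scale * x0 * math.exp(-k * t) for t in times]

times = [0.0, 0.5, 1.0, 2.0, 4.0]

# Two different (hypothetical) parameter sets that share the product scale * x0.
set_a = {"scale": 2.0, "x0": 5.0, "k": 0.3}
set_b = {"scale": 10.0, "x0": 1.0, "k": 0.3}

y_a = observed_signal(times=times, **set_a)
y_b = observed_signal(times=times, **set_b)

# Largest pointwise difference between the two simulated outputs.
max_diff = max(abs(a - b) for a, b in zip(y_a, y_b))
print("parameter sets differ:", set_a != set_b)
print("largest output difference:", max_diff)
```

In this toy case, fitting y(t) can at best recover the product scale * x0 and the rate k; separating scale from x0 would need an additional, independent measurement.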

There are a few research groups who understand the severity of this lack of time profile data in such models. For instance, Sigal et al.² [42] have focused on such measurements using human H1299 lung carcinoma cells. Their results are normalized protein levels as functions of time measured in cell generations.

Similarly, the Sorger group³ has systematically undertaken modeling and also generates the corresponding time profile data [38]. Moreover, Sorger's group not only pursues a mechanistic modeling approach (modeling the molecular mechanisms underlying a biological process) that is bolstered by its own time profile data collection, but also gathers data that helps fuel a parallel data-driven approach.

1.4.3. Extracting dynamics out of statically sampled data

We start out by defining statically sampled data.

Definition 1.2:

Statically sampled data simply refers to data that doesn’t explicitly include biochemical concentrations as functions of time.

Currently there is far more statically sampled data available than there is time profile data [43]. This is probably because generating time profile data about a biochemical means being able to measure it as it changes over time in its native environment, or in a closely similar environment. Such an environment is causally linked to the underlying intracellular mechanisms of the cell, and with most measurement techniques it is difficult to extract measurements from live and intact cells. In contrast, a static measurement of a biochemical requires only a measurement of relative or absolute quantity. While ideally preservation of the cell’s microenvironment means a more biologically realistic measurement, the fact that we are measuring only one static snapshot of the cell means that it is relatively easier to freeze the cell’s microenvironment at a certain point in time and capture the measurement. As a consequence, static data is far more widely available than its dynamic counterpart.

² Alon group at the Weizmann Institute of Science, Israel
³ Department of Systems Biology, Harvard Medical School

While there are numerous ways of measuring biochemical concentrations and their changes in a cell, our experience indicates that flow cytometry furnishes us with a relatively high-throughput data gathering method that has embedded in it information about cell cycle time [44,45]. Flow cytometry enables us to measure the relative content of a biochemical by tagging it with a fluorescent chemical and measuring the fluorescence from that chemical as a proxy for the amount of biochemical relative to the sample. It can also be used to measure absolute amounts (e.g. molarity) if we know the molecular weight of the biochemical and the absolute amount of the overall sample [46].

There is a wealth of flow cytometry data available currently [47]. Moreover, due to technological progress, it is relatively easy to generate flow cytometry data in its current form [6,47]. However, when we require time profile data, fresh kinetic experiments must be performed. Kinetic experiments refer to wet lab experiments that involve measuring the same biochemical in a cell sample at different points of time. This is always labor intensive, given the state of today’s

measurement art. The research in improving the state of today’s measurement

techniques will probably pay off in the future, but until then we see a large gap

between the technical sophistication of computational models, and the measurement techniques used to generate data to make these models

meaningful. In such a scenario, there is the need for a methodology to harness

data that comes out of existing measurement methods such as flow cytometry to

generate time profile data. This thesis aims to do precisely this.
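To make the idea of recovering an ordering from a static snapshot concrete, the following is a minimal sketch and not the methodology developed in Chapter 3. It assumes, purely for illustration, that each cell in a snapshot carries a marker that increases with progress through the cycle (relative DNA content is used here as a stand-in), so that sorting cells by that marker and averaging a second measurement within bins yields a coarse pseudo-time profile. The function name, the data values and the bin count are all hypothetical.

```python
def pseudo_time_profile(cells, n_bins):
    """Order (marker, protein) pairs by the marker and average within equal-count bins."""
    if n_bins < 1 or n_bins > len(cells):
        raise ValueError("n_bins must be between 1 and the number of cells")
    ordered = sorted(cells, key=lambda cell: cell[0])
    bin_size = len(ordered) // n_bins
    profile = []
    for i in range(n_bins):
        start = i * bin_size
        # The last bin absorbs any remainder cells.
        end = (i + 1) * bin_size if i < n_bins - 1 else len(ordered)
        chunk = ordered[start:end]
        mean_marker = sum(c[0] for c in chunk) / len(chunk)
        mean_protein = sum(c[1] for c in chunk) / len(chunk)
        profile.append((mean_marker, mean_protein))
    return profile

# Hypothetical snapshot: (relative DNA content, relative protein fluorescence) per cell,
# listed in arbitrary order, as a single static measurement would record them.
snapshot = [(1.8, 7.9), (1.0, 2.1), (1.4, 5.2), (2.0, 9.0),
            (1.1, 2.4), (1.6, 6.1), (1.2, 3.0), (1.9, 8.3)]

profile = pseudo_time_profile(snapshot, n_bins=4)
for marker, protein in profile:
    print(f"marker ~ {marker:.2f}  mean protein ~ {protein:.2f}")
```

The ordering here is only as good as the assumed marker. Where the marker does not change, for example DNA content within G1, it cannot order cells, which is one reason a more careful methodology is needed.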

1.5. State of the art

One of the key issues in cell cycle modeling is the lack of actual time curves

of different regulators of the cell cycle in consistent contexts (for the same cell

line, under similar experimental conditions, etc). While we cover this topic in

detail in Chapter 4, here it serves to illustrate the problem of contextually

consistent time profile data paucity. Over the last twenty years, several groups

have attempted to make detailed models of the cell cycle control system (e.g.

[19]). Modeling attempts for simpler eukaryotes such as yeast have been more

comprehensive, however. Yeast models have led the field in cell cycle modeling

for quite some time [12,20,48-53].

Figure 1.3 shows us cell cycle regulatory protein time profiles from Morgan

[2]. This textbook contains a comprehensive and carefully compiled summation

of current cell cycle knowledge, all of which is qualitative in nature with regard to cell cycle control system dynamics. As an example, we see that Figure 1.3 does not offer us the precise dynamics, but instead dynamics that are based on biologists’ intuition about the cyclins represented. It can be clearly seen that the

roles of the cyclins in the cell cycle checkpoints (indicated by the red arrows) are

what have driven the intuition that generated these time curves. There is very

limited actual measurement information drawn from consistent sources here.

This, in a nutshell, is the state of the art for mammalian studies. Our suspicion is

confirmed when we compare these cyclin curves with actual biological data we generated (see Chapter 3, Figures 3.2, 3.3).

Figure 1.3 ‘State of the art’ cell cycle control system and time curves of key players (cyclins) [2].

Additionally, generation of the time profile data required to fill the gap in

knowledge, typically requires labor intensive wet lab experimentation. This would

usually involve an experimental biologist measuring the data at different points in

time, for the same cell culture, while maintaining a strict protocol where the

sequence and timing of measurements would be of crucial importance to

generate a reliable time profile. This would typically challenge the biologist’s

endurance.

This is not the only problem in measuring cell cycle proteins, however. An equally serious problem is that of cell-cell heterogeneity. This means that if we tried to measure cell cycle protein dynamics for a population of cells using kinetic experiments, we would end up with a tangled mass of curves. Due to the asynchronicity that exists naturally in a population of cells, each cell starts the cell cycle at its own pace (based on its unique micro-environment). To offset this problem, some biologists synchronize cell populations before performing such kinetic experiments [17]. Cell synchronization simply means chemically shocking the population of cells so that the cells are “forced” to start their cell cycle together. It is apparent that such a system is not identical to cells in their naturally heterogeneous state. Moreover, such artificial synchronization is not maintained for very long, and results in various practical measurement issues, as demonstrated in [17] and shown in Figure 1.4. This loss of synchrony is primarily attributed to biological variance.

Figure 1.4. Synchronized cell population becoming asynchronous with time [17].

Gene Network Sciences Inc., in collaboration with the Dowdy group⁴, used

contact inhibition [17] to synchronize cells. From Figure 1.4, we observe that cells

are not completely synchronized even in the bimodal distribution at the zero hour

(in fact 10% of the cells are in S phase at this point already). The larger peak in

Figure 1.4 is the population of cells in G1, and the smaller peak is the population

of cells in G2/M. The trail in between them is the population of cells in S phase.

The changing fractions of cells between phases of the cell cycle can be observed

visually and from the percentages (%S phase) listed. The measurements

obtained from such cell populations, while much more accurate than those taken

similarly from asynchronous populations, would still be very inaccurate.

Sigal et al. [42] attempted an in silico synchronization of cells; i.e., they measured

protein levels in heterogeneous H1299 lung carcinoma cells, and then ensured

that the resultant protein dynamic curves were all made to start at the same

point. Their work demonstrated how due to biological variance, synchronicity was

lost within the first cell cycle. This further underlines the importance of the work

presented in this thesis, which aims to generate the dynamics of cell cycle

proteins from cell populations that have not been chemically shocked as in

synchronization studies.
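The loss of synchrony described above can be reproduced qualitatively with a toy calculation. The sketch below is an illustration under stated assumptions, not a model of the data in [17] or [42]: every cell is assumed to start its cycle at t = 0, each cell is given its own fixed cycle duration (spread evenly between 20 and 28 hours, an arbitrary choice), and a hypothetical protein is assumed to oscillate once per cycle. The peak-to-trough swing of the population average is then compared early and late.

```python
import math

def protein_level(t, period):
    """Hypothetical periodic protein level in one cell, in [0, 1], equal to 0 at t = 0."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * t / period))

def population_mean(t, periods):
    """Average protein level across cells that all started their cycle at t = 0."""
    return sum(protein_level(t, p) for p in periods) / len(periods)

def swing(periods, t_start, t_end, step=0.5):
    """Peak-to-trough range of the population mean over [t_start, t_end]."""
    n_steps = int(round((t_end - t_start) / step))
    values = [population_mean(t_start + i * step, periods) for i in range(n_steps + 1)]
    return max(values) - min(values)

# Assumed cell cycle durations in hours, spread evenly to mimic biological variance.
periods = [20.0 + 8.0 * i / 49 for i in range(50)]

early_swing = swing(periods, 0.0, 24.0)      # roughly the first cycle
late_swing = swing(periods, 240.0, 264.0)    # about ten cycles later

print("early swing of the population mean:", round(early_swing, 3))
print("late swing of the population mean:", round(late_swing, 3))
```

Each individual cell keeps oscillating with full amplitude throughout; only the population average flattens, which is why a population-level kinetic measurement cannot by itself recover single-cell dynamics.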

1.6. Thesis contribution

In Figure 1.5, we see a multilevel hierarchical structure as the basis for the work in this thesis. Here the cell population is the highest level (Level 3), the cell cycle control system is the next level (Level 2), and the intracellular signal transduction that happens upstream of the cell cycle control system is the lowest level (Level 1). Level 1, also referred to as upstream signaling, is studied in Chapter 5. Level 2, also referred to as downstream signaling, is studied in Chapter 4. The cell population model referred to in Figure 1.5 is presented in Chapter 4, Section 4.7.

⁴ Department of Cellular and Molecular Medicine, Howard Hughes Medical Institute, University of California San Diego School of Medicine, La Jolla, CA, USA

Salient contributions of this thesis include:

1. A novel systematic methodology for the extraction of time profile signals

of cell cycle variables from statically sampled, multiparametric flow

cytometry data, including systems engineering based noise filtering

techniques (Chapter 3, Section 3.6.1)

2. Cell cycle protein concentration time profiles, measured for K562 cells

[54] (a chronic myeloid leukemia cell line) (Chapter 3, Section 3.6.2)

3. CytoSys (Cytometry data characterized using Systems), a MATLAB

based software with an easy to use graphic user interface (GUI) front-end

to implement our methodology (Chapter 3, Section 3.7)

4. A computational model of the downstream signaling, using the above

data (Chapter 4, Section 4.6)

5. A mathematical model of the upstream cell signaling

[Figure 1.5 diagram. Elements shown: external stimulus (autocrine, paracrine, etc.); cell population model* (N-cell collective); receptors; signaling model** (signaling pathways in the cytoplasm: MAPK, PI3K-Akt, RSK, mTOR, CREB, Cyclin D); cell cycle model*** (modified Tyson’s) in the nucleus.]

Figure 1.5. The multilevel hierarchical context envisioned in this work.

*Not modeled in this thesis

**Mathematical model was developed by assembly from literature

***Calibrated mathematical model was developed

1.7. Thesis organization

Chapter 2 presents background on complex systems biology, modeling,

hierarchical systems and the role of data in models. Chapter 3 presents the time profile data extraction methodology using flow cytometry data, the core thesis contribution. Chapter 4 presents the cell cycle control system, or downstream, mathematical model. It also presents model calibration attempts, and previews a cell population model. Chapter 5 presents the upstream mathematical model that starts with receptor activation, and continues all the way to the point where the cell cycle control system is actuated. Chapter 6 presents a summary and promising avenues for future work.

2. BACKGROUND

2.1. Overview

In this chapter, we discuss systems biology and how it is practiced by prominent researchers in the field, and then show why we have converged on our particular approach to systems biology. We detail all methods under the rubric of Complex Systems Biology, starting with problem definition, computational modeling (modeling biological mechanisms with quantitative rigor from real biological data), and the use of certain systems engineering approaches to utilize the model for purposes of understanding. In the latter part of the chapter, we elaborate on the role of data in modeling. In this context we explore modeling the measurement process for two popular methods: Western Blotting [55] and Flow Cytometry. This is work in progress.

2.2. Chapter Organization

Section 2.3 briefly discusses the term systems biology. In Section 2.4, we explore the different disciplines that are practiced under the umbrella of systems biology. Specifically we elaborate on Complex Systems Biology, and its antecedents, philosophy, and practice. Section 2.5 offers a few thoughts on cross-level causality and multilevel modeling. In Section 2.6, we discuss different modeling approaches. We also elaborate on deterministic modeling using the mass action assumptions. Section 2.7 elucidates the role of data in predictive modeling. In Section 2.8 we preview the specific system we are modeling in this thesis.

2.3. Introduction

The term ‘systems biology’ is used in different ways by different researchers, and it is essential to understand precisely what this term means before adopting a ‘systems biology’ approach. In whichever manner systems biology is defined, it is always positioned at the intersection of the biological sciences and disciplines rooted in mathematics, engineering and computation. This means the systems biologist must be conversant with the biology of the problem at hand, and with the computational, mathematical and engineering approaches best suited to tackling it.

2.4. Systems Biology

The massive influx of data due to technological advances on multiple fronts,

and a realization that reductionist biology would not meet the challenges posed

by today’s biological problems are what have led to the ‘evolution’ of systems

biology [56]. Current biological challenges require us to ask increasingly refined

questions that demand answers grounded in an understanding of the complexity

of the systems involved, which in turn are increasingly dependent on accessing

all of this data. The results of the Human Genome Project have shown us that

the complexity lies not in the number of the genes involved, but in their

interconnectivity [21,56]. How the genes are interconnected, and how their cross-

level regulation is achieved, is what makes the human organism (and all other

biological organisms) complex, and an understanding of this key principle also

leads us to the heart of systems biology. These results have also given more

impetus to the burgeoning field of epigenetics [57] (where causes of inherited

changes in phenotype are sought in factors other than the underlying DNA

sequences).

2.4.1. Different types of Systems Biologies

Systems biology was first described in its current avatar by the systems scientist Mesarovic⁵ [58,59]. The formal study of systems biology, as a distinct discipline, was launched by Mesarovic in 1966 at an international symposium at the Case Institute of Technology in Cleveland, Ohio entitled "Systems Theory and Biology" [60,61].

Systems biology focuses primarily on understanding the functional

interactions between different components of the organism, at various spatial

levels (organism, organ, tissue, cell, sub-cellular). This view is the broad

consensus of Wolkenhauer⁶, Kitano⁷, Hood⁸ and Mesarovic. The particulars of how these pioneers of systems biology research understand this functional interaction playing out, and the specific nuances of their research philosophies, characterize the difference in their approaches.

Hood places emphasis on systems biology aiming to quantify the molecular elements of a biological system in order to understand their interactions, and to further integrate this understanding into predictive network models [56].

Wolkenhauer and Mesarovic concur that it is systems dynamics and the organizing principles of complex biological phenomena that give rise to the functioning and function of cells [62]. Kitano, in keeping with the philosophy of the field and to help it develop as a science amid the renewed interest of the past decade, categorizes it into four key areas of focus: systems structure, systems dynamics, systems control method and systems design method [63,64].

⁵ Professor of Systems Engineering and Mathematics, Case Western Reserve University, Cleveland, OH, USA
⁶ Chair in Systems Biology and Bioinformatics, University of Rostock, Rostock, Germany
⁷ Director, Sony Computer Science Laboratories Inc., Tokyo, Japan
⁸ Founder, Institute for Systems Biology, Seattle, Washington, USA

Most of the work presented in this thesis deals exclusively with systems dynamics, and the other sub-goals of systems biology (as formulated by Kitano [63]) are considered in relation to this study of system dynamics. In other words: extracting system dynamics where necessary; understanding system dynamics in relation to phenotype; and understanding how system structure facilitates system dynamics and how system dynamics alter structure. This includes a study of control structures within the system that cause such dynamics, etc.

Another way of categorizing the existing systems biology approaches is

depicted in Figure 2.1. This is based on the two predominant views on systems

biology – the first based on networks and the second on systems.

Figure 2.1. (R)evolutionary view of systems biology: from data to network to systems to complex systems.

The aim within network systems biology (as a part of postgenomic biology

research) is to systematically catalog the molecules and their relationships within

a living cell. This is often translated, at the practical end, into identifying the

relationships between entities buried in large, quantitative data sets.

Bioinformatics is a salient example of this paradigm in operation. Network

systems biology has been largely embraced by the genetics community in

studying gene networks. Protein networks are not currently studied this way due

to paucity of data in this field. A detailed treatment of network systems biology

practices is available in [29,65]. This approach is placed in the context of

research focus in Figure 2.2.

The systems approach to systems biology is intrinsically tied to the concept

of systems dynamics. For instance in computational biology, a system is viewed

as a relation on objects and concepts such as dynamics are used to understand

the system’s evolution with time (Section 2.4.2).

Some kinds of systems biology practice do not follow the basic consensus

definition given above (of understanding the functional interactions between

different components of the organism). One example is discovering electrical circuit motifs in biological

systems dynamics [66,67]. Another example is what is termed by some as

systems biology but is more properly called mathematical biology. The aim here

is to research biologically inspired mathematical problems [68].

2.4.2. Complex Systems Biology

Mesarovic et al. [69] mention that systems biology constitutes much more

than simply considering simultaneously all variables in biological observations;

i.e. observations on a systems level are not enough. The distinct behaviour of the

overall system has to be recognised. Moreover they pioneer the concept of a

system of systems. A system is defined as a relation on different components

(parts/items). These could, for instance, be genes or proteins. Such a system could in turn be organized into systems that are again viewed simultaneously.

Complex Systems Biology adopts this view of a system in practice. As the approach delineated in this work is rooted in complex systems theory, it will be referred to as the complex systems biology approach henceforth [29].

[Figure 2.2 diagram. Research is divided by focus. Focus on state variables: the -omics (proteomics, metabolomics, transcriptomics, genomics), whose aim is to understand what constitutes the system. Focus on relationships between state variables: dynamics (Complex Systems Biology and other approaches), whose aim is to use complex systems paradigms to understand system function, and statics, whose aim is to understand static interaction networks. Other labels in the diagram: research driven by systems engineering disciplines, biological disciplines, systems biology, bioinformatics.]

Figure 2.2. Categorization of approaches to biological problems based on research focus.

In Figure 2.2, we show a broad categorization of different approaches in systems biology based on research focus. Here we see that the –omics seek to complete our understanding of the biological details within each cell, or in systems terminology, of the state variables and how they are regulated. The Nature Omics Gateway [70] offers comprehensive information on the –omics disciplines. Complex Systems Biology seeks to fathom, instead, the essential links that make a system function as a system. It seeks to do so through the application of complex systems paradigms, in the light of how the system evolves in time [60,69,29]. Network systems biology (Section 2.4.1) is categorized here under static approaches.

2.4.3. Hierarchical and multi-level paradigm- concepts and significance

The precise conceptual framework required for Complex Systems Biology (CSB) can be found in the system of systems concepts pioneered in multi-level hierarchical theory [71] and in Mathematical General Systems Theory (MGST) [72]. In ‘Theory of Hierarchical Multilevel Systems’ [71], the authors describe the different levels that constitute the multi-level hierarchical structure as being based on three different criteria (illustrated in Figure 2.3).

[Figure 2.3 diagram. Hierarchy in hierarchical multilevel systems is based on: abstraction (modeling significance), giving LEVELS; decision-making complexity, giving LAYERS; priority (action), giving ECHELONS.]

Figure 2.3. Rationale for hierarchy formation [71]

The three methods of assigning levels in a multi-level hierarchical system are formalized in the framework of mathematical theory of general systems. The concept of the multi-level, hierarchical system cannot be explained concisely. In fact [71] shows us that explanation of this term is done by enumeration of the characteristics of such a system.

The main characteristics of multilevel hierarchical systems are:

• Vertical arrangement
  o In any hierarchy there is included a vertical hierarchy of subsystems. Moreover the terms ‘system’ and ‘subsystem’ simply refer to the transformation that converts input(s) into output(s).

• Right of intervention
  o The higher level has the right to intervene in lower level modules’ performance if the lower level module crosses the bounds set on the tolerance for acceptable performance.

• Performance interdependence
  o Lower level subsystems are independent of their peers; i.e. two subsystems at the same level cannot be dependent on one another for performance.
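The characteristics listed above can be sketched in a few lines of code. This is a minimal illustration under our own assumptions, not a construction taken from [71]: each lower-level subsystem is represented as a plain input-to-output transformation (here a hypothetical gain), peers never read one another's outputs, and the higher level intervenes only when an output leaves a tolerance band. Intervention is modeled, arbitrarily, as clamping the output back into the band.

```python
def make_subsystem(gain):
    """A lower-level subsystem viewed as a transformation from input to output."""
    def transform(signal):
        return gain * signal
    return transform

def supervise(subsystems, signal, lower, upper):
    """Higher level: run each subsystem and intervene when its output leaves [lower, upper]."""
    report = []
    for name, transform in subsystems.items():
        output = transform(signal)
        intervened = not (lower <= output <= upper)
        if intervened:
            # Assumption: intervention simply clamps the output back into the tolerance band.
            output = min(max(output, lower), upper)
        report.append((name, output, intervened))
    return report

# Two hypothetical peer subsystems; neither depends on the other's output.
subsystems = {"module_a": make_subsystem(0.5), "module_b": make_subsystem(3.0)}

report = supervise(subsystems, signal=2.0, lower=0.0, upper=4.0)
for name, output, intervened in report:
    print(name, output, "intervened" if intervened else "left alone")
```

With these illustrative numbers, module_a stays within bounds and is left alone, while module_b exceeds the upper bound and is corrected by the higher level.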

Multilevel hierarchical theory offers us the conceptual framework that helps us tackle a key dilemma in complex systems decision making. Reference [71] states the following two characteristics of the decision making process:

1. When decision time comes, the making and implementation of a decision

cannot be postponed (for instance, inaction is the deliberate action of not

doing anything)

2. The uncertainties regarding the consequences of implementing various

alternative actions, and the lack of sufficient knowledge of the

relationships involved, prevents a complete formal description of the

situation which is needed for a rational selection of a course of action.

These two characteristics lead us to a dilemma in making decisions in complex systems settings. A hierarchy of levels, where the immediately higher level serves as a motivator/helper for the level at which such a dilemma is faced, adequately solves this problem. For instance, suppose this dilemma were faced at the cellular level, i.e., cells had to decide whether or not to turn on an apoptotic signal in response to a certain combination of biochemicals. The cells would have to refer the question to the tissue level as to whether this was in the best interests of the tissue, which would in turn refer the question to the organ, and so on up to the organism. The response would then be translated back across levels in some form, and finally be answered for the cells.

The beauty of this approach is that each of these levels operates on a

different time scale, and while the underlying mechanisms are currently not clear,

such a hierarchy of levels completely eliminates such dilemmas. This has been repeatedly demonstrated using simulations of complex signaling pathway models, and their ‘predictions’ at the level where the phenotype is defined

(example – tissue in prostate cancer; cells in leukemias; etc) [36,30,29,73].

Another pertinent application of hierarchical multilevel theory, in our opinion, is in the practice of systems biology itself. In truly complex systems, investigators studying the problem at a lower level invariably find that they face questions that pose dilemmas if answered at the same level at which they were asked. Consequently, they are compelled to move to a higher hierarchical level if they seek an answer that satisfies (or perhaps even optimizes).

2.5. Cross-level causality

Intracellular signaling that regulates the cell cycle control system is

investigated in this thesis. We attempt to contextualize this work by exploring the following questions:

• How does one understand signaling pathways (lower level or level 1) in relation to the phenotype that is manifesting (higher level or level 2)?

• If we view a system as composed of hierarchical levels, then how can we understand the relation between these levels? For instance, how does one connect what is happening at the signaling level with what is happening at the tissue level? How does one connect what is happening at the tissue level with what is happening at the organism level?

• How does one answer questions at a given level, based on higher level information? (In other words, in the light of the context provided by higher level information, what methods exist to improve our investigation when we are confined to a lower level with the exception of the aforementioned information?)

To answer these questions, we must ask: What is the most comprehensive cross-level modeling that one can do? Ideally, one must be prepared to model not only subsystems within the overall organism, but also the interrelations between different subsystems, and their overall context. With reference to our modeling choice, we must model signaling pathways within cells, cell populations, and the organization of cell populations under the constraints imposed by their respective natural environments (e.g. tissue architecture constraints for cells constituting solid tissue; lymphatic system constraints for cells constituting lymph; etc.), as well as how these relatively macroscopic systems themselves are organized into the organism. We must also be prepared to model the relationships between cells and cell populations, between cell populations and tissues, and between tissues and the organism. It gives one pause to wonder whether these are two different things, or whether the very understanding of the structural layout (both spatial and temporal) would answer how these different levels are interrelated.

The answer seems to lie in the fact that we never face the ideal situation.

There are numerous resource constraints, and these constraints force us to tackle the problem on multiple fronts. In this case we utilize a twin-pronged approach where a model is built at each level, and the connection between these respective models is then painstakingly crafted. Ideally, however, simply having one picture of the entire biology (a dynamic picture, encompassing the entire span of the organism in question) would answer most, if not all, of our questions. In the absence of such a picture, all our research is directed at constructing it.

One very basic level of approaching the cross level problem is using a binary representation of phenotype. This is a relatively common approach, and involves studies where, for instance, the constitutive activation of a certain signaling pathway has been implicated in a specific disease. In such an event, the idea is that when this pathway is not constitutively active, the phenotype is characterized as healthy, and when it is constitutively active, as pathological.
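A binary representation of phenotype of this kind reduces to a simple rule. The sketch below is a hypothetical illustration only: it assumes that pathway activity has been sampled at a few time points, that we know at which of those points an external stimulus was present, and that "constitutively active" can be read as "above an arbitrary threshold even when no stimulus is present". The traces, the threshold and the function names are all invented for the example.

```python
def is_constitutively_active(activity, stimulus, threshold):
    """True if pathway activity exceeds the threshold at every unstimulated time point."""
    unstimulated = [a for a, s in zip(activity, stimulus) if not s]
    return bool(unstimulated) and all(a > threshold for a in unstimulated)

def binary_phenotype(activity, stimulus, threshold=0.5):
    """Map a pathway activity trace to the two-valued phenotype described in the text."""
    if is_constitutively_active(activity, stimulus, threshold):
        return "pathological"
    return "healthy"

# Hypothetical traces: stimulus present only at the second and third time points.
stimulus = [False, True, True, False, False]
normal_trace = [0.1, 0.8, 0.9, 0.3, 0.1]   # active only while stimulated
mutant_trace = [0.9, 0.9, 1.0, 0.9, 0.8]   # active regardless of stimulus

print(binary_phenotype(normal_trace, stimulus))
print(binary_phenotype(mutant_trace, stimulus))
```

The simplicity is the point: all of the cross-level structure is collapsed into a single yes/no variable, which is what makes this the most basic way of approaching the cross-level problem.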

A systems theoretic approach to cross-level causality is being explored by the systems scientist Mesarovic. While a very interesting and sorely necessary area for research, a discussion on the same is beyond the scope of the work presented here.

2.5.1. Multiscale modeling

In the context of multiscale modeling, we must seek to model cell

populations. In turn, in cell population modeling context, cell-cell heterogeneity is

an important topic for study. While it is beyond the scope of this thesis to discuss

this in depth, efforts in this area include those by the Sorger lab [74] and the

Altschuler and Wu lab⁹ [75].

Ultimately, our work is intended to proceed in the direction of building multilevel, hierarchical models. Invariably the levels in these models would be at several of the spatial levels listed below:

• Genetic level
• Intra-cellular level
• Cell population level (Inter-cellular level)
• Tissue level
• Organ level
• Organism level
• Environment level

At each of these levels we know that a model is required, an image of the reality at that level. Our next challenge, then, is to conceive a model at every level. In our work, we are interested in the levels that start at the cellular-level protein interaction network, and for our purposes it is practical to start our conception here. We have the Tyson model of the cell cycle control system, which is actuated by an upstream signaling network. The connection between the two is not exactly known, but the literature asserts that this happens via Cyclin D [76].

⁹ University of Texas Southwestern Medical Center

2.6. Modeling

2.6.1. Importance of modeling

Like you need a set of tools to do experiments, you need concepts to understand.

Mihajlo Mesarovic [43]

A model is never precise, in that it is an image or a caricature of reality but not reality itself; however, it can get us closer to understanding reality. In the absence of any coherent model, we have no way of relating to reality itself. A model is a bridge between the intuition of the researcher and biological reality.

Understanding with the intention of converting pathological to normal drives most biological research. Such an approach leads us to always develop models that are highly contextual. The goal of such modeling is to pin down the best contextual image of reality. In the complex systems biology approach, we want to pin down the best contextual image of reality for understanding the system’s organizing principles. In most systems biology studies, a firm grasp of these is intended to illuminate what leads to life as an emergent property. In contrast, the thrust of Complex Systems Biology as defined by us is on the organizing principle of coordination, and discussions pertaining to emergence are limited to how coordination relates to it. Here coordination refers to a higher-level decision function which motivates the lower-level subsystems such that the overall behavior of the system is advanced.

In the modeling of most biological systems, the ultimate purpose is to

understand system behavior under certain conditions in a given amount of time,

or at a certain point in future time. In other words, we are interested in how

certain system states that encapsulate system behavior vary as functions of time.

This is why we stress the need for dynamic models.

2.6.2. Contextual view of types of models in cancer biology studies

Two questions drive the choice of modeling methodology:

 What portion of reality do we want to model?

o Complementarily, in the systems biology context- what biological

mechanism should be ignored?

 What is the optimal level of detail for such modeling (to best learn the

kinds of lessons we want to learn through modeling)

o In the systems biology context- shall we model at the gene, cell,

tissue or organism level?

For instance, the questions we asked in one such exercise are:

 What cellular process to model?

o Choice of cell cycle is due to a confluence of factors

. Ubiquity of process

. High relevance to pathology of interest (cancer)

. Availability of experts in the field for collaboration

. More than 20 years of modeling literature on the process itself


 At what level of detail should the process be modeled?

o Protein interaction level

. Since cell cycle regulation via cell control system happens at

this level

 Our goal is to understand cell cycle regulation and the

role of regulatory processes in cancer, the pathology of

interest

Once these questions have been answered, we explore different choices

available for modeling appropriately. A comprehensive way of classifying

modeling approaches is shown in Figure 2.4.

Figure 2.4. Modeling architectures [1]

The modeling scheme is chosen according to the characteristics of the biological system and the context in which it is being studied. For instance, biochemical reactions occurring under specific conditions, such as a certain protein concentration threshold, may be modeled using a hybrid approach. In another instance, Partial Differential Equations (PDEs; see footnote 10) were employed in modeling inter-cellular protein interaction [77]. At the level of protein interaction, for the purpose of understanding its dynamics, the most appropriate modeling scheme in this classification is continuous state, continuous time, deterministic modeling. In this thesis a detailed discussion is reserved for this scheme only. Moreover, the method used to translate the biochemical reactions into Ordinary Differential Equations (ODEs) is detailed.

2.6.3. Mathematical and Computational Models

A brief look at different kinds of modeling in systems biology is presented now. We first turn our attention to Mathematical Modeling, and contrast it with

Computational Modeling. Both terms are very broad, and the term modeling here

refers to whole families of modeling rather than specific modeling schemes.

Definition 2.1:

A mathematical model is built around mathematical laws with relationships

expressed in terms of mathematical equations or inequalities. Alternately, it is

simply the expression of mechanistic interactions that explain the biology in a

mathematical framework.

Definition 2.2:

A computational model is a mathematical model that relies on real data and a quantitative rigor that allow it to compute extrapolations to biologically proven scenarios. Such models rely on extensive computational resources to achieve this end. Examples are the Tyson cell cycle model [19] and the Soebiyanto JAK-STAT signaling model [39].

10 In systems biology, PDEs are usually used in modeling systems where the assumption of a well-stirred compartment is not valid, and hence we must account for the spatial distributions of biochemical species.

2.6.4. Phenomenological vs. Mechanistic Models

An alternative way of broadly differentiating amongst models is classifying

them into phenomenological models and mechanistic models. A

phenomenological model takes empirical observations of phenomena and relates

them to each other, using mathematics to explain the behavior of the system.

This doesn’t mean that the model is able to explain how these relations were

arrived at. These are essentially input/output models.

Mechanistic models on the other hand focus on using biologically proven

interactions as the base for modeling. Such a model is built upon first principles

and is simply the translation of biology into mathematical language using some

modeling scheme [73,37,19,39].

2.6.5. Static vs. Dynamic Models

Models may also be differentiated into static and dynamic models. A static

model is one which is concerned with interactions between system components,

but not with the evolution of these interactions with time (e.g. a model of gene interaction where one gene turns on another). A dynamic model is concerned not

only with interactions between system components, but also with how these

interactions evolve as functions of time. This gives the researcher insight into the different possible configurations of system interactions which are of relevance to phenotype. Some examples include the Tyson cell cycle model [19], the Soebiyanto JAK-STAT signaling model [39], the Asthagiri MAPK model [73] and the Schoeberl MAPK model [35].

It is important to realize that a certain process may be modeled in more than one way and this has the advantage of viewing the same biological reality from different vantage points and thereby gaining different insights. The drawback is that assimilation of established models (e.g. individual metabolic pathways) into larger super-models is often practically impossible (e.g. complete cellular metabolism) [9].

2.6.6. Deterministic vs. Probabilistic Models

Human disease provides the motive for most systems biological research.

For investigators interested in protein interactions of human systems, an appropriate modeling scheme is deterministic, with a focus on the concentrations of the proteins involved, and the level at which the modeling is done (cellular in our case). A dynamic model which predicts at any time point (t) the precise value of a state variable (x(t); this is the biological species of interest) is referred to as deterministic [43]. In contrast, a model which accounts for random variations in the state variables and/or coefficients is referred to as stochastic. Such models are appropriate when the number of molecules (concentration) of the state variable is markedly lower than what is applicable in deterministic models [59,29].


When this happens, population variability of the species of interest increases.
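To make the contrast with the deterministic case concrete, the following minimal sketch simulates a reversible binding reaction A + B <-> C with integer molecule counts. It uses Gillespie's direct method, which is a standard stochastic simulation technique but is not described in this text; the function name `gillespie`, the stochastic rate constants `c1` and `c2`, and all numerical values are hypothetical and chosen only for illustration.

```python
import math
import random


def gillespie(a, b, c, c1, c2, t_end, seed=0):
    """Direct-method stochastic simulation of A + B -> C (c1) and C -> A + B (c2),
    tracking integer molecule counts instead of concentrations."""
    rng = random.Random(seed)
    t, history = 0.0, [(0.0, a, b, c)]
    while True:
        propensities = (c1 * a * b, c2 * c)
        total = propensities[0] + propensities[1]
        if total == 0.0:
            break  # no reaction can fire any more
        # The waiting time to the next reaction is exponentially distributed.
        t += -math.log(1.0 - rng.random()) / total
        if t > t_end:
            break
        if rng.random() * total < propensities[0]:
            a, b, c = a - 1, b - 1, c + 1   # binding event
        else:
            a, b, c = a + 1, b + 1, c - 1   # dissociation event
        history.append((t, a, b, c))
    return history


# Hypothetical small molecule counts, where random fluctuations matter most.
history = gillespie(a=20, b=15, c=0, c1=0.01, c2=0.1, t_end=50.0)
```

Each run with a different seed gives a different trajectory, which is the population variability referred to above; a deterministic ODE model would instead return the same smooth curve every time.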

2.6.7. Dominant relationship modeling: Unmodeled dynamics

We are interested in modeling the relationships within system substructures that drive the system to different configurations of interest. Dominant relationships at any given hierarchical level of the system are what must ultimately be included in any model that intends to capture biological reality.

Dominant relationships are interactions that drive the system. In systems terminology, these are the relations on variables that are causally linked to the system output.

Biological truth has infinite variation, but as researchers we are problem-solvers and need to define the scope of our research efforts so that we pick a problem we can realistically solve. Hence the concept of dominant relationships, in addition to giving key insights into the problem we are solving, also helps us define boundaries around the questions we want to answer. This helps us define a tractable problem, given time and resource constraints. Picking the right amount of detail to model is as much an art as it is a matter of skill and experience.

Each time we draw such boundaries, we leave large portions of biology unmodeled. Truthfully, even in problem context, the vast majority remains unmodeled in current research. These unmodeled dynamics represent biological reality which our model does not account for. Our modeling scheme must include some way of bridging the gulf thus created between model and reality. There are several instances of this in the literature. For instance, the Tyson group clearly


scoped their problem as including a model of the cell cycle control system and whatever immediate biochemical interactions made this system function [78,19].

Any other dynamics upstream of this were replaced with a switch-like mechanism. Another example of similar treatment replaces unmodeled dynamics with ‘fictitious inputs’ and references the theory of universal inputs [79].

2.6.8. Modeling Approaches: How does one go about modeling

The science of modeling, particularly in the engineering disciplines, has been studied very thoroughly. Adapting the process of model development from [9], we present its different steps in Figure 2.5. The process starts with formulating a scientific problem of interest to the investigator. Once the problem is formulated, its scope is gauged, so that before investing inordinate time and resources in it we are assured that it is tractable and practically feasible.


[Figure 2.5 flowchart. Boxes: Problem Formulation; Information Verification; Model Structure Selection; Simple Model Establishment; Iterative Model Refinement; Sensitivity Analysis; Experimental Verification of Model Predictions; Reconciling Biological vs. Computational Results. Decision points: Scope?; Adequate?; Theoretical expertise?; Results?; Success?; Satisfactory?; Success?]

Figure 2.5. Model development process (feedback loops indicative - not limited to those shown).

In Figure 2.5, each backward dotted arrow indicates a negative outcome at its decision box, while the solid forward arrow indicates a positive outcome. In some cases the actual outcomes of the modeling exercise are the same as what was initially intended. In others they may be different and could help formulate a strategy for problem solution, while in yet others the modeling may help come up with suggestions for experimental design. It is important to remind ourselves that the steps listed here are intimately dependent on one another. For instance, if we formulate a problem (step 1) for which there is no available information (step 2), then we would have to go back and reformulate the problem. The practical modeling process is likely to require moving between different steps as needed by the specific problem demands.

More dramatic advice for the modeler, with a practical and almost avuncular slant to it [80] includes the following three rules for model development- (1) Lie,

(2) Cheat and (3) Steal.

Lying refers to the fact that a good model must be a working model, and a working model almost always contains incorrect assumptions, in an attempt to ensure that the ‘number of parameters does not outstrip the available data’.

Cheating refers to using what data is available to fit the model, and very often this means, for instance, fitting univariate data to a multivariate rate equation, which would make statisticians very uncomfortable. Stealing refers to using other people’s work in modeling. If someone else’s model in a different field fits the purpose, then one must use it. If someone else’s model calibration method appears to be the rational choice for the problem, then one should use it. Unwritten but implied is the reminder to give credit where credit is due.

2.6.9. Modeling errors and rectification

Any one of the steps of model development (steps 1-8 in Section 2.6.8, Figure 2.5) can be subject to errors. However, the nature of biological phenomena is inexact, and this makes the modeling process very context specific. Errors in the modeling process can broadly be classified into:


1. Errors in modeling

2. Errors in data

This section briefly discusses errors in modeling. Errors in the data are discussed in Sections 2.6 and 2.7.

One example of a modeling error is the exclusion of important biochemical reactions from the modeling (step 3), which may be the result of misinterpreting literature, insufficient knowledge or misinterpreting the biologist’s intuition, or simple oversight (step 2). Needless to say, such errors have very serious consequences such as change of overall system behavior. The only proper rectification for such an error is the inclusion of what was previously left out.

Another example of a serious error is the adoption of a wrong modeling scheme. For instance, when modeling signal transduction, it is generally accepted that deterministic modeling is the proper method to model system dynamics at this level unless there is some specific exception. One possible mistake at this point is for a modeler to choose ODEs to model a system where compartments within must be modeled using PDEs instead (see footnote 8, page 21).

2.6.10. Mathematical formalism

Mathematics is the language of precision. Such a language forces the researcher to precisely define the problem at hand. In the biological sciences, where uncertainty abounds and all but the most primary axioms fluctuate based on context, precision is vital. In this section we try to pin down the all-important mathematical portion of model building. We illustrate it with the most commonly used modeling approach in systems biology signaling pathway studies: deterministic ODEs. The mathematical notation used in this thesis is thus introduced in this section.

In strict systems language, any system, biological or otherwise, may be represented as an input/output system. This study uses a deterministic model in which the biochemical reactions are converted into a mathematical representation using the mass action law, essentially derived from first principles [81], and Michaelis-Menten approximations (see [27,28] for derivations). The mass action law was introduced in the 19th century by Guldberg and Waage [31]. It states that the rate of a chemical reaction is proportional to the probability that the reacting species (molecules) collide. This collision probability is in turn proportional to the concentration of the reactants.

2.6.10.1. Use of mass action modeling: An example

To illustrate the use of the mass action law in representing biochemical reactions mathematically, we use an example of four biochemical species (A, B, C and D) and their reactions (example adapted from [29]). We suppose that these reactions are the following elementary reactions (no intermediate reaction):

$$A + B \xrightarrow{k_1} C \tag{2.1a}$$

$$C \xrightarrow{k_2} A + B \tag{2.1b}$$

$$3C \xrightarrow{k_3} D \tag{2.1c}$$

where k1, k2 and k3 are the rate constants (i.e. the rate of binding). Here, biochemical species A and B combine at a rate of k1 to form a new species C (reaction 2.1a). C in turn breaks down into A and B (reaction 2.1b), and develops into a new species D (reaction 2.1c). The rate constants (k1, k2, k3) determine how much of each biochemical species is formed over time.

For each reaction in (2.1), the biochemical species on the left hand side are called reactants, and those on the right hand side are products. Based on the mass action law, the rate of reaction (v) is directly proportional to the product of the reactant concentrations [28]:

$$v_1 = k_1 [A][B] \tag{2.2}$$

$$v_2 = k_2 [C] \tag{2.3}$$

$$v_3 = k_3 [C]^3 \tag{2.4}$$

where the bracket indicates concentration of the species, typically expressed in units of nanomolar (nM). The rate of change of the concentration of each biochemical species is determined by the net rate of the production and consumption reactions, and by the stoichiometry (number of molecules of the species involved in the reaction).

Let the state x denote the concentration vector of the biochemical species (i.e. $x = ([A], [B], [C], [D])^T$); then:

$$\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \\ \dot{x}_4 \end{bmatrix} = \begin{bmatrix} -v_1 + v_2 \\ -v_1 + v_2 \\ v_1 - v_2 - 3v_3 \\ v_3 \end{bmatrix} \tag{2.5}$$
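As a minimal sketch of how equations (2.2) to (2.5) turn into a simulation, the following code integrates the four-species example with a hand-written fourth-order Runge-Kutta step. The integrator choice, the function names, the initial concentrations and the rate constants are all assumptions made for illustration; they are not values or methods taken from this thesis.

```python
def rates(x, k):
    """Mass-action reaction rates v1, v2, v3 of equations (2.2)-(2.4)."""
    a, b, c, d = x
    k1, k2, k3 = k
    return (k1 * a * b, k2 * c, k3 * c ** 3)


def rhs(x, k):
    """Right-hand side of equation (2.5): derivatives of [A], [B], [C], [D]."""
    v1, v2, v3 = rates(x, k)
    return (-v1 + v2, -v1 + v2, v1 - v2 - 3.0 * v3, v3)


def rk4_step(x, k, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(b + h * s for b, s in zip(base, slope))

    s1 = rhs(x, k)
    s2 = rhs(shifted(x, s1, dt / 2.0), k)
    s3 = rhs(shifted(x, s2, dt / 2.0), k)
    s4 = rhs(shifted(x, s3, dt), k)
    return tuple(
        xi + dt / 6.0 * (p + 2.0 * q + 2.0 * r + s)
        for xi, p, q, r, s in zip(x, s1, s2, s3, s4)
    )


def simulate(x0, k, dt, n_steps):
    """Return the list of states visited, starting from x0."""
    trajectory = [tuple(x0)]
    for _ in range(n_steps):
        trajectory.append(rk4_step(trajectory[-1], k, dt))
    return trajectory


# Hypothetical values chosen only for illustration (concentrations in nM).
x0 = (10.0, 8.0, 0.0, 0.0)
k = (0.05, 0.1, 0.001)
trajectory = simulate(x0, k, dt=0.01, n_steps=2000)
```

A useful sanity check on any such implementation follows from (2.5) itself: the quantities [A] - [B] and [A] + [C] + 3[D] have zero time derivative, so they should stay constant along the simulated trajectory.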

2.6.10.2. General system modeling: Using mass action modeling

In general, any elementary reaction j as in (2.1) can be written as:

$$R_j:\; s^{in}_{1,j} x_1 + s^{in}_{2,j} x_2 + \ldots + s^{in}_{N,j} x_N \xrightarrow{k_j} s^{out}_{1,j} x_1 + s^{out}_{2,j} x_2 + \ldots + s^{out}_{N,j} x_N \tag{2.6}$$

where in any reaction j,

$s^{in}_{i,j}$ = stoichiometry of the i-th species/variable acting as reactant

$s^{out}_{i,j}$ = stoichiometry of the i-th species acting as product

i = 1, . . . , N; and j = 1, . . . , P

The stoichiometry of each species is equal to the number of molecules of

species ‘i’ involved in reaction j. Since the reactions in the signaling pathway are

not only mathematically transformed using the mass action law, but also using

the assumption of Michaelis-Menten kinetics [27,28], we will separate these

reactions according to their kinetic assumption by adding the following subscript

to their reaction (R) and rate constant (k): ma for mass-action based kinetics and

MM for Michaelis-Menten kinetics. Similarly, we will also distinguish reactions

involving the ligand input as Rip and the rate constant as kip. The pathway being

studied can now be represented as a nonlinear-affine system:

$$\dot{x} = f_1(x, k_{MM}) + f_2(x, k_{ma}) + g(x, k_{ip})\,u \tag{2.7a}$$

$$y = h(x, u) \tag{2.7b}$$

where $x \in X \subseteq \mathbb{R}^N$ is the state vector (concentrations), $u \in U \subseteq \mathbb{R}^L$ is the vector of input (ligand) concentrations, and $y \in Y \subseteq \mathbb{R}^M$ is a vector of measurable output concentrations. $f_1(\cdot)$ and $f_2(\cdot)$ are the Michaelis-Menten and mass-action functions, respectively, and are defined as follows. Let the vector of reaction rates be $v$, and the stoichiometry of species $i$ in biochemical reaction $j$ be:

$$s_{i,j} = -s^{in}_{i,j} + s^{out}_{i,j} \tag{2.8}$$

in which $s_{i,j}$ is the element in row $i$ (species) and column $j$ (reaction) of the stoichiometry matrix $S$, which takes integer values ($S \in \mathbb{Z}^{N \times P}$). Then:

$$f_1(x, k_{MM}) = S_{MM}\, v_{MM}(x, k_{MM}) \tag{2.9a}$$

$$f_2(x, k_{ma}) = S_{ma}\, v_{ma}(x, k_{ma}) \tag{2.9b}$$

The subscripts MM and ma indicate Michaelis-Menten and mass action reactions, respectively. The vector function $v_{MM}$ is described in [27,28], while the vector function $v_{ma}$ is derived in [34] as:

$$v_{ma} = \mathrm{diag}(k_{ma})\, e^{S_{in}^{T} \log(x)} \tag{2.10}$$

where the matrix $S_{in}$ has $s^{in}_{i,j}$ as its elements. Here, exponential and logarithmic operations are element-wise matrix operations. Elements of $v_{ma}$ are typically formulated as

$$v_{ma,j} = k_j \prod_i \left(x^{in}_{i,j}\right)^{s^{in}_{i,j}}, \quad j \in [1, 2, \ldots, P] \tag{2.11}$$

where $x^{in}_{i,j}$ is species $i$ acting as reactant in reaction $j$. $g(x, k_{ip})$ is defined as follows:

$$g(x, k_{ip}) = \mathrm{diag}(k_{ip}) \exp\!\left(S_{ip}^{T} \log(x)\right) B \tag{2.12}$$

$B$ is a stoichiometry matrix for the input. In (2.12) both exp(.) and log(.) are again element-wise operations. The system in (2.7) assumes that the ligand-receptor binding has 1 to 1 binding stoichiometry.

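Hence, for the example described in equation 2.1, the reaction stoichiometries are as follows (indices follow the convention $s_{i,j}$: species $i$, reaction $j$):

| Reaction # | Non-zero stoichiometries | Overall |
|---|---|---|
| 1 | $s^{in}_{1,1} = 1;\ s^{in}_{2,1} = 1;\ s^{out}_{3,1} = 1$ | $s_{1,1} = -1;\ s_{2,1} = -1;\ s_{3,1} = 1;\ s_{4,1} = 0$ |
| 2 | $s^{in}_{3,2} = 1;\ s^{out}_{1,2} = 1;\ s^{out}_{2,2} = 1$ | $s_{1,2} = 1;\ s_{2,2} = 1;\ s_{3,2} = -1;\ s_{4,2} = 0$ |
| 3 | $s^{in}_{3,3} = 3;\ s^{out}_{4,3} = 1$ | $s_{1,3} = 0;\ s_{2,3} = 0;\ s_{3,3} = -3;\ s_{4,3} = 1$ |

Table 2.1. Equation 2.1 reaction stoichiometries. Column 3 is obtained through application of equation 2.8.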

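Since all of these reactions are mass action based, clearly

$$f_1(x, k_{MM}) = S_{MM}\, v_{MM}(x, k_{MM}) = 0 \tag{2.13}$$

For the example, the mass action function $f_2(x, k_{ma})$ is given by:

$$f_2(x, k_{ma}) = S_{ma}\, v_{ma}(x, k_{ma}) = \begin{bmatrix} -1 & 1 & 0 \\ -1 & 1 & 0 \\ 1 & -1 & -3 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} k_1 x_1 x_2 \\ k_2 x_3 \\ k_3 x_3^3 \end{bmatrix} \tag{2.14}$$

Here $S_{ma}$ is written with one row per species and one column per reaction, so that the product reproduces the right-hand side of (2.5).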

The ODEs generally describe the change of each biochemical species in the pathway due to production, consumption and degradation of the species.
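The matrix form of the mass action term can be checked numerically against the hand-written rates of the four-species example. The sketch below assumes the convention that rows index species (A, B, C, D) and columns index reactions (2.1a, 2.1b, 2.1c); the names `S_IN`, `S_OUT`, `mass_action_rates` and `mass_action_rhs`, and the numerical values of `x` and `k`, are hypothetical.

```python
import math

# Reactant stoichiometries S_IN[i][j] and product stoichiometries S_OUT[i][j]
# for species i in (A, B, C, D) and reactions j in (2.1a, 2.1b, 2.1c).
S_IN = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 0],
]
S_OUT = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]

# Equation (2.8): net stoichiometry is products minus reactants.
S = [
    [s_out - s_in for s_in, s_out in zip(row_in, row_out)]
    for row_in, row_out in zip(S_IN, S_OUT)
]


def mass_action_rates(x, k):
    """Equation (2.10): v_j = k_j * exp(sum_i S_IN[i][j] * log(x_i)).

    Only reactant species contribute; species with zero reactant
    stoichiometry are skipped. Reactant concentrations are assumed
    to be strictly positive so that the logarithm is defined.
    """
    rates = []
    for j, kj in enumerate(k):
        exponent = sum(
            S_IN[i][j] * math.log(x[i]) for i in range(len(x)) if S_IN[i][j]
        )
        rates.append(kj * math.exp(exponent))
    return rates


def mass_action_rhs(x, k):
    """Mass-action term of equation (2.7a): the matrix-vector product S v."""
    v = mass_action_rates(x, k)
    return [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(x))]


# Hypothetical concentrations (nM) and rate constants.
x = [2.0, 3.0, 4.0, 0.5]
k = [0.5, 0.2, 0.01]
v = mass_action_rates(x, k)   # approximately [k1*x1*x2, k2*x3, k3*x3**3]
dx = mass_action_rhs(x, k)
```

Writing the rates through exponentials of logarithms is convenient because the same two lines work for any number of species and reactions once the stoichiometry tables are filled in.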

Transfer of biochemical species from one compartment to another, such as from cytoplasm to nucleus, can also be described using an ODE when [32]:

(i) the compartments are well-stirred and

(ii) the rates of transport between compartments are observable.

A well-stirred compartment means the biochemical species are evenly distributed in space. When these conditions are not satisfied, a PDE becomes more suitable as it takes into account the spatial distribution of the biochemical species. This is outside of the scope of this thesis, but a detailed discussion of the same can be found in [82].

2.6.11. Calibration and Validation

Once our ODE model has been built, we are in need of coefficients (rate

constants and initial conditions) in order to make the model computational. Some

of these are available in the literature, but almost always there are those that must be estimated. The process of identifying which coefficients must be

estimated, and estimating them, is referred to as model calibration. In contrast

when we test these estimated coefficients on fresh biological data, we refer to it

as model validation.

Any parameter estimation problem is essentially an optimization problem.

For such estimation, there are a few well defined steps. It will be seen that we

are indeed dealing with an optimization problem, as we discuss these steps,

which are shown in Figure 2.6.


[Figure 2.6 flowchart: Define error function of importance -> Define parameters to be optimized -> Define parameter bounds -> Choose estimation algorithm -> Run algorithm iteratively -> Satisfactory solution? (YES: done; NO: run the algorithm again; NEVER: revisit model formulation)]

Figure 2.6. Parameter estimation steps.

We go through the model calibration steps, one at a time:

STEP 1. Define error function

If we use the term θ to denote the collective coefficient set (all rate constants and initial conditions), then the error function is defined as the weighted least squares function:

$$\min_{\theta} \sum_{d} \left[ y(t_d, \theta, u) - y^{Exp}(t_d) \right]^{T} Q \left[ y(t_d, \theta, u) - y^{Exp}(t_d) \right] \tag{2.15}$$

Here, $y^{Exp}(t_d)$ is the experimental data array (vector) at time $t_d$, and $y(t_d, \theta, u)$ is the simulated model output at time $t_d$. Q is a diagonal matrix that weights the measurement errors and is indexed according to the time. In the event that an output is not included in the measurement error calculation, or if it has not been measured in the first place, the corresponding entry in Q would be zero. Similarly, those outputs that are viewed as being more relevant to the parameter estimation problem are given higher weights.

For semi-quantitative data, we can slightly alter the formulation (2.15) to:

$$\min_{\theta} \sum_{d} \left[ y(t_d, \theta, u) - w\,y^{Exp}(t_d) \right]^{T} Q \left[ y(t_d, \theta, u) - w\,y^{Exp}(t_d) \right] \tag{2.16}$$

The additional term w in (2.16) represents additional weights used to scale the semi-quantitative experimental data. Examples of such data are standard Western blotting data, as well as flow cytometry data, that have not been converted to absolute terms using purified protein data.
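The weighted least squares error of (2.15) and its scaled variant (2.16) can be sketched in a few lines. The function name `weighted_sse`, its argument layout and the numbers below are assumptions for illustration; in particular, this sketch uses a single diagonal Q for all time points and one scaling weight per output.

```python
def weighted_sse(simulated, measured, weights, scale=None):
    """Sum over time points of (y - w*y_exp)^T Q (y - w*y_exp), with diagonal Q.

    simulated, measured: one row per time point, one column per output.
    weights: the diagonal of Q, one entry per output (0 ignores that output).
    scale: the weights w of (2.16), one per output; None means w = 1,
           which reduces the expression to (2.15).
    """
    n_outputs = len(weights)
    if scale is None:
        scale = [1.0] * n_outputs
    total = 0.0
    for y_sim, y_exp in zip(simulated, measured):
        for m in range(n_outputs):
            if weights[m] == 0:
                continue  # excluded or unmeasured output; its data entry is a placeholder
            residual = y_sim[m] - scale[m] * y_exp[m]
            total += weights[m] * residual ** 2
    return total


# Hypothetical data: two time points, two outputs, second output not measured.
simulated = [[1.0, 2.0], [2.0, 4.0]]
measured = [[0.5, 0.0], [1.5, 0.0]]
q_diagonal = [1.0, 0.0]

error_absolute = weighted_sse(simulated, measured, q_diagonal)
error_scaled = weighted_sse(simulated, measured, q_diagonal, scale=[2.0, 1.0])
```

An estimation algorithm would call such a function repeatedly, each time with model outputs simulated from a new candidate θ (and, for semi-quantitative data, a new candidate w).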

STEP 2. Define coefficients to be estimated

Here we notice that we want to estimate θ. According to the more comprehensive formulation (2.16), this includes the set of all ICs (x0), all the rate constants (k), and the scaling weights (w).

Most parameter estimation problems in computational biology are non-linear, probably non-convex and definitely non-intuitive. Tackling them head on would get us hopelessly bogged down and bring the modeling iteration to a freeze. This is primarily due to the computational complexity involved in the model simulation, a large part of which is faced in the parameter estimation step. Increasing the number of coefficients to be estimated by even one generates an exponential increase in the computational demands on the modeler. Defining the coefficients to be estimated in an intelligent way, and keeping the coefficient search space tractable, can help us avoid such a freeze.

Common sense dictates that the only way to reduce computational complexity during the parameter estimation exercise is to reduce the coefficient search space (have fewer unknown coefficients). This can be done either by intelligent guesses of some parameter values, or by performing a sensitivity analysis, or by a combination of the two. Sensitivity analysis involves finding out which parameters are important to our defined system outputs, ranking them according to this importance, and then including the higher-rank parameters in our estimation problem while randomly assigning values to the lower-rank ones.

Mathematically sensitivity analysis is based on the calculation of the ratio of change in system output (or some proxy for it) to the change in a parameter. This is the system output sensitivity with respect to that parameter. The sensitivity analysis results are dependent on which outputs we are interested in [83,38]. The most commonly used method is to calculate a sum of squares error between the model outputs with original parameter set (perhaps based on the literature) and the model outputs based on an altered parameter set. Custom functions may also be used [84].

At the end of such analysis, the modeler has a list of parameters prioritized according to their sensitivity value, and chooses those that the system is most sensitive to as candidates for estimation, while randomly assigning the remainder.
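The sum of squares ranking described in this step can be sketched as follows. The output function `model_outputs` is a hypothetical closed-form stand-in for a simulated ODE output, and the 10% perturbation size, the function names and the nominal values are assumptions made only for illustration.

```python
import math


def model_outputs(params, times):
    """Hypothetical stand-in for a simulated output: y(t) = a * exp(-b * t) + c."""
    a, b, c = params
    return [a * math.exp(-b * t) + c for t in times]


def sensitivity_ranking(model, nominal, times, rel_step=0.1):
    """Rank parameter indices by the sum-of-squares change in the output
    caused by perturbing each parameter by rel_step (10% by default)."""
    baseline = model(nominal, times)
    scores = []
    for i, value in enumerate(nominal):
        perturbed = list(nominal)
        perturbed[i] = value * (1.0 + rel_step)
        shifted = model(perturbed, times)
        sse = sum((s - b) ** 2 for s, b in zip(shifted, baseline))
        scores.append((sse, i))
    # Largest output change first.
    return [(i, sse) for sse, i in sorted(scores, reverse=True)]


times = [0.5 * n for n in range(21)]   # 0 to 10 in steps of 0.5
nominal = [10.0, 0.3, 0.01]            # hypothetical nominal values of a, b, c
ranking = sensitivity_ranking(model_outputs, nominal, times)
```

In this toy case the offset c barely changes the output, so it would be a candidate for fixing at an assigned value, while a and b would be kept in the estimation problem.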

STEP 3. Define parameter bounds

Since we are dealing with an optimization problem, making the problem tractable involves imposing constraints on the parameters. A good starting guess for such constraints is biologically feasible values from the literature, or values obtained through collaboration with biological labs. Such starting constraints can be further narrowed based on the specifics of the problem. Examples of such constraints are found in [38], or in the work presented in Chapter 4 of this thesis.

STEP 4. Choose parameter estimation algorithm

Parameter estimation problems in models of biological systems are more often than not non-linear, non-convex and multi-modal. This means standard local deterministic approaches such as Levenberg-Marquardt and Gauss-Newton often get stuck in local minima. We find in the literature several examples of practical workarounds.

Parameter estimation methods can be classified into deterministic, stochastic and hybrid methods. To circumvent local minima issues, we can either randomize the starts to the local method ([83]: pros: avoids local minima; cons: high computational cost), or we can apply direct search methods, which proceed by solving the optimization at a mesh of points around the current point ([85,86]: pros: avoids local minima; cons: high computational cost). Alternatively, we can apply purely stochastic methods such as genetic algorithms, simulated annealing and evolutionary strategy ([38,17]: pros: avoids local minima, parallelizable; cons: unreliable, may have high computational cost). Finally, we can also apply hybrid algorithms. In one such example in the literature, a stochastic global method (evolutionary strategy) found the neighborhood of the global optimum, and a local deterministic method was used to converge on the local optimum ([87,88]: pros: covers both global and local search aspects at appropriate resolutions; cons: potentially high computational cost associated with the stochastic method; remaining local minima may be resolved using multi-start approaches).
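The idea of randomized starts combined with a mesh-based direct search can be illustrated with a small sketch. The local method here is a basic compass (coordinate polling) search, used as a simple stand-in and not as the specific algorithms of the cited works; the error surface `toy_error`, the bounds, the number of starts and all function names are hypothetical.

```python
import random


def compass_search(objective, start, lower, upper, step=0.5, tol=1e-6):
    """Local direct search: poll +/- step along each axis, move to the best
    improving point, and halve the step when no poll point improves."""
    point, value = list(start), objective(start)
    while step > tol:
        best_point, best_value = None, value
        for i in range(len(point)):
            for delta in (step, -step):
                trial = list(point)
                # Keep every trial point inside the parameter bounds.
                trial[i] = min(max(trial[i] + delta, lower[i]), upper[i])
                trial_value = objective(trial)
                if trial_value < best_value:
                    best_point, best_value = trial, trial_value
        if best_point is None:
            step /= 2.0
        else:
            point, value = best_point, best_value
    return point, value


def multi_start(objective, lower, upper, n_starts=30, seed=0):
    """Run the local search from random starting points and keep the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        start = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        candidate = compass_search(objective, start, lower, upper)
        if best is None or candidate[1] < best[1]:
            best = candidate
    return best


def toy_error(p):
    """Hypothetical two-parameter error surface with a local minimum near
    p[0] = +0.96 and the global minimum near p[0] = -1.04."""
    return (p[0] ** 2 - 1.0) ** 2 + 0.3 * p[0] + p[1] ** 2


best_point, best_value = multi_start(toy_error, [-2.0, -2.0], [2.0, 2.0])
```

A single local search started on the wrong side of the surface would stop at the poorer minimum; repeating it from many starts trades extra computation for a better chance of reaching the global one, which is the cost-benefit balance noted above.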

STEP 5. Revisit model formulation if no satisfactory solution found

In the absence of a solution to our parameter estimation problem, there is either something lacking in the parameter estimation process itself, or in the

model formulation including unmodeled dynamics. If we assume that our

parameter estimation has been fairly rigorous, then our model formulation must

be at fault. Further discussion of this case can only be done in practice or

through examples. While this is outside the scope of this thesis, Voit [89]

provides a detailed treatment of parameter estimation and sensitivity analysis

with several examples. We provide the mammalian cell cycle model calibration in

Chapter 4 of this thesis.

2.7. Role of Data in Building Predictive Models

The term computational model (in predictive models) includes within it the

numbers that are used in the model; i.e. model is not only model structure and

modeling methodology- it also includes the initial conditions and rate constant

values used. Changing the numbers in a model (model parameters) while

retaining the model structure often generates widely divergent simulation

characteristics. Our work in progress for the Flt3 signaling pathway demonstrates

precisely how large an issue the paucity of data is. Note how in the absence of

such data, the model is practically useless for numerical analysis or simulation-

since converging on the correct set of parameters is practically very difficult and often success is quite limited, and the modeler often has to reach a compromise

and estimate some parameters, and perform guesswork on others. A more detailed treatment of this topic is presented in Chapter 3, Section 3.8.

2.7.1. Data measurement introduction

Data paucity is a long-running woe in the systems biology community. When data is available it suffers from a host of problems. These include data being gathered from different cell lines, and different species, with such heterogeneity distancing it from what one would expect in real biology. Moreover, data measurement technologies usually give us some proxy for the data that must then be converted into a suitable form. This conversion has not been formally studied in different measurement technologies using precise mathematical language, and this possibly accounts for why a number of the conversion methods are not geared toward the requirements of computational biologists.

The most common data measurement technique in biological systems is the quantitated Western Blot, and this gives us concentrations of the biochemical being measured in terms of the blot intensity. The measure obtained is strictly relative to the sample being measured in the Western Blot, and we can compare values in a single blot with each other only. An example of a Western Blot from


[90] is shown here in Figure 2.7.

Western Blots give us the amount of protein averaged over the entire

sample (not the absolute concentration). This is valuable data, and measuring

the same at different time points we can obtain a dynamic time profile. However

this is very labor intensive. The Western Blot protocol is described in Section

2.6.3.

Figure 2.7. Western Blot example [90]. Cyclin B1 expression is dependent on amount of Trypsin inhibitor here.

Contrast this understanding, with the data in Figure 2.8, which shows a

Cyclin B1 time profile generated using flow cytometry data. Flow data gives us the additional information of how the biochemical of interest (in this case Cyclin

B1) is distributed over a given population of cells. Notice how this distribution

over the population of cells is beautifully mined here for obtaining a time profile, without having to repeat the experiment at different times. This method is

described in detail in Chapter 5.

Figure 2.8. Flow cytometry example [90]. Dynamic time profiles of Cyclin B1 obtained through

single flow cytometry measurements (for different cell lines)

We now turn our attention to the issue of the form in which data is available.

It is one thing to say we need time profile data, and that obtaining it is labor


intensive. It is another thing to understand that what is read out as data is never

an absolute concentration of the biochemical of interest. It is generally a proxy for

the biochemical concentration, sometimes several times removed from it.

We are ideally interested in the absolute in vivo concentration of a certain biochemical, presumably a protein. However, with current measurement techniques, the best starting point that can be managed is the absolute in vitro concentration of that biochemical.

Continuing our adoption of the mathematical terminology from Soebiyanto

[29], with reference to the system described in equation (2.7), we note the following relationship between observable output (y) and the system state variable (biochemical) being measured (x):

$$y = \mathrm{int}\left( C\,x(t, k, u) \right) \tag{2.17}$$

where,

int(.): function of intensity;

$C \in \mathbb{R}^{M \times N}$: matrix that expresses output as a linear combination of state variables;

x(t,k,u): vector (array) of state variables, each of which is a function of time (t), rate constants (k) and input (u);

y: output vector (array);

The formulation (2.17) varies according to the measurement technique.

Keeping this in mind, we lay out the broad cases possible when it comes to time profile data:

1. Absolute data

Non-relative or absolute measures of biochemicals within the cell can be obtained using purified preparations of the same biochemical as a standard [90].

E.g. Western Blot (or Flow cytometry) with purified or recombinant protein available to convert relative intensity values into absolute concentration values for the specific experimental measurement of the given biochemical (i.e. for the specific Blot or Flow run).

Here we can perform the most straightforward conversion, given by:

$$ \mathrm{int}\big(C\,x(t,k,u)\big) = \alpha^{T} C\,x \tag{2.18} $$

2. Data absolute within bounds

E.g. Western Blot (or Flow cytometry) with purified protein general conversion available

Practically, it may not be possible to run a purified protein experiment for every experimental measurement of every biochemical. In such a case, the norm is to generate a standard conversion curve using a single purified protein experiment, which gives us the conversion between the intensity (of the Blot or the fluorescent chemical in Flow) and the concentration of the biochemical of interest. We can still apply equation (2.18), but the value of α is not the same for a specific experimental measurement as it is for the generic curve generated. Hence its value must be estimated. One way of doing this is by including it in our parameter estimation problem described in equation (2.15):

$$ \min_{k,\,\alpha}\; \big(y(t,k,\alpha,u) - y^{Exp}(t)\big)^{T}\, Q\, \big(y(t,k,\alpha,u) - y^{Exp}(t)\big) \tag{2.19} $$
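As an illustration of how the conversion factor could be estimated, the following minimal sketch treats the simplest special case: a single scalar α, a diagonal weighting matrix Q, and model concentrations that are already fixed (i.e. the rate constants are not re-estimated at the same time). These simplifications, and the function name `estimate_scale`, are assumptions made for illustration only; they are not the estimation procedure of equation (2.15).

```python
def estimate_scale(model, measured, weights=None):
    """Weighted least-squares estimate of a scalar conversion factor alpha.

    Minimizes sum_i w_i * (alpha * model[i] - measured[i])**2.
    Setting the derivative with respect to alpha to zero gives the
    closed form alpha = sum(w*m*y) / sum(w*m*m).

    model    -- model-predicted concentrations at the sampled times
    measured -- measured intensities at the same times
    weights  -- diagonal entries of Q (defaults to all ones)
    """
    if len(model) != len(measured):
        raise ValueError("model and measured must have the same length")
    if weights is None:
        weights = [1.0] * len(model)
    num = sum(w * m * y for w, m, y in zip(weights, model, measured))
    den = sum(w * m * m for w, m in zip(weights, model))
    if den == 0:
        raise ValueError("model output is identically zero; alpha is not identifiable")
    return num / den


# Illustrative (made-up) numbers: intensities are exactly twice the concentrations.
alpha_hat = estimate_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a full treatment α would be appended to the vector of unknown rate constants and estimated jointly with them, which generally requires an iterative optimizer rather than this closed form.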

3. Relative data

In the event of data being relative, and there being no access to a purified protein, the measured intensity is normalized using a reference point in time (tr) according to the following formulation:

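$$ y = \mathrm{diag}\!\left( \frac{1}{\mathrm{int}\big(C_1 x(t_r,k,u)\big)},\; \frac{1}{\mathrm{int}\big(C_2 x(t_r,k,u)\big)},\; \ldots,\; \frac{1}{\mathrm{int}\big(C_n x(t_r,k,u)\big)} \right) \mathrm{int}\big(C\,x(t,k,u)\big) \tag{2.20} $$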

E.g. Western Blot (or Flow cytometry) without purified protein experiment.
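The normalization to a reference time point can be sketched in a few lines. This is a minimal illustration, assuming the measured intensities are stored as one row per time point and one column per measured biochemical; the function name `normalize_to_reference` and the sample numbers are hypothetical.

```python
def normalize_to_reference(intensities, ref_index):
    """Divide every time point by the intensities at the reference time t_r.

    intensities -- list of rows, one row per time point, one entry per channel
    ref_index   -- index of the row used as the reference time point
    Returns relative intensities; the reference row becomes all ones.
    """
    reference = intensities[ref_index]
    if any(value == 0 for value in reference):
        raise ValueError("reference intensities must be non-zero")
    return [
        [value / ref for value, ref in zip(row, reference)]
        for row in intensities
    ]


# Two time points, two channels (made-up numbers); the first row is the reference.
relative = normalize_to_reference([[2.0, 10.0], [4.0, 5.0]], ref_index=0)
```

Each channel is scaled independently, which corresponds to the diagonal matrix in the formulation above.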

2.7.2. Data measurement processes

The primary challenge in systems biology models, broadly speaking, is data management. This can mean different things for different stripes of systems biologist. For instance, a bioinformatician who uses the title systems biologist must extract meaning from vast amounts of data that is primarily static in nature. On the other hand, our brand of systems biology places emphasis on time profile data (biochemicals, primarily proteins, as functions of time). In our case the challenge is not managing large amounts of data, but managing a lack of contextually consistent time profile data in models.

Systems biology is a field in its early stages, but it does have an intermittent history of four decades, of which the last decade has seen a concerted effort on part of different investigators worldwide to make it a cohesive and well organized science. In all of this time, why has there been no answer to the specific data paucity that seems to plague systems biology as we practice it?

The answer is chiefly rooted in a lack of the will to organize a consistent effort in the direction of gathering contextually consistent time profile data. The fact that the upper limit for the collection of time profile data is set by the rate of advances in measurement technology also contributes to the lack of such data.

Luckily such limitations do not entail a complete abandonment of the systems dynamics approach to systems biology. The latter limitation, on the contrary, has helped fuel rapid twin advances in data measurement methods and in computational methods to circumvent it. We try to understand pertinent issues that surface in the measurement process through illustrative example measurement methods, namely Western Blotting and flow cytometry.

2.7.3. Western Blotting

Usually the term Western Blotting is used when the biochemical we are

interested in is a protein. The steps in the Western Blotting process are listed as:

1. Select protein to be measured (antigen)

2. Determine if relative or absolute measure is required (we shall assume we

are satisfied with a relative measure in the steps that follow)

3. Determine if a two-step or one-step blotting process will be used (i.e. two

antibodies or one; the two-step process is more popular due to various

practical reasons, and we shall assume this in the steps that follow)

4. Extract cell lysate containing protein

5. Perform gel electrophoresis on cell lysate

1. Place lysate on gel (agarose or polyacrylamide)

2. Apply electric field (as a result proteins in lysate are separated

according to size)

6. Transfer the size-sorted contents of the cell lysate from the gel to a

membrane (usually made of nitro-cellulose); process used is

electroblotting

7. Immuno-probing is carried out (using antibodies to probe for the antigen)

1. Primary antibody is added to the blot and excess is washed away

(repeated washings)

2. Secondary antibody conjugated (covalently coupled) to HRP (Horse

Radish Peroxidase) is added to the blot and excess is washed away

(repeated washings)

3. Blot is overlaid with buffer containing chemiluminescent luminol and

H2O2 (hydrogen peroxide)

8. Wipe off buffer

9. Light is emitted wherever secondary antibody is present on the blot, since

HRP uses H2O2 to oxidize the luminol and produce light (called Enhanced

Chemiluminescence (ECL))

10. Perform laser densitometry to scan the bands formed by the luminescence

(over and under exposure are major issues here, and potential sources of

error)

The points where inaccuracies may creep into the Western Blotting process can be encapsulated in Figure 2.9:

[Figure 2.9 labels: X: to be measured; XP: first proxy; XS: second proxy; XI: third proxy]

Figure 2.9. Western Blotting proxies (modified from Wolkenhauer's unpublished textbook [43]).

Different points along the data gathering process where errors and inaccuracies may creep in. Here, Ras is the biochemical that is being measured (antigen). We have denoted this by 'X'. The concentration of the primary antibody that tags it is denoted by 'XP'; the secondary conjugated with the enzyme HRP by 'XS' and the light emission intensity by 'XI'.

The reasons why X, XP, XS, and XI are error-prone in the blotting process are many. With reference to the steps in the blotting process, we note the following:

1. Assume that the lysation, and electrophoresis processes are error free

2. When the size-sorted proteins are transferred to the membrane, not all

may transfer

3. In immunoprobing, when the primary antibody is added, not all of it binds

to the antigen

4. When the secondary is added, not all of it binds to the primary

5. The time that the membrane is incubated with ECL buffer varies, and this

determines how much of the luminescent chemical reacts with the

secondary antibody

6. The time that the blot is exposed to the film determines how the bands on

the film turn out

In short, we can say that the signal detected can be represented as follows:

$$ S = f(X, P_{AA}, S_{AA}, ECL, ET) \tag{2.21} $$

where:

S = Signal captured by densitometer

PAA = Primary Ab affinity,

SAA = Secondary Ab affinity,

ECL = ECL buffer exposure conditions,

ET = film exposure time

Note here that:

$$ S = f_I(X_I) = f_I \circ f_S(X_S) = f_I \circ f_S \circ f_P(X_P) = f_I \circ f_S \circ f_P \circ f_X(X) \tag{2.22} $$

where '$\circ$' denotes function composition, and $f_I \circ f_S \circ f_P \circ f_X(\cdot)$ is the same as f(.) from (2.21), although the other parameters have not been shown here.

An example of detailed treatment of Western Blot process modeling can be found in [55], which takes us through the different steps in the Western Blotting

signal chain. This chain includes the different steps that are displayed as different

functions in equation 2.22. Adapting their model to our notation, we have:

a XfXXPX() P,max - (2.23) X - X0

Where:

a = constant

X0= constant

X0 and ‘a’ have no direct physical significance but are constants that describe the

curvature of the hyperbola.

Similarly the relationship between the primary antibody and the secondary

antibody is also taken as being hyperbolic. We have:

b XfXXSPP() S,max - (2.24) XPP- X 0

Substituting (2.23) in (2.24) we get:

$$ X_S = f - \frac{(-e)}{X - d} \tag{2.25} $$

where $d = X_0 + \frac{a}{c}$, $c = X_{P,\max} - X_{P0}$, $f = X_{S,\max} - \frac{b}{c}$ and $e = (X_S - f)\,(X - d)$.
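The substitution can be followed step by step. The worked derivation below uses only the hyperbolic forms for $X_P$ and $X_S$ given above; the closed-form value of $e$ in the last line is a consequence of that algebra rather than a statement taken from [55].

```latex
\begin{align*}
X_S &= X_{S,\max} - \frac{b}{X_P - X_{P0}}
     = X_{S,\max} - \frac{b}{c - \dfrac{a}{X - X_0}},
     & c &= X_{P,\max} - X_{P0} \\[4pt]
    &= X_{S,\max} - \frac{b\,(X - X_0)}{c\,(X - X_0) - a}
     = X_{S,\max} - \frac{b}{c}\cdot\frac{X - X_0}{X - d},
     & d &= X_0 + \frac{a}{c} \\[4pt]
    &= X_{S,\max} - \frac{b}{c}\left(1 + \frac{a/c}{X - d}\right)
     = f - \frac{ab/c^{2}}{X - d},
     & f &= X_{S,\max} - \frac{b}{c} \\[4pt]
\Rightarrow\quad e &= (X_S - f)\,(X - d) = -\frac{ab}{c^{2}} .
\end{align*}
```

So $e$ is a constant that depends only on $a$, $b$ and $c$, and the composite relation between $X_S$ and $X$ is again a hyperbola.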

Additionally, we also note the following relations. The ECL reaction is treated as an enzyme-substrate reaction with the substrate being abundant. Here the secondary antibody (XS) is bound to the enzyme. The intensity of light captured by the densitometer (or local light emission rate ‘I’) is given as a function of the local concentration of the enzyme (same as that of the secondary and hence XS) and the enzyme reaction rate (VECL).

$$ I = X_S \cdot V_{ECL} \tag{2.26} $$

and the reaction rate (VECL) is determined by the Michaelis-Menten equation (applicable since substrate is abundant):

$$ V_{ECL} = \frac{V_{\max} \cdot X_{subst}}{K_m + X_{subst}} \tag{2.27} $$

Substituting (2.27) in (2.26), we get a hyperbolic relation between the light intensity and the initial protein that we tried to measure: with abundant substrate VECL does not depend on X, so the intensity inherits the hyperbolic dependence of XS on X from (2.25). While this may not be the case in all Western Blotting experiments, this example shows how the measurement process must be approached if we are to quantify the (potential) errors and uncertainties that abound at each stage of the process. Such quantification has been attempted in [91], where the measurement noise is dealt with.
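The signal chain of equations (2.23) to (2.27) can be checked numerically. The sketch below evaluates the chain stage by stage and compares it with the closed form of (2.25) multiplied by the ECL rate. All parameter values in `PARAMS` are hypothetical and chosen only so that the hyperbolas are well behaved for non-negative X; they are not taken from [55].

```python
def primary_bound(x, xp_max, a, x0):
    """Eq. (2.23): bound primary antibody X_P as a hyperbolic function of X."""
    return xp_max - a / (x - x0)


def secondary_bound(xp, xs_max, b, xp0):
    """Eq. (2.24): bound secondary antibody X_S as a hyperbolic function of X_P."""
    return xs_max - b / (xp - xp0)


def ecl_rate(x_subst, v_max, k_m):
    """Eq. (2.27): Michaelis-Menten rate of the ECL reaction."""
    return v_max * x_subst / (k_m + x_subst)


def light_intensity(x, p):
    """Eq. (2.26) evaluated through the chain X -> X_P -> X_S -> I."""
    xp = primary_bound(x, p["xp_max"], p["a"], p["x0"])
    xs = secondary_bound(xp, p["xs_max"], p["b"], p["xp0"])
    return xs * ecl_rate(p["x_subst"], p["v_max"], p["k_m"])


def light_intensity_closed_form(x, p):
    """Same quantity using the single hyperbola of Eq. (2.25)."""
    c = p["xp_max"] - p["xp0"]
    d = p["x0"] + p["a"] / c
    f = p["xs_max"] - p["b"] / c
    e = -p["a"] * p["b"] / c**2  # follows from substituting (2.23) in (2.24)
    xs = f + e / (x - d)
    return xs * ecl_rate(p["x_subst"], p["v_max"], p["k_m"])


# Hypothetical parameter values, for illustration only.
PARAMS = {
    "xp_max": 10.0, "a": 4.0, "x0": -1.0,
    "xs_max": 8.0, "b": 6.0, "xp0": -2.0,
    "x_subst": 100.0, "v_max": 1.0, "k_m": 5.0,
}

for x in (0.5, 3.0, 20.0):
    chained = light_intensity(x, PARAMS)
    closed = light_intensity_closed_form(x, PARAMS)
    print(f"X={x:5.1f}  chained={chained:.6f}  closed form={closed:.6f}")
```

The two columns should agree to rounding error, which is a quick sanity check on the algebra before any attempt is made to fit such a model to densitometry data.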

We do a similar analysis for the flow cytometry process, following the same pattern of discussion. We first go over the flow cytometry protocol in detail, and then sift through the steps again to point out where the potential for errors and inaccuracies lies.

2.7.4. Flow Cytometry

The first few steps of the flow cytometry process are the same as Western

Blotting. The main differences are:

1. Instead of HRP and luminol, a fluorescent chemical is used.

2. Instead of sending the tagged sample through a densitometer, it is sent

through a flow cytometer.

3. The term Western Blotting is used when we are measuring only proteins,

while flow cytometry refers to the measurement of any biochemical

(protein, DNA, and so on) using fluorescent emission captured by a flow

cytometer.

4. Cells are usually intact in flow cytometry, but cell lysate is used in Western

Blots

The flow cytometry process can be listed as consisting of the following points:

1. Select protein to be measured (antigen)

2. Determine if relative or absolute measure is required (we shall assume we

are satisfied with a relative measure in the steps that follow)

3. Determine if a two-step or one-step staining process will be used (i.e. two

antibodies or one; the two-step process is more popular due to various

practical reasons, and we shall assume this in the steps that follow)

4. Take cell sample containing the biochemical to be measured.

5. Immuno-probing is carried out (using antibodies to probe for the antigen)

1. Primary antibody is added to the cell sample and excess is washed away

(repeated washings)

2. Secondary antibody conjugated (covalently coupled) to fluorescent

chemical is added to the cell sample and excess is washed away (repeated

washings)

6. Immuno-probed sample is sent through the flow cytometer where the

following internal steps happen

1. The sample is subjected to hydrodynamic focusing

2. One cell at a time is subjected to a laser (the point where this happens

is referred to as the interrogation point)

3. The laser light is reflected and refracted by the cell and captured by

different detectors

1. Light that scatters due to size of the cell (i.e. it bounces off the cell's

exterior itself) is called Forward Scatter and is captured by a

Forward Scatter detector

2. Light that scatters due to the surface features of the cell, or which

penetrates the cell and scatters due to its internal features is

referred to as Side Scatter and is captured by a Side Scatter

detector.

3. Light that causes fluorescence from the fluorescent tag is captured

by a detector that has the appropriate frequency range

4. The signal captured by the detectors is sent to an electronics and

computer system

5. The readout that is obtained is usually a curve that is generated. This

is because each cell that passes the laser spends a finite amount of

time passing the laser, and thereby generates a wave.

The flow cytometry data collection system is depicted in Figure 2.10.

[Figure 2.10 labels: Fluidics system; Laser; Interrogation point; Forward Scatter Detector; Fluorescence Detectors; BPFs; Electronics system]

Figure 2.10. Data collection using flow cytometry (Image adapted from the Invitrogen flow cytometry tutorial). Here BPF refers to Band Pass Filter.

The flow cytometry values used are usually the area under the curve that is obtained when a cell passes the laser. We next list the major steps that relate

‘what it is that we are trying to measure’ to ‘what it is that we obtain through the measurement process’.

As discussed for Western Blots, the best starting point that can be managed is the absolute in vitro concentration of the biochemical we want to measure. This is tagged with a primary antibody whose amount is the first proxy. The amount of secondary antibody conjugated with the fluorescent chemical that binds to this primary antibody is the second proxy. The amount of fluorescent chemical that is exposed to the laser is the third proxy. The amount of fluorescence emitted that is captured by the detectors is the fourth proxy. This fluorescence is essentially captured in the form of a time curve, which is the fifth proxy. The area under this time curve is the value that is often used as the value of the biochemical corresponding to that cell. This is what shows up on a flow cytometry scatter plot. It is the sixth and final proxy for the biochemical.
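The last step of this chain, reducing the detector's time curve to a single per-cell value, can be illustrated with a simple numerical integration. This is a minimal sketch assuming the pulse is available as sampled (time, voltage) pairs and using the trapezoidal rule; real instruments perform this integration in their own electronics or firmware, and the function name `pulse_area` and the sample pulse are hypothetical.

```python
def pulse_area(times, voltages):
    """Area under a sampled detector pulse, computed with the trapezoidal rule.

    times    -- sample times in increasing order
    voltages -- detector output at those times
    """
    if len(times) != len(voltages):
        raise ValueError("times and voltages must have the same length")
    if len(times) < 2:
        raise ValueError("at least two samples are needed")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        mean_height = 0.5 * (voltages[i] + voltages[i - 1])
        area += width * mean_height
    return area


# A made-up triangular pulse: rises to 2.0 and falls back to 0.0 over 2 time units.
area = pulse_area([0.0, 1.0, 2.0], [0.0, 2.0, 0.0])
```

Because the area depends on both the height and the width of the pulse, anything that changes the transit time of a cell through the laser beam also changes this final proxy.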

Developing a mathematical model of the flow cytometry measurement process (as done for Western Blotting above) is work in progress. Some initial thoughts on how this work will proceed are discussed below.

Think of the proteins that we are initially interested in measuring as x. The fluorescent chemicals (conjugated to secondary antibodies) that function as proxies for these proteins can then be represented by xa. Similarly, the emissions of these fluorescent chemicals function as a proxy for the amount of the chemical that was taken up by each cell, so let us represent this emission amount by xb. In the same vein, the amount of emission that is trapped by the photo detectors is xc, the response of the electronics system that is connected to the photo detectors is xd, and the readout of the electronics system that we see on the screen is xe.

Hence xe is a proxy for xd, which is in turn a proxy for xc, which is in turn a proxy for xb, which again in turn is a proxy for xa. In other words:

$$ x_e = f_e(x_d) = f_e \circ f_d(x_c) = f_e \circ f_d \circ f_c(x_b) = f_e \circ f_d \circ f_c \circ f_b(x_a) = f_e \circ f_d \circ f_c \circ f_b \circ f_a(x) \tag{2.28} $$

The purpose of this thought process is to clarify to the furthest extent possible the nature of the composite function (to the extreme right) in Equation 2.28. This will help us find a practical way to evaluate its inverse function so we can recover the amount of protein (which we intended to measure in the first place) from the cytometer electronics system readout.
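One practical way to evaluate such an inverse, when the composite function is monotonically increasing but has no convenient closed-form inverse, is numerical root finding. The sketch below composes five stage functions and inverts the result by bisection. The stage functions `f_a` to `f_e` are invented placeholders (a saturating uptake followed by linear stages and an offset); the actual forms of these functions are exactly what remains to be determined.

```python
def compose(*stages):
    """Return a function that applies the given stages in order, first to last."""
    def chain(x):
        for stage in stages:
            x = stage(x)
        return x
    return chain


def invert_monotone(func, target, lo, hi, tol=1e-9, max_iter=200):
    """Find x in [lo, hi] with func(x) == target, for an increasing func, by bisection."""
    if not (func(lo) <= target <= func(hi)):
        raise ValueError("target is outside the range of func on [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical stage functions, for illustration only.
def f_a(x):   # protein -> bound fluorescent label (saturating)
    return 50.0 * x / (2.0 + x)

def f_b(xa):  # label -> emitted fluorescence
    return 0.8 * xa

def f_c(xb):  # emitted -> captured by the photo detectors
    return 0.6 * xb

def f_d(xc):  # detector -> electronics response (gain)
    return 10.0 * xc

def f_e(xd):  # electronics -> on-screen readout (baseline offset)
    return xd + 1.5


readout = compose(f_a, f_b, f_c, f_d, f_e)

x_true = 3.0
x_recovered = invert_monotone(readout, readout(x_true), lo=0.0, hi=100.0)
print(x_true, x_recovered)
```

Bisection only requires that the composite be monotone on the search interval; if any stage saturates, the inverse becomes poorly conditioned near the plateau, which is itself a useful diagnostic of where the measurement loses information.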

2.8. Data driven systems biology thinking

We now illustrate through an example, how a systems biologist proceeds to think, so as to set up the solution to a biological problem that is likely to have adequate data, so that it is ultimately tractable.

The starting point is the choice of the problem that must be solved. Let us assume that we chose intelligently. For instance, let us say we choose something like leukemia because it is relatively easy to sample blood cells, and because the wealth of correlated leukemia data makes it a good model system. Next, we must choose the measurement method. Let us say we choose flow cytometry for some reason (the advantages of flow cytometry are discussed in Chapter 5). In such an event, can we shine a systems scope on the problem?

2.8.1. In vivo data

The first point in the process, now that we have decided upon the problem to be tackled and the experimental technique, is the source of experimental data. Of course we want this to be the best possible source. The very best would be an in vivo source. We want to be able to get cells in their native environment. Let us assume that we have this choice available. Now presumably we are interested in human leukemia. In this case, what are our chances of getting cells which have the precise micro-environments that they have when they are inside the body? Moreover it must be remembered that we are interested in systems dynamics. Whatever the difficulties of obtaining these dynamics in vivo, they are magnified when the case in point is humans, since in vivo trials in humans are on a strictly voluntary basis, and that too for a very skewed section of the population. Almost always this section includes people who are advanced in their cancer, and for whom, for some reason, the other available cancer therapies are not working or are not applicable. In a nutshell, when we look at most leukemia data, it is not in vivo for this very reason. Moreover animal studies capture the in vivo dynamics in a very limited manner (references needed).

2.8.2. Ex vivo data

The next best source of data would be ex vivo in nature. Such data can include human cells as well. Of course there are various contextual challenges in obtaining these cells. For instance, in the case of leukemia, blood cells may paint an accurate picture in more advanced cases; however, one would have to obtain bone marrow aspirate for the less advanced cases (this aspirate is in fact one of the standard ways in which leukemia is detected in the first place). Now we start attempting to understand where error begins to creep into the measurement process. What is the difference between the signaling dynamics that go on inside the cell when it is inside the body and those that go on when the cell is taken outside the organized complexity of the body? One can only surmise that the differences are manifold.

However, assuming that one adopts the best experimental practices, how would one go about quantifying the change between the signal dynamics that are frozen in the cells that have been extracted from the body, and the dynamics that were initially present when those cells were in the body? How rapid is this change? Were significant transients lost? Is there any way to find out? It appears that many of these questions cannot be answered using state of the art measurement techniques. However, in this paragraph we begin to show how the accumulation of errors starts in the data measurement process.

2.8.3. In vitro data

To reiterate, our system is leukemia cells, and we have chosen flow cytometry as the measurement technique. The third type of source available is in vitro. This is the most commonly used source of data in modeling studies, and in almost all human studies. The cell lines used have usually been immortalized, or cultured repeatedly in petri dishes, so that their present states reflect stable characteristics associated with time spent in a petri dish. In other words, any transient traits that persisted for some generations due to their ancestors having been initially ex vivo cells have long since died out.

Let us assume, as our starting point for our experimental technique, a standard scenario. We are interested in the dynamics of a signaling pathway that has been implicated in some form of leukemia. Furthermore, we begin to measure these dynamics in vitro. These cells are already different in dynamics

from in vivo cells. We appreciate this fact, and modify our goal accordingly. Our goal now is to measure accurately the signaling dynamics of in vitro cells, and perhaps conjecture at a later point as to how their in vivo counterparts would be.

2.8.4. Measurement decisions

The moment we begin the measurement process in flow cytometry, we realize that obtaining dynamics data means measuring the same biochemical repeatedly at different points in time. To understand the errors that creep into the measurement process, let us in the present discussion restrict ourselves to measuring just one such time point. We can deal with the additional issues and complications that arise from multiple measurements later. Our goal now is to measure accurately the levels of a specific biochemical in a sample of in vitro cells that have been frozen at a specific point in time. Our cells have not been synchronized in any way, so the level of the biochemical of interest is likely to reflect the heterogeneous nature of the cell population.

Figure 2.11. Decision making in flow cytometry. The choice of model system, data source, measurement method and biochemical to be measured in our flow cytometry work is mapped to illustrate our thinking.

The first measurement step is to add a primary antibody to the cells, assuming that an antibody for the biochemical of interest was found. Now every antibody that is designed for a biochemical has a certain success rate of binding. In other words, not all the antibody binds to it. Moreover, a certain amount of the antibody also binds to other biochemicals within the cell. This is called non-specific binding. These are the very first errors in the measurement process.

Hence the amount of bound primary antibody is proportional to the amount of biochemical, but not identical to it. Note also that many biochemicals have isoforms, and the antibodies available are usually poor discriminators between such isoforms. Very often the computational biologist is interested in a specific isoform, but must settle for the concentration of the sum total of all isoforms of the biochemical present. This is an additional error introduced, and would require some modeling assumptions and/or corrections.

The second measurement step is the binding of the secondary antibody to the primary antibody. Again the success rate, for reasons mentioned earlier, is not a hundred percent. This further compounds the error. The amount of secondary antibody that binds to the primary is proportional to the amount of primary but not identical to it. Hence, the amount of secondary antibody is proportional to the amount of the biochemical we want to measure.

The secondary antibody is usually conjugated (or bound) to a fluorescent chemical, in the flow cytometry process. Usually such binding is within the biologists’ hands, and so we assume that the entire secondary antibody amount is conjugated to the fluorescent chemical.

The cell that has been thus tagged with this fluorescent chemical, as described in Section 2.7.4 (Figure 2.10), is passed through the interrogation point in the flow cytometer. At this point, the light from a laser is incident on it. Now while the light from the laser irradiates the cell, not all molecules of fluorescent chemical receive the light. Since the internal structure of the cell is very crowded, there are many structures that can potentially obstruct the laser. Some of these are optically dense/opaque, and would effectively reduce the number of molecules of fluorescent chemical that receive the laser light. Hence the amount of fluorescent chemical that is irradiated with laser light is proportional to, but not identical to, the amount of fluorescent chemical. In other words, the amount of irradiated fluorescent chemical is proportional to the amount of biochemical to be measured.

The next stage involves reflection/refraction of the received laser light by the fluorescent chemical, as well as the absorption of light energy by the chemical to cause fluorescence. These are captured by various detectors in the flow cytometer. Now there are several possible errors at this stage of the process. For instance, the laser doesn't always emit the same amount of radiation. There is a certain variance associated with the laser emission. Similarly, not all the fluorescent chemical in the cells receives the laser light. Since the cell is densely crowded with numerous structures, there are several ways in which the laser light incident on the cell can be obstructed from reaching those locations within the cell where the fluorescent chemical is. Additionally, the non-specifically bound fluorescent chemical that receives laser light also adds to the error.

When the fluorescent chemical within each cell receives laser light, it characteristically emits fluorescent light of a particular wavelength. Since in most flow cytometry experiments we attempt to measure several biochemicals simultaneously, the cell contains numerous fluorescent chemicals. The emission spectra of these chemicals often overlap, causing an error referred to as spectral overlap. Since each detector is specifically designed to capture light of a certain wavelength range, and such overlap causes the fluorescence from two different biochemicals to fall within that range, we must subtract out the light that the detector wasn't meant to receive. Usually we do this by applying a bias. However, since this is done manually, the compensation is not precise and also contributes to the error. Moreover the detector also captures stray light.
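To make the idea of subtracting out spillover concrete, the following sketch shows a linear two-detector compensation. It is an illustration under stated assumptions, not the manual bias adjustment described above: it assumes that each dye contributes a fixed, known fraction of its own signal to the other detector (for example as estimated from single-stained controls) and that the contributions add linearly. The function name `compensate_two_channels` and the numbers are hypothetical.

```python
def compensate_two_channels(obs1, obs2, spill_1_into_2, spill_2_into_1):
    """Recover the per-dye signals from two overlapping detector readings.

    Assumed linear mixing model:
        obs1 = true1 + spill_2_into_1 * true2
        obs2 = spill_1_into_2 * true1 + true2
    Solving this 2x2 linear system gives the expressions below.
    """
    det = 1.0 - spill_1_into_2 * spill_2_into_1
    if det == 0:
        raise ValueError("spillover fractions make the system singular")
    true1 = (obs1 - spill_2_into_1 * obs2) / det
    true2 = (obs2 - spill_1_into_2 * obs1) / det
    return true1, true2


# Made-up example: dye 1 leaks 20% into detector 2, dye 2 leaks 10% into detector 1.
# True signals of 100 and 50 would be observed as 105 and 70.
true1, true2 = compensate_two_channels(105.0, 70.0, spill_1_into_2=0.2, spill_2_into_1=0.1)
```

Any error in the assumed spillover fractions propagates directly into the recovered signals, which is one way of seeing why imprecise compensation contributes to the overall measurement error.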

Errors                                                   Amount
Electronic Noise                                         <0.01%
Laser Variance                                           <0.1%
Room Temp                                                <0.1%
Stray Light                                              <0.5%
Fixation Artifacts (epitope availability)                varies/unknown/relative
Autofluorescence (CV 30%)                                <2% of max signal
Spectral Overlap                                         70% ph3 signal
Technical Errors (Staining/pipetting/staining time/
  fixation time/waiting time)                            Unknown
Run conditions (Machine speed/Operator error)            Unknown

Table 2.2. Flow cytometry measurement error percentages.

The detector is the interface between the optics system of the flow cytometer and its electronics system. This electronics system is subject to electronic noise. Some errors in the flow cytometry process are tabulated in Table 2.2.

This table gives us yet another indication of the number of unknowns in state of the art measurement methods.

3. TIME PROFILE EXTRACTION FROM WET LAB DATA

3.1. Overview

We explain our data extraction methodology in this chapter. In other words, we explain the methodology involved in processing statically sampled multi-variable flow cytometry data in order to get the time profiles of state variables of interest.

3.2. Chapter Organization

We start out this chapter by introducing the importance of data in the context of models of biological processes (Section 3.3), thus preparing the ground for the rest of the chapter.

In Section 3.4, we reason why flow cytometry was the most appropriate measurement method in our case (modeling the cell cycle control system in the context of CML/AML). An explanation of the flow cytometry process can be found in chapter 2. For a comprehensive introduction to flow cytometry see [92-94]. For a methodological overview of the technology and science applied to the mammalian cell cycle see [95-97].

Section 3.5 presents the state of the art in the literature with respect to study of biochemical time profiles. In this context it prepares the rationale for the methodology presented in this thesis.

Section 3.6 discusses how we ‘convert’ the flow cytometry read out into time profile data that can be used in calibrating/validating computational models of biological processes (in this case a non-linear ODE model of the cell cycle). This process of conversion includes experimental error reduction (data pre-processing), application of the biologist’s intuition on the cell cycle to develop heuristics whose application yields biochemical concentration as a function of cell cycle time (measured as a percentage), as well as an application of measurement method knowledge to decide which portions of the signal have inherently high noise and hence need additional noise filtering.

We then discuss CytoSys (Section 3.7), our custom software that facilitates application of the above methodology.

In Section 3.8, we present the future work to be done in this area, and we conclude the chapter in Section 3.9, by summarizing the work presented here.

3.3. Introduction

Any biological process model, independent of the context and our modeling philosophy, is incomplete without hard experimental, wet lab data. However there is a wide gap between the data needs of a modeler and the priorities that the current scientific socio- tends to reward in biologists. Due to this gap, it is a fairly ubiquitous experience for modelers to find that the data they ask for (e.g. protein concentrations in absolute units; amounts of proteins in molecules per cell) is almost always either unavailable, or available only qualitatively (i.e. relative data, where the numbers themselves don’t mean anything, but their distribution is meaningful relative to an independent variable or another variable), or available in a form that is a proxy for the actual data (which is sometimes several process steps removed from the data), or all of the above in plausible combinations.

3.4. Importance of cytometry data

One of cytometry’s key (and distinguishing) features is that it gives us the relative content of a biochemical over a population of cells, i.e. it gives us the biochemical distribution over a given sample in relative terms. Moreover one cytometry assay can be used to obtain this information for multiple proteins. This ability of cytometry to sample the entire state space of multiple variables simultaneously is under-appreciated and almost always ignored in published studies.

The reason for this is that despite all efforts of investigators to impose synchronicity on biology, cell populations are rarely if ever synchronous. It is a fair statement to say that they are never completely synchronous. Coupled with well-appreciated cytometric abilities of correlated variables and large data sets, we can precisely extract the time based expression profile of any measurable variable. We can then use these profiles to find numerical values of coefficients

(e.g. rate constants) to calibrate mathematical models (e.g. ODE models).

In a population of cells of the same type, each cell will have a different shape, size, volume, cell cycle completion time, etc. In this context, essentially each cell behaves uniquely albeit under certain boundary conditions.

By virtue of its capture of protein distribution information, flow cytometry data lends itself easily to the study of hematopoietic malignancies. In this context, flow cytometry data is a richer source of information than either Western blotting or ELISA data. In a Western Blot the intensity of darkness of a blot is the measure of the content of the protein in a given sample, but this does not give us information about the distribution of the protein in the sample. Similarly for ELISA, the amount of protein in the sample can be inferred from the magnitude of the fluorescence; again there is a total absence of information on protein distribution in the sample.

Hematopoietic malignancies are marked by abnormal numbers of certain cell lineages, and/or the presence of dysplastic cells (cells with abnormal features). Hence having information about a protein’s distribution in a cell population is the key here. In such a case, a biological hypothesis could be made that a certain protein is activated when a certain cell type has elevated numbers.

The protein of interest can be tagged with a fluorescent label and run through a flow cytometer, and its distribution could be gauged in the sample- thereby helping us reach a conclusion on our original hypothesis.

3.5. State of the art

3.5.1. Classification/comparison of time profile data generation methods

In biological experiments time is available either (a) explicitly or (b) implicitly

(where a need for extraction arises). a. Explicit time

Methods based on kinetic experiments

An example would be the method used by Uri Alon’s group [42], where YFP (Yellow Fluorescent Protein) is retro-virally inserted into the genome of human H1299 lung carcinoma cells and its expression is photographed at different points in time. Different variations of this technique are available in the literature, and while it gives an accurate way to obtain time profile data, it is labor intensive. Another contender for this category is the use of time-resolved fluorescence spectroscopy [98]. This technique uses pulsed lasers, and the best time extraction is dependent on the lifetime and availability of the fluorophore. This technique is more labor/resource intensive than a standard flow cytometry experiment that yields multidimensional data (without explicit time).

b. Implicit time

Methods which circumvent kinetic experiments

The only concrete example of this, to the best of our knowledge, is the method presented here, where we use heuristics based on intuition about the biological process and simple statistical laws to extract time.

3.5.2. Need for our method

The above two sections (3.4, 3.5.1) provide the rationale for the necessity of our method for time profile extraction in systems biology. In 3.2 we showed how flow cytometry gives us the protein distribution in a sample, and moreover that this can be done in a single assay for a whole host of biochemicals. This laid the foundation for our choice of data gathering method. In 3.3.1 we briefly presented the results of our literature search for methods that could extract time profiles from populations of cells; to the best of our knowledge, no comparable methods are available. Hence, in the absence of explicit time measurement, ours is the only methodology we know of that generates expression profiles.


3.6. Dynamic time profile extraction methodology

In this section, we first explain the generic data extraction methodology and then attempt to clarify it through an example. Fluorescent antibodies are intermediate reporters; the amount of fluorescence signal is proportional to the abundance of the biochemical being measured (referred to in experimental sciences as the epitope or antigen) in the cell, and therefore becomes a proxy for the quantity of the biochemical and often the larger molecular context (e.g., the whole protein containing the biochemical or even a protein complex). It is important to reiterate that in mathematical modeling of biochemical reactions, the use of methods that report relative quantities is an intermediate step. Models will be most meaningful when they are realized in absolute quantities (molecules and molarity). The work described here yields expression profiles in relative quantities.

The approach to extracting expression profiles demonstrated here was first explored by Jacobberger et al. [45]. Cyclin B1 was quantified as a function of DNA content, and the levels expressed in G1, S, and G2 were quantified. That study introduced the idea that the cyclin B1 distributions in G1 and G2, which are not resolved in time by DNA content (as S phase is), could be reduced to a phase-specific kinetic expression profile by fitting a series of Gaussian distributions based on the more uniform expression (a single Gaussian) from another, more uniform cell cycle compartment. The mitotic cells were used to obtain the narrowest log-normal distribution within the cell cycle, and the G1 and G2 phase expression profiles were obtained by plotting the centers of a Gaussian series versus average phase times for typical cells. As stated above, S phase expression was uniquely resolved as a function of DNA content. A similar approach was used and extended by Jacobberger and Frisa [90]. In that study, the frequency information within regions set across the bivariate (DNA versus cyclin B1) distributions was used as a surrogate for time spent within the expression range of each region. Using the frequency information puts the ideas presented in [45] into action. However, fitting distributions manually (e.g., in spreadsheet programs) is tedious and clumsy in practice. CytoSys and its underlying methodology combine region setting and Gaussian fitting in a semi-automated manner. While the measurements made by this approach are relative, they are correlated.

3.6.1. Generic methodology

3.6.1.1. Experimental setup

A methodology which relies on data as input is only as good as the data fed into it. Hence the experimental setup that generates such data must be designed based on what is desired in the expression profiles resulting from the methodology. We can broadly classify resultant expression profiles desired into:

a. Expression profiles that are sought only for individual use, and whose absolute values are not important.

b. Expression profiles that are sought to be used for comparison to one another, and whose absolute values are not important.

c. Expression profiles that are sought only for individual use, and where absolute values are required.

d. Expression profiles that are sought to be used for comparison to one another, and also whose absolute values are required for individual or other use.

Based on this classification, option a. requires that a multicolor (worm) flow experiment be conducted (in such an experiment, all biochemicals of interest are stained with antibodies and fluorescent tags simultaneously and then a single flow measurement is taken). Option b., on the other hand, requires single color experiments in addition to the multicolor experiment (one biochemical at a time is stained with antibodies and a fluorescent tag and measured by flow, and this is repeated for each biochemical of interest). The single color experiments allow for the comparison between expression profiles. The necessity for rescaling becomes clear once we recognize that different fluorescent tags have different emission spectra, and that the numbers relating these emission spectra to protein expression are different for each tag. Hence comparing the numbers of the time curves obtained from the multicolor data is meaningless, since the actual numbers are related to fluorescent tags with very different properties. In the single color experiments, however, the same fluorescent tag is used to mark all biochemicals of interest. Hence, once single color correction of the multicolor time curves is done, it is possible to compare the corrected (or rescaled) time curves. The rescaling is usually done according to correlation plots between multicolor data and their corresponding single color data.

Option c. requires the addition of an experiment that helps us convert the expression profile from something of qualitative value into something that has quantitative value as well. This may involve setting up a flow experiment with purified proteins [99]. Option d. clearly requires all of the above experiments to be conducted.

In addition to such experiments, some investigators may want independent confirmation of the expression profile obtained through the methodology, whether to help expand the methodology, simply to confirm it, or for some other purpose. In such cases kinetic experiments using synthetic nucleosides such as BrdU may be conducted [100].

3.6.1.2. Pre-processing

Flow cytometry data collected usually conforms to a published Flow Cytometry Standard (FCS) format. WinList 6.0 3D (Verity Software House) was used to access FCS files and preprocess the data to ensure data quality and fidelity. Preprocessing involves correction and/or filtering of measurement errors (e.g., overlapping emission wavelengths), experimental protocol peculiarities (e.g., non-specific binding, where antibodies bind to non-target proteins) and biological errors (e.g., errors due to cells sticking together) that are inherent in cell fixing, staining and the flow cytometry data gathering process. The data for each cell is collected as the cell flows through the cytometer, providing a rich source of data for each of the stained biochemicals that is visible under the various lasers of the cytometer.

Data analysis includes filtering to remove or minimize errors, and to subsequently identify different phases of the cell cycle that individual cells are in.


With the exception of DNA (which at most doubles), all biochemicals involved in the cell cycle vary manyfold, some by multiple orders of magnitude (hence the use of a log scale, where appropriate, when plotting).

3.6.1.3. Phase-specific processing

The pre-processed interphase data is binned (or gated) into a discrete number of bins by hand, using the visual clustering tool in the WinList software (Figure 3.2). Each dot here represents a cell, and cells in each bin are represented by different colors. The bins are non-overlapping, i.e. no cell belongs to two bins.

Throughout this discussion we use the terms bin, cluster, gate or region interchangeably to denote a portion of data that is circumscribed by a bin/gate (shape border with no fill) and can be treated as a collective unit. Such gating is usually done in WinList or any other software that allows this functionality. Examples are R5 and R6 (Figure 3.2).

To convert the information collected into a time profile of any protein for the average cell in the sample, the following reasoning is used:

Heuristic 3.1.

The longer a cell spends in any cell cycle phase, the more likely we are to find cells of that phase in a randomly chosen sample of the cell population [54,45,90].

This is simply a recognition of the fact that time spent in any portion of a rule-based process leaves behind very distinct clues, which can be used to extract the time.

There are certain portions of the data which have an exceedingly high amount of error (measurement error and other types of error). In the data set discussed in the example that follows, these are fortunately correlated with specific phases of the cell cycle. We have developed a method that uses the log normal nature of the data being measured as a guide to correct the data. We call the method Gaussian fitting (gfit) and this is discussed in detail in the context of an example, and in the section on our custom software CytoSys.

3.6.1.4. Postprocessing

Once the initial time curve has been obtained, if the single color processing experiments (introduced in Section 3.6.2.8) were conducted, then the postprocessing involves plotting a correlation curve between single color data and multicolor data and using this curve to scale the multicolor data. The single color data is again subjected to the same kind of processing in WinList as was done for each variable of the multicolor data. In other words the data is clustered manually in WinList, with the crucial stipulation that these clusters must be closely correlated with the multicolor data. It must be understood clearly that the single color data for a certain protein (e.g. Cyclin A2) and the corresponding multicolor data differ only in the fluorescent tag that was used to measure them, and in the random variations that are a function of the sampling, biology, etc.

Essentially it is the same protein being measured, albeit using a different fluorescent tag. Hence the investigator who clusters the data is advised to start with the single color data (since there are fewer error sources in this case, by virtue of the relative simplicity of the experiment), and having done so, to pay close attention to these clusters when clustering the multicolor data. The purpose of this closely scrutinized clustering is that the span of the single color data values used in clustering must proportionally equal the span of the multicolor data used. This may be done by taking the region information directly from WinList and working with an Excel spreadsheet, or these gates can instead be written to text files and the process can be done in CytoSys. This work is presented in Appendix 8.3.

Other heuristics include the following:

When a constant value persists throughout a phase or subphase:

We impose a larger number of time points for phases of the data where it is known that a constant value of the protein persists throughout the phase, but which have only one corresponding gate. This is done to preserve the constant shape, which may otherwise be lost post-interpolation. For this purpose, the median value of the gate is simply repeated a certain number of times for the beginning to the end of the phase.

Accounting for distribution of protein values due to cell count increase during mitosis

Additionally, it is known that after mitosis most biochemicals fall to half the value they had right before cell division. In other words, taking the example of DNA, if there are x molecules of DNA in the parent cell right before cytokinesis, then right after cytokinesis there will be x/2 molecules of DNA in each of the daughter cells. This is true for most of the time profiles we obtained, and hence we can impose the restriction that y(t0) = y(tf)/2, where t0 is the point where G1 begins, tf is the point at the end of mitosis but right before cytokinesis, and y(.) is the biochemical concentration.
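The two heuristics above can be sketched as small helper functions. This is only an illustration under stated assumptions, not the CytoSys implementation: the function names, the default of five repeated points, and the choice to report the halving constraint as a relative mismatch are all hypothetical.

```python
from statistics import median

def constant_phase_points(gate_values, t_start, t_end, n_points=5):
    """Repeat the gate median at evenly spaced times across a phase.

    Hypothetical helper: n_points (at least 2) is an arbitrary choice
    made so that the flat shape survives later interpolation.
    """
    level = median(gate_values)
    step = (t_end - t_start) / (n_points - 1)
    return [(t_start + i * step, level) for i in range(n_points)]

def division_mismatch(y_t0, y_tf):
    """Relative deviation from the restriction y(t0) = y(tf)/2.

    Returns 0.0 when the restriction holds exactly.
    """
    expected = y_tf / 2.0
    return (y_t0 - expected) / expected

# Example: a single G1 gate held constant from 0% to 31.48% of the cycle.
g1_points = constant_phase_points([2.0, 3.0, 10.0], 0.0, 31.48)
print(g1_points)
print(division_mismatch(6.0, 10.0))  # positive: start value above half the end value
```

A nonzero mismatch does not by itself say which end of the profile is wrong; it only flags that the restriction is not met.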

3.6.1.5. Replicated filtered data

Replicated filtered data is simply the interpolated time profile with random variation drawn from a log-normal distribution. The term ‘replicated filtered data’ is interchangeable with ‘simulated data’. It was understood that since many of the variables measured have log-normal distribution, the random variation would have to be drawn from the same when generating such data.

It cannot be expected to mimic the original data precisely, since it has been filtered for noise. On the other hand, if we were to employ the same filtering technique directly to the original data, we would be able to compare replicated filtered data and the real data.
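A minimal sketch of how replicated filtered data might be generated is given below. It assumes piecewise-linear interpolation of the extracted profile, cell times drawn uniformly over the cycle, and multiplicative log-normal variation with median 1; the function names, the seed, and the example profile values are hypothetical and are not taken from CytoSys.

```python
import random
from bisect import bisect_right

def interpolate(times, values, t):
    """Piecewise-linear interpolation of an extracted time profile."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    i = bisect_right(times, t)
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def replicate(times, values, n_cells, sigma_log, seed=0):
    """Draw simulated cells around the interpolated profile.

    Assumption: each cell gets a uniformly drawn cycle time and its
    expression is the profile value times a log-normal factor.
    """
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        t = rng.uniform(times[0], times[-1])
        level = interpolate(times, values, t)
        # lognormvariate(0, sigma) has median 1, so `level` stays the median.
        cells.append((t, level * rng.lognormvariate(0.0, sigma_log)))
    return cells

# Made-up profile: time in percent of the cell cycle, expression in relative units.
profile_t = [0.0, 50.0, 100.0]
profile_y = [1.0, 2.0, 4.0]
simulated = replicate(profile_t, profile_y, n_cells=200, sigma_log=0.1, seed=1)
```

Applying the same gating and fitting steps to `simulated` as to the measured data is one way to make the comparison described above.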

3.6.1.6. Testing the methodology for repeatability

The testing we refer to in this section is testing over and above any cross-checks that are already built into the experimental design. The methodology we offer here is dynamic in both method and goals. The facilities and resources that the biologist brings to the table decide in the first place how much of the methodology can be implemented. In a broader context, the steps in the methodology and how they are used depend also on the questions that we want to answer. As a specific case, suppose we are interested in the most precise expression profile and we want it in absolute units. Let us also assume that all the cross-checks built into the experimental design were accurate, and were implemented as accurately as humanly possible. At the point we receive the flow cytometry data, we cluster it in WinList. Currently this clustering is a manual process that is heuristic based. Visual heuristics are subjective, since they depend on who is doing the viewing. Hence redoing such clustering and obtaining expression profiles that fall within certain predetermined error margins is one of the ways of testing the methodology for repeatability.

One form of error reduction included in the methodology discussed here is done by fitting a sum of Gaussians to log data (Section 3.7.6.1). Needless to say, such fits are prone to human subjectivity. Despite this, in our application of this method, the results have not been drastically different between satisfactory fits. It is our experience that as long as the criteria for fit optimization are consistent between users, the fits and the ensuing expression profiles will be consistent.

3.6.2. Application to K562 cells

3.6.2.1. Experimental setup

The major relevant "wet science" methodological concerns are:

• Preparing a single cell suspension

• Fixation

• Staining for intracellular antigens (since most proteins of interest within cell cycle studies are expressed inside the cell).

There are several aspects in each of these areas to be concerned with, but the area has been reviewed thoroughly [101-104].

The analysis we discuss here utilizes data from a sample of exponentially growing K562 cells that were fixed and stained for cyclin B1, cyclin A2, phospho-S10-histone H3 and DNA content as described in [104]. Generally, analyses of this type will include:

1. A proliferating population of cells, sampled at one or more times, with or without some treatments, fixed either by formaldehyde/MeOH or formaldehyde/detergent methods [101].

2. Staining for one or more cell cycle regulated epitopes (biochemicals of interest). The focus could be on the protein, as in this case, cyclins A2 and B1, or the epitope - e.g., phospho-S10-histone H3. The samples do not have to include DNA content, although this facilitates the analysis here by isolating the three major interphase sub-phases (G1, S, and G2). The element that is necessary for a complete analysis is that the expression should be able to be followed as a closed loop without ambiguity.

3. List mode data with enough total events such that each region defining a data subset is populated with statistically significant data. In making this determination, if cells are not limiting, then each region should contain approximately 100 to 400 events (so that the coefficient of variation for accurate detection is between 5 and 10%). However, if cells are limiting, then keep in mind that statistical significance is positively affected by the number of cells in the bounding regions, and the target values of 100 - 400 cells can be substantially lower.

4. Compensation and background, non-specific antibody binding controls. For compensation controls, either antibody binding beads or cells stained singly with each antibody for the probes with spectral overlap problems can be used. Non-specific binding has to be determined independently using each antibody in indirect assays with and without the specific antibody. These results can then be mapped to the data using multiple conjugated primary antibodies in direct staining procedures. If there are cells that are essentially biologically negative within the population, these are the best non-specific binding controls.
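The event counts quoted in item 3 are consistent with simple counting statistics, where the coefficient of variation of a count of N events is about 1/sqrt(N). The sketch below works through that relation; treating the quoted 5 to 10% range as arising from this formula is our assumption, and the function names are hypothetical.

```python
import math

def counting_cv(n_events):
    """Coefficient of variation of a count, assuming Poisson statistics."""
    return 1.0 / math.sqrt(n_events)

def events_needed(target_cv):
    """Smallest whole number of events whose counting CV is at or below target_cv."""
    return math.ceil(1.0 / target_cv ** 2)

for n in (100, 400):
    print(n, "events ->", round(100 * counting_cv(n), 1), "% CV")
```

Under this assumption, 100 events per region correspond to a CV of about 10% and 400 events to about 5%, which matches the range stated in the list.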

Below, multicolor staining data analysis is first presented, followed by the single color set of experimental measurements. This analysis helps determine the time courses of the biochemicals of interest – DNA, cyclin A2, cyclin B1 and phospho-S10-histone H3.

3.6.2.2. Pre-processing

The data shown in Figure 3.1 depicts the process of filtering and correcting various known and significant measurement and experimental errors (clustering and ‘binning’, filtering, defining and correcting bias in data, etc.). Though the data collected by the cytometer is for the tags that are surrogates for the biochemicals (DNA, cyclin A2, B1, B2, and pHH3), in the description below we use the biochemical names themselves for clarity.


Fluorescence Compensation The first correction is fluorescence compensation, which accounts for the fact that the fluorescent properties of the labels used for different proteins sometimes have an overlap in their emission wavelength ranges. This is particularly significant between the fluorescent tags of pHH3 and cyclin A2 (Figure 3.1(A)). This compensation is accomplished by first defining a set of linear fits (two lines in our case – see Figure 3.1(A)) drawn through the centroid of the binned cyclin A2 vs. pHH3 data. The signal below the line for a given data point represents the spillover from pHH3 channel to cyclin A2, and hence this is deleted.

Figure 3.1. Flow Cytometry data pre-processing steps include (A-C) Fluorescence compensation (done by applying a bias) (D) Removal of doublets (done by gating out cells as shown) (E-H) Minimizing the effects of non-specific binding (done by applying a bias). These steps were done in WinList (from Verity Software), a Flow Cytometry listmode data analysis program.

Removal of Doublets/Triplets The second correction is removal of doublets and triplets (Figure 3.1(B)) (cells that stick together at any point of the cytometry measurement process). Peak (height) DNA vs. the area of DNA signal from the cytometry data is used to isolate this error. Cells sticking together show up as a larger area for the same peak signal in this plot and hence are easily excluded.

Removal of Non-Specific Binding The third major correction is removal of non-specific binding information (Figure 3.1(C)). This is done by plotting cyclin A2 vs. the side scatter signal and drawing a curve through the centroid of the lower cluster. This step is based on empirical knowledge of the particular cytometer used.

Separation of Mitotic And Interphase Cells The cells in mitosis are separated from the interphase cells for convenience by plotting pHH3 vs. DNA (Figure 3.2(A)). The mitotic cells have high pHH3 values for the same DNA value as compared to the cells in G2 and hence can be clustered and separated. A separate file is created for cells in the interphase (G1, S and G2) phases.

3.6.2.3. Phase-specific processing

The pre-processed interphase data is binned (or gated) into a discrete number of bins using the visual clustering tool by hand in WinList (Figure 3.2(D)).


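| SI # | Phase group | Phase | Bins/Gates | Number of cells in each phase | % Frequency spent in each phase | Cumulative frequency |
|---|---|---|---|---|---|---|
| 1 | Interphase (R3) | G1 | R18, R28, R29, R30, R31 | 31672 | 31.48 | 31.48 |
| 2 | Interphase (R3) | S | R7-R12 | 60984 | 60.61 | 92.08 |
| 3 | Interphase (R3) | G2 | R6 | 7048 | 7.00 | 99.09 |
| 4 | M (R2) | Prophase | R19 | 197 | 0.20 | 99.28 |
| 5 | M (R2) | Pro-Metaphase | R14, R15, R20, R21, R22 | 376 | 0.37 | 99.66 |
| 6 | M (R2) | Metaphase | R23 | 62 | 0.06 | 99.72 |
| 7 | M (R2) | Late Mitosis | R24, R16, R25, R26, R27 | 285 | 0.28 | 100.00 |
| TOTAL | | | | 100624 | 100.00 | |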

Table 3.1: Computation of percentage time that a nominal cell spends in each phase

It is clear from biology (p 29, Ch. 3, [2]) that cyclin A2 does not increase much in G1 phase while the DNA content is fairly constant. Thus, G1 phase corresponds to the cluster in bin R5 (Figure 3.2(D)). In S phase the DNA content increases along with an increase in cyclin A2, corresponding to clusters R7-R12. The cluster in bin R6 corresponds to G2 phase.

Heuristic (3.1) (see 3.6.1.3 (Generic methodology)) is applied to convert the information into a time profile for Cyclin A2, Cyclin B1, pHH3 and DNA.

Using this logic, it can be expected that the number of cells in each phase as a percentage of the total number of cells (interphase plus the mitotics) represents the percentage of time that a nominal (or typical) cell spends in that phase. In the K562 sample, ~31.48% of the cells are in G1, ~60.61% in S, ~7% in G2 and ~0.91% in M phase, which corresponds to the time spent in those phases by the nominal cell [54,45,90]. Table 3.1 gives the time (in percentage) spent by the nominal cell in each of the phases based on the data.
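The percentages in Table 3.1 follow directly from the gate counts under Heuristic 3.1. The short sketch below reproduces them; the counts are those of Table 3.1, while the function and variable names are hypothetical.

```python
# Cell counts per phase, taken from Table 3.1 (K562 data).
phase_counts = {
    "G1": 31672,
    "S": 60984,
    "G2": 7048,
    "Prophase": 197,
    "Pro-metaphase": 376,
    "Metaphase": 62,
    "Late mitosis": 285,
}

def phase_time_fractions(counts):
    """Heuristic 3.1: the share of cells found in a phase is used as the
    share of the cell cycle that a nominal cell spends in that phase."""
    total = sum(counts.values())
    percent = {phase: 100.0 * n / total for phase, n in counts.items()}
    cumulative, running = {}, 0
    for phase, n in counts.items():  # dicts keep insertion (cell cycle) order
        running += n
        cumulative[phase] = 100.0 * running / total
    return percent, cumulative

percent, cumulative = phase_time_fractions(phase_counts)
for phase in phase_counts:
    print(f"{phase:14s} {percent[phase]:6.2f} {cumulative[phase]:7.2f}")
```

The cumulative column gives the position of each phase boundary on the time axis (in percent of the cell cycle), which is where the phase-wise time curves are joined.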

Figure 3.2. (A) Separation of cells in Interphase (R3) and mitosis (R2) (B) Gating of mitotics into prophase (R19 (oval); a blue dot marks the mean of the prophase data), pro-metaphase (horizontal span except for R19 and R24), metaphase (R24) and late mitosis (vertical span except for R24) (C) Gaussian fit to prophase of Cyclin A2 histogram demonstrating characteristics of prophase data (D) Gating of cells in Interphase (a blue dot marks the mean value of cells about to exit S and enter G2) (E) Gaussian fits to Cyclin A2 data for G2 cells (the left blue bar corresponds to the blue dot in (D) and the right blue dot corresponds to the blue dot in (B)) (F) Time profile (or Frequency profile) of Cyclin A2 obtained as a result

3.6.2.4. G1 and S Phase Time Course

Computing the median cyclin A2 value in bin R5 gives the constant value of cyclin A2 in G1 phase (notice that the scale is log). On the other hand, the S-phase cells exhibit simultaneous changes in both cyclin A2 and DNA. Typically, cyclin A2 rises much faster than DNA. To find the intermediate points of cyclin variation along the time axis in S phase, unique, non-overlapping bins are used along the direction of variation of the data distribution in the S-phase (see the unlabeled bins in Figure 3.2(D)), and the median value of cyclin A2 in each bin is plotted against the incremental time points as determined by the number of cells in the bin. Figure 3.4 shows the time profiles as they are generated phase-by-phase, with protein expression on the y-axis (shown for cyclin A2 only, to demonstrate the methodology), and time (cell cycle time as a percentage) on the x-axis.
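The S-phase construction described above can be sketched as follows. The sketch assumes that the bins are supplied in order of progression through S phase and that each median is plotted at the end of its bin's time increment; that placement, the function name, and the example values are hypothetical choices, since the text does not fix them.

```python
from statistics import median

def bins_to_time_course(bins, t_start, total_cells):
    """Turn ordered gates into (time, median expression) points.

    bins: list of lists, one list of per-cell expression values per gate,
          ordered along the direction of progression through the phase.
    t_start: time (percent of cell cycle) at which the phase begins.
    total_cells: number of cells in the whole sample (all phases).
    """
    points, elapsed = [], t_start
    for values in bins:
        # Heuristic 3.1: the time increment is proportional to the gate's cell count.
        elapsed += 100.0 * len(values) / total_cells
        points.append((elapsed, median(values)))
    return points

# Made-up example: three gates in a 20-cell sample, phase starting at 50%.
example_bins = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
course = bins_to_time_course(example_bins, t_start=50.0, total_cells=20)
print(course)
```

Plotting each median at the midpoint of its increment instead would shift every point earlier by half a bin width; either convention keeps the ordering and spacing implied by the cell counts.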

3.6.2.5. G2 Phase Time Course

The last gate in Figure 3.2(D) (bin R6) contains cells with most of their variation along the cyclin A2 axis. Since the variation in DNA for these cells has halted, they are cells in the gap phase G2, and they are also small in number (as compared to G1 and S). G2 cells not only have biological variance in their cyclin values (cyclin A2 in Figure 3.2(D)), but also the inevitable error due to noise in the measurement process. There is no direct way to distinguish between these two variations. One way of resolving the problem would be to obtain statistical measures of the measurement error in the data and subtract this from the measured signal [54,45,90]. The subset of the data that has zero biological variance is the set of cells that are in the prophase of the cell cycle (Figure 3.2(B)). Hence any deviation in the distribution of these cells would be solely due to the measurement error.

The variance of signal from the prophase measurements can be estimated assuming a log-normal distribution. The standard deviation computed for the prophase cell distribution (σ_pro) provides a basis for estimating the measurement error in G2 using the logic described before (also see [54,45,90]). Essentially the characteristics of measurement error should be the same in all phases.
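Estimating the prophase standard deviation under the log-normal assumption amounts to taking the standard deviation of the log-transformed values. The sketch below uses base-10 logarithms, which is an assumption made here for illustration; the function name and the example values are hypothetical.

```python
import math
from statistics import mean, stdev

def log_normal_parameters(values):
    """Mean and standard deviation of log10(values).

    Needs at least two strictly positive values. The returned standard
    deviation plays the role of the prophase measurement-error estimate
    when `values` are the fluorescence readings of the prophase cells.
    """
    logs = [math.log10(v) for v in values]
    return mean(logs), stdev(logs)

# Made-up readings spanning two decades.
mu_log, sigma_log = log_normal_parameters([10.0, 100.0, 1000.0])
print(mu_log, sigma_log)
```

Whichever logarithm base is chosen, the same base has to be used for the G2 histogram that is fitted afterwards, so that the two standard deviations are comparable.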

Figure 3.3. Generation of the Cyclin B1 time curve. (A) G1 (R5) and G2 (R6) data clusters are shown for Cyclin A2 vs. DNA (B) The G2 only cells Cyclin B1 vs. DNA distribution is shown (C) Gaussian fits to the Cyclin B1 data (log) in G2 (D) G1 only cells Cyclin B1 vs. DNA distribution (E) Gaussian fits to Cyclin B1 data (log) in G1B (the upper 2 gates in (D) represent the second subphase in G1, i.e. G1B) (F) Resultant Cyclin B1 time profile.

To find the signal in the G2 phase using σ_pro as a basis, a constrained nonlinear optimization problem is framed to minimize the mean squared error between the data histogram of cyclin A2 (in log scale) in Figure 3.2(E), and the sum of multiple successive weighted Gaussian fits that a user imposes. The standard deviation of the Gaussian curves used for the best fit is constrained to be “approximately” equal to the standard deviation of the prophase cells (i.e., σ ≈ σ_pro) in order to filter the signal and thus to remove the measurement error. The process is started by fitting a single Gaussian ‘best fit’ to the data and computing the error. The mean and variance are found by solving the nonlinear constrained optimization problem. Next, two Gaussian curves are defined that overlap one another, while ensuring that neither curve is completely contained in the other. A best fit is then found by manipulating the mean and variance of both the curves simultaneously so as to minimize the mean squared error between the data histogram and the sum of the multiple Gaussian fits. The error can be computed successively in this way for multiple (three, four, etc.) Gaussian curves. The number of Gaussian curves is consecutively increased and the error found until one obtains the best possible fit or until one is satisfied with the errors of the fit. The user then chooses the solution with the minimum error and/or the best “visual fit.”

Figure 3.4. The gates used and the corresponding section of the Cyclin A2 time curve (highlighted) are shown phase-wise for (A,E) G1, (B,F) S, (C,G) G2 (time curve shown is the result of the Gaussian fits in Figure 3.2(E)) and (D,H) M

While the mean values of cyclin A2 provide the y-coordinate, the weights for each curve are used to compute the incremental time point along the cyclin A2 time course trajectory. The same procedure can be seen for cyclin B1 data in Figure 3.3.
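To make the role of the weights concrete, the following is a deliberately simplified sketch and not the fmincon-based routine used in this work. It assumes that every Gaussian has its standard deviation fixed exactly at the prophase value and that the user supplies the means, so only the non-negative weights are fitted, here with a standard multiplicative-update scheme for non-negative least squares. All names and the synthetic histogram are hypothetical.

```python
import math

def gaussian(x, mean, sigma):
    """Normal density evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_weights(xs, hist, means, sigma, iterations=200):
    """Non-negative least-squares weights via multiplicative updates."""
    basis = [[gaussian(x, m, sigma) for x in xs] for m in means]
    k = len(means)
    gram = [[sum(a * b for a, b in zip(basis[i], basis[j])) for j in range(k)]
            for i in range(k)]
    target = [sum(a * y for a, y in zip(basis[i], hist)) for i in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iterations):
        denom = [sum(gram[i][j] * weights[j] for j in range(k)) for i in range(k)]
        weights = [w * t / d if d > 0 else 0.0
                   for w, t, d in zip(weights, target, denom)]
    return weights

def mean_squared_error(xs, hist, means, sigma, weights):
    """Mean squared error between the histogram and the weighted Gaussian sum."""
    total = 0.0
    for x, y in zip(xs, hist):
        model = sum(w * gaussian(x, m, sigma) for w, m in zip(weights, means))
        total += (y - model) ** 2
    return total / len(xs)

def time_increments(weights, phase_percent):
    """Split the phase duration among the Gaussians in proportion to their weights."""
    total = sum(weights)
    return [phase_percent * w / total for w in weights]

# Synthetic log-scale histogram built from two known components.
sigma_pro = 0.1
xs = [0.5 + 0.01 * i for i in range(151)]
means = [1.0, 1.5]
hist = [0.7 * gaussian(x, 1.0, sigma_pro) + 0.3 * gaussian(x, 1.5, sigma_pro) for x in xs]

weights = fit_weights(xs, hist, means, sigma_pro)
mse = mean_squared_error(xs, hist, means, sigma_pro, weights)
increments = time_increments(weights, phase_percent=7.0)  # G2 share from Table 3.1
```

In this sketch each mean supplies the expression value of a point on the time curve, and the corresponding entry of `increments` supplies the time step to the next point. Fitting the means and letting the standard deviation vary around the prophase value, as described in the text, requires a general constrained optimizer.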

The procedure to eliminate the measurement error in G2 based on prophase data statistics was initially performed by hand in an MS Excel spreadsheet, using ‘trial and error’ and looking for the best “visual” fit [45]. To automate this process, the procedure was implemented in MATLAB (using the fmincon routine). The MATLAB routine we wrote automates the entire process, saving researchers considerable time.

The results shown in Figures 3.2, 3.3, 3.4, 3.5 and 3.6 were generated using CytoSys. CytoSys is specialized, custom-made software to convert statically sampled multi-parametric cytometry data into time profiles using heuristics designed by an experienced biologist. CytoSys has been designed for users who are knowledgeable in the analysis of cytometry data [54].

3.6.2.6. M Phase Time Course

The cyclin values of the cells in mitosis (M-phase) are especially sensitive to variations in the clustering procedures used, since in any given cell population typically only a small number of cells are in mitosis. Hence excluding a small number of cells from a mitotic cluster will change the details of the mitotic time curve. Figure 3.5 shows a detailed view of the time curves for late G2 and mitosis, revealing features that are not apparent in the full time curves (Figure 3.4). When the clustering was repeated on different occasions and the resultant time curves were plotted, a majority of the time curves were in agreement. Figure 3.6 shows more such features for repeated application of the methodology to the K562 data (overall results not presented in this thesis), where some of the features are artifacts of the procedure used for gating. This is shown to illustrate the importance of clustering or gating data properly. Many of the sudden peaks in Figure 3.6 occur at the G2/M intersection, where one must pay special attention to how one clusters the data. More careful clustering has avoided such peaks in Figure 3.5.


Figure 3.5. Features of scaled cyclin A2 and B1 time profiles in late G2 and M phases.


Figure 3.6. Comparison of the change in late G2/M features in different applications of the time profile extraction methodology to the K562 data (A,B) before single color correction and (C,D) after single color correction.


3.6.3. Postprocessing

3.6.3.1. Single color correction

The correlation curve between a biochemical in the multicolor experiment and its single color counterpart is used for scaling the multicolor time curves (Appendix 8.3). The multicolor experiment measures the expression of this biochemical using one fluorescent tag, and the single color experiment measures the same expression using either the same or a different fluorescent tag. This correlation curve therefore gives us the relationship between the fluorescent tags used (since the protein expression in both experiments is the same). Cyclin A2 is scaled to 17.89% of its initial multicolor value, and Cyclin B1 is scaled to 27.78% of its initial multicolor value. Hence this gives us the ratio of fluorescence for the tags used (Cyclin A2/Cyclin B1 ~= 1.55). We have made it a practice to choose data from cells that are going through the S phase of the cell cycle, but which are well within what would be considered the central portion of S phase. Our reason for doing this is that it is at the intersection of successive phases that we observe changes in the data characteristics. While performing this correlation, we wanted to choose data that uniformly had the same dominant characteristics (a condition that was fulfilled by the central portion of S phase).

In the multicolor experiment we conducted, the fluorescent tag that serves as the proxy for cyclin A2 is PE (phycoerythrin) and the fluorescent tag that serves as the proxy for cyclin B1 is A647 (Alexa 647). So the measures that we obtain of the respective cyclins are in terms of units that are functions of these fluorescent tags. Hence, for convenience, let us adopt the convention of saying that cyclin A2 is measured in PE units and cyclin B1 is measured in A647 units. In the corresponding single color measurement experiments, both cyclins are measured using a fluorescent chemical called FITC (fluorescein isothiocyanate). Hence, using the convention adopted for the multicolor data, we can say they are measured in terms of FITC units. The scaling illustrated in Figure 3.7 thus shows the conversion of cyclin A2 from PE units to FITC units, and of cyclin B1 from A647 units to FITC units.

Figure 3.7. Single color correction for Cyclin A2 ((A) pre-scaling and (C) post-scaling) and Cyclin B1 ((B) pre-scaling and (D) post-scaling). The equations that were used for scaling are also given, and their derivation is shown in Appendix 8.3.

Figure 3.8 All expression profiles (Only cyclins A2 and B1 are truly comparable here. Simple visual scaling was done for the other expression profiles for display purposes.)

3.6.3.2. Practical issues

Several issues in the data extraction methodology made themselves apparent during the scaling process. These issues have their roots in the noise inherent in present day flow cytometry measurement techniques, and the inevitable ceiling on precision that accompanies manual/automated analyses (which perhaps occurs at a lower threshold for the manual case). These issues are discussed next.

The first issue faced was how we should determine the value of each time curve at the zero point (when time t=0). Biological knowledge was the only way to proceed in this case. In the case of cyclin A2, it was known that there is no change in the amount of cyclin A2 in the G1 phase. This meant that the value of cyclin A2 at its zero point must equal the value of cyclin A2 at other points in G1.

As previously done in our analysis, we took the median cyclin A2 value of the entire cluster of cells in G1 as a representative of G1 cyclin activity. We also imposed this median value on the zero point. Such an imposition created another problem. When a cell enters G1, the amount of its internal content must equal half the amount of its parent cell. Hence the statistical value of cyclin A2 of a cell at the zero point must equal half the value of cyclin A2 of the median cell completing mitosis. We found that this value was not the same as the median G1 cyclin A2 value. This problem has yet to be solved; however, repeated attempts have convinced us that this difference is due to measurement limitations and manual error introduced in our analyses.

The second issue faced had to do with single color correction. When taking data from the central portion of S phase for developing the single-multi color correlation curve, we obtain a negative intercept in the case of cyclin A2 (0.0559*x - 23.574, where x represents the unscaled multicolor cyclin A2 data). If we applied this scaling equation directly to the data, we would end up with negative values. To circumvent this issue, we had the following options, both of which have their own limitations:

• Use the correlation equation as obtained, and then translate the entire scaled curve so that its minimum value is non-negative.

    o This was the more straightforward approach, but the potential issue here was that it made the comparison between the cyclins problematic. The minimum and maximum values of unscaled cyclin A2 are (51.33, 8922.8) (PE units). When scaling with the equation (0.0559*x - 23.574), these values become (-20.7047, 475.2105) (FITC units). Making this non-negative makes these values (0, 495.9152) (altered FITC units).

• Use the correlation equation without its negative intercept. Again, comparing cyclins was made problematic here.

    o This approach got rid of the need to translate the scaled curve since it did not generate negative values. The min/max values of the scaled time curve here are (2.8693, 498.7845) (altered FITC units).
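As an illustration only, the two options can be sketched in Python as follows. The function names are hypothetical and are not part of CytoSys; the slope, intercept and unscaled minimum/maximum are the values quoted above.

```python
# Sketch of the two single color scaling options for cyclin A2.
SLOPE = 0.0559       # slope of the correlation line (value quoted in the text)
INTERCEPT = -23.574  # negative intercept of the correlation line (from the text)


def scale_then_translate(values):
    """Option 1: apply the full equation, then shift so the minimum is zero."""
    scaled = [SLOPE * v + INTERCEPT for v in values]
    shift = min(0.0, min(scaled))  # shift only when the minimum is negative
    return [s - shift for s in scaled]


def scale_without_intercept(values):
    """Option 2: apply the slope only (non-negative for non-negative input)."""
    return [SLOPE * v for v in values]


unscaled_min_max = [51.33, 8922.8]  # PE units, from the text
option1 = scale_then_translate(unscaled_min_max)
option2 = scale_without_intercept(unscaled_min_max)
print([round(v, 4) for v in option1])
print([round(v, 4) for v in option2])
```

Both options keep the same slope, so the span between minimum and maximum is identical; they differ only in where the curve sits relative to zero.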

We tested both methods in our methodology. While the results presented in this thesis utilize the second approach, there is no proper solution to this problem. The only perceived strength of the second approach was that it preserved the relative magnitudes of the time curve with respect to zero. Since the cyclin B1 scaled time curve has min/max values of (34.5588, 465.702), this would make comparing the minimum values more precise. Comparison of the maximum values would lose more precision, but this was a tradeoff we decided to accept for the moment.

3.6.3.3. Testing methodology for reproducibility

We re-applied our methodology to our dataset multiple times, to see how much variation in the time curves resulted. As Figure 3.9 indicates, the resultant dynamic time profiles preserve the key characteristics, and the visual variation is well within the tolerable range.

Figure 3.9. Comparison of analysis of the same data set done using three independent sets of gates (manual clustering repeated three times). (A,B) show unscaled data, and (C,D) show data after single color scaling.

Remark 3.1. The 0 point in A2 is usually back calculated. The reason for this is as follows. When we plot out the first time point, it is plotted at a time corresponding to the relative number of data points in the corresponding gate (region). This means that we don’t have a definite value for the 0 point. Moreover, due to flow cytometry measurement issues, we cannot resolve the data finely enough to be able to vouch for the veracity of fine data slices carved close to the lower value ranges.

The standard deviation between the runs is depicted in Figure 3.10(A). The standard deviation of the Cyclins in late G2 and mitosis is shown in Figure 3.10(B). The behavior seen is not an artifact of gating since multiple, independent sets of gates gave similar results. Significant changes over a short span of time are expected biologically in late G2 and M, and the replicability of the overall time curves can be only as good as the replicability for this portion of them, as indicated by Figure 3.10.

[Figure 3.10 plots: (A) standard deviation (STD) of cyclin A2 and B1 protein expression vs. FrequencyTime over 0 to 100; (B) the same over FrequencyTime 98 to 100.]

Figure 3.10. Standard deviation between multiple gating attempts. Note that the zero point value here for cyclin A2 is zero. It would have been flat in G1 had we assigned the median value of G1.

The standard deviation time profiles of the cyclins A2 and B1 are shown in Figure 3.11.

Figure 3.11. Standard deviation time profiles of cyclins A2 and B1.

Here we notice a sharp increase in mitosis, whereas G2 is relatively uneventful.

What is interesting, and expected, is that the standard deviation of the data is the lowest in S phase. In fact, there is a sharp dip at the G1/S transition. This data appears, based solely on its standard deviation profile, to offer the best tradeoff between measurement error and biological variance. Further investigation is required to confirm this statement.

3.6.3.4. Testing methodology on MOLT4 cell line data

All of the work presented previously included application of our methodology to cell cycle and control biochemicals measured from samples of the K562 cell line. We wanted to test how generic our methodology was, by applying it to data from a completely different cell line. Hence we measured the same cell cycle and control biochemicals for a sample of MOLT4 cells (human acute lymphoblastic leukemia cell line). Preliminary results for cyclin A2 are presented in Figure 3.12.

[Figure 3.12, panels A to F. Axis labels recoverable from the plots: DNA, cyclin A2, cyclin B1 and frequency (panels A to C); cyclin A2 vs. time (panels D to F).]

Figure 3.12. Application of extraction methodology to the MOLT4 cell line data.

In Figure 3.12(A), the segmentation of Interphase into regions is shown. The same procedure of fitting a series of Gaussians to the G2 portion of the data (the oval gate in the upper right-hand corner of 3.12(A)) is applied in 3.12(B). In 3.12(C) we see the segmentation of the mitotics into regions, as discussed for the K562 work. In 3.12(D) we see the time profile thus extracted, and its interpolated version with 5000 time points can be seen in 3.12(E). Figure 3.12(F) shows us the scaled version of the time curve from 3.12(E).
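The interpolation step mentioned above can be sketched in Python as below. This is a minimal illustration, not the CytoSys implementation: the function name is hypothetical, the sample times are assumed to be strictly increasing, and the output grid is assumed to be equally spaced.

```python
def interpolate_profile(times, values, n_points):
    """Linearly resample a time profile onto n_points equally spaced times."""
    if len(times) != len(values) or len(times) < 2 or n_points < 2:
        raise ValueError("need at least two samples and two output points")
    t0, t1 = times[0], times[-1]
    out_t, out_v = [], []
    k = 0  # index of the segment [times[k], times[k + 1]] containing t
    for i in range(n_points):
        t = t0 + (t1 - t0) * i / (n_points - 1)
        while k < len(times) - 2 and t > times[k + 1]:
            k += 1
        frac = (t - times[k]) / (times[k + 1] - times[k])
        out_t.append(t)
        out_v.append(values[k] + frac * (values[k + 1] - values[k]))
    return out_t, out_v


# A three-point profile resampled to 5000 time points, as in Figure 3.12(E).
dense_t, dense_v = interpolate_profile([0.0, 50.0, 100.0], [10.0, 20.0, 60.0], 5000)
```

Linear interpolation adds no information beyond the extracted points; it only provides a dense, evenly sampled curve for later scaling and plotting.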

Note that when the analysis presented in Figure 3.12 was done, we assigned the value of cyclin A2 at time zero as zero. However, based on our current understanding, this value should instead equal the median G1 value of cyclin A2, since cyclin A2 concentration typically should not change in the G1 phase. The back calculation is discussed in Remark 3.1 earlier in this chapter.

3.6.3.5. Reproduced filtered data

Our exploration of data variation is documented here. What we refer to as ‘reproduced filtered data’ in the section heading is an attempt to recreate flow cytometry-like data by adding random variation to the extracted median time curve. This random variation is imposed in several different ways, and the results are compared to one another and, more importantly, to the real data.

Ignoring all the data variance means forgoing a rich source of information about cell-cell heterogeneity. In extracting a time profile from cytometry data we achieved the goal of obtaining time profile data, but we collapsed all of this variance into single values along the time curve. We therefore decided to investigate how we could re-insert such variance.

During the extraction process we did not simply take a region-by-region median value, but also performed a specific kind of data filtering (Gaussian noise reduction). We must therefore account for the effects of such filtering when generating synthetic data. In the case of the Cyclin A2 data presented, the G2 section of the data was filtered.

Remark 3.2. We have redefined CV as Standard Deviation/Median for our analysis. Such a definition is also utilized in the statistical software STATA. Our reason for such a definition is that we are looking for variation about the median instead of the mean.

One method attempted was to extract the coefficient of variation (CV) of the G1 data, and the CV of the Prophase data, and use the average of these values in generating a log-normal distribution from which random values are picked to create the ‘synthetic data’. This approach is obviously flawed, since such a CV would be too low for the G1 data (which has relatively high variance) and too high for the Prophase data (which has relatively low variance). Our results reflected this (not shown). Moreover, there is no practical solution offered in such an approach for the fact that a certain part of the data was filtered.

Another method we attempted was to create a linear combination of the G1 and Prophase CVs, and use this for ‘synthetic data’ generation (linearly blended CV, Figure 3.13). While this captures the data spread admirably, the trail of mitotics (lower right of figure) was clumped into distinct populations. Moreover, the intersection of S and G2 is sparser than in the original data. More investigation is needed to interpret this result.
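A minimal Python sketch of the linearly blended CV idea is given below. It is an illustration under stated assumptions rather than the procedure used for Figure 3.13: the function names are hypothetical, the blend is assumed to run linearly over normalized position along the time curve (from the G1 CV at the start to the Prophase CV at the end), and CV is taken as standard deviation/median as in Remark 3.2. For a log-normal distribution with log-space standard deviation s, the median is exp(mu) and SD/median equals sqrt(u(u - 1)) with u = exp(s^2), which is inverted in `lognormal_sigma`.

```python
import math
import random


def lognormal_sigma(cv):
    """Log-space sigma that gives SD/median == cv for a log-normal variable."""
    u = (1.0 + math.sqrt(1.0 + 4.0 * cv * cv)) / 2.0  # u = exp(sigma^2)
    return math.sqrt(math.log(u))


def blended_cv(t, cv_start, cv_end):
    """Linear blend of two CVs over normalized time t in [0, 1]."""
    return (1.0 - t) * cv_start + t * cv_end


def synthesize(median_curve, cv_start, cv_end, cells_per_point, seed=0):
    """Draw log-normal values around each point of the median time curve."""
    rng = random.Random(seed)
    n = len(median_curve)
    samples = []
    for k, median in enumerate(median_curve):
        t = k / (n - 1) if n > 1 else 0.0
        sigma = lognormal_sigma(blended_cv(t, cv_start, cv_end))
        mu = math.log(median)  # the median of a log-normal is exp(mu)
        samples.append([rng.lognormvariate(mu, sigma) for _ in range(cells_per_point)])
    return samples
```

The CV values passed in would come from the gated G1 and Prophase data; no particular values are implied here.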

Figure 3.13. (A) Original data. (B) Reproduced data (Linearly blended CV)

Another method we explored to mimic the real cytometry data was to take the CV from each of the gates (regions) that were formed in WinList, and then use this information to generate CV as a function of time. Preliminary results are shown in Figure 3.14.

Figure 3.14. Comparison of real (A,B) and reproduced filtered data (C,D) where the cyclins are in log scale. Random variation added to time profiles shown in Figures 3.2, 3.3. Both original and reproduced data have 100624 points.

3.6.3.6. Theoretical formulation of data variation

We can theoretically view multi-variable flow cytometry data in terms of a median cell (a cell with characteristics of the data median). This median cell can be used to express the variables pertaining to each of the other cells (measured data) as:

Measured data = Median + Inherent biological variation + Measurement noise.

Adopting the notation:

$x$ = measured data
$x_m$ = median data
$x_b$ = biological variation
$x_{mn}$ = measurement noise

We get:

$$x = x_m + x_b + x_{mn} \tag{3.1}$$

Now in attempting to replicate/reproduce the measured data using the extracted data median ($x_m$), we must add back the biological variation ($x_b$) and the measurement noise ($x_{mn}$). Our dynamic time profile extraction methodology includes a method that reduces, and ideally removes, measurement noise.

Adding back the measurement noise, however, defeats the purpose of the methodology. Hence we instead must aim to reproduce the filtered data (i.e. $x - x_{mn}$, or equivalently $x_m + x_b$). This filtered data cannot be compared to the measured data. It must instead be compared to the measured data minus the noise. This would require us to design a methodology to subtract noise from the measured data directly (separate from our existing methodology).

For the present, however, we try and compare reproduced measured data (x) by making the following assumptions:

1. G1 cyclin data has zero measurement noise.

2. Prophase cyclin data has zero biological variation.

3. The data that lies in between G1 and Prophase has a linear combination of measurement noise and biological variation.

4. The post-Prophase data (Prometa-, Meta-, Ana-, Telo-) is equivalent to Prophase data in having no noise.

Admittedly some of these assumptions are only, at best, approximately true. However, this can be considered a first step in validating our methodology. We define these assumptions in mathematical terms below.

We get:

$$x = x_m + x_b + x_{mn} \tag{3.1}$$

Let the coefficient of variation (SD/mean) of the biological variation be $c_{vb}$.

Let the coefficient of variation (SD/mean) of the measurement noise be $c_{vmn}$.

Assumption 1 implies that $c_{vmn\_G1}$ = $c_{vmn}$ of the G1 cells = 0.

Assumption 2 implies that $c_{vb\_Pro}$ = $c_{vb}$ of the Prophase cells = 0.

Hence the cyclin A2 value for the $i$th cell is given by:

$$x_i = x_{mi} + x_{bi} + x_{mni} \tag{3.2}$$

Using (2) we get

$$x_{bi} = c_{vb}\, x_{mi} \tag{3.3}$$

Using (3) we get

$$x_{mni} = c_{vmn}\, x_{mi} \tag{3.4}$$

Inserting (3.3) and (3.4) into (3.2) we get:

$$x_i = x_{mi}\,(1 + c_{vb} + c_{vmn}) \tag{3.5}$$

Now equation (3.5) has been written for the $i$th cell, and it holds for every cell in the population of cells under consideration. If there are $N$ cells, and we hold the mass assumption to be true (the population of cells is treated as a homogeneous, well mixed solution of volume $V$), then we get:

$$\sum_{i=1}^{N} \frac{x_i}{V} = \sum_{i=1}^{N} \frac{x_{mi}}{V}\,(1 + c_{vb} + c_{vmn}) \tag{3.6}$$

Clearly the left hand side is the concentration of the biochemical of interest (in our example it is cyclin A2).

$$[x] = (1 + c_{vb} + c_{vmn})\, \frac{\sum_{i=1}^{N} x_{mi}}{V} \tag{3.7}$$

If we define the median concentration as $[x_m] = \frac{\sum_{i=1}^{N} x_{mi}}{V}$, then equation (3.7) becomes:

$$[x] = (1 + c_{vb} + c_{vmn})\,[x_m] \tag{3.8}$$

This gives us the (approximate) relationship between the measured concentration of the biochemical ($[x]$), and the median value extracted through our methodology ($[x_m]$).

3.7. CytoSys – a software for time profile extraction

3.7.1. Introduction

CytoSys is specialized, custom-made software that converts statically sampled multi-parametric cytometry data into time profiles using heuristics designed by an experienced biologist. CytoSys has been designed for users who are knowledgeable in the analysis of cytometry data. Specifically, CytoSys expects data that has been pre-processed (see Section 3.7.2). In addition, CytoSys expects that the data is already binned (see Section 3.7.3).

The data flow diagram in Figure 3.15 places CytoSys in context. The processes that take place in each of the blocks indicated are:

(iii) Data capture using the Flow Cytometer (FCS, flow cytometry standard, data file). Flow cytometers (commercial and otherwise) allow users to save their data as multi-variable FCS files.

(iv) Data conditioning such as binning, fluorescence compensation, doublet discrimination, and removal of non-specific binding, using software such as WinList™ (from Verity Software).

(v) Processing statically sampled flow cytometry data to generate time profiles of biochemicals in the context of cell cycle using our software CytoSys. CytoSys has been designed to take in binned text files from WinList™.

[Figure 3.15 diagram labels: Flow Cytometer, FCS file, WinList, Text file, Text/FCS file.]

Figure 3.15. Data processing diagram

An overview of the different stages of data as it flows through CytoSys is shown in Figure 3.16.

[Figure 3.16 diagram labels: Flow cytometry “static” data; Dynamic time profile; Linearly interpolated time profile with number of time points determined by the user; Simulated data.]

Figure 3.16. Data stages in CytoSys.

3.7.2. Data Input

The data fed into CytoSys consists of tab-delimited text files. CytoSys imports these files as Matlab matrices and appends region, subphase and phase definitions (explained in Processing Protocol) as additional columns to each matrix.

3.7.3. Data Structure

In the cell cycle context, the part of the cell cycle that the cytometry data belongs to plays a key role in deciding what processing is needed. For instance, cells that are in the G1 phase of the cell cycle have a more or less constant Cyclin A2 value, but due to the noise that is typical for the lower data ranges that Cyclin A2 in G1 occupies, we observe a wide data distribution. Visually this distribution width is further accentuated by the log scale used in standard flow cytometry software. For such data, the ideal way would be to assign the median value of the entire data distribution to generate the corresponding portion of the time curve. In sharp contrast, data that immediately succeeds it must be processed region by region (defined in processing protocol).

To accommodate a wide variety of processing alternatives, it was decided that the most atomic data subset to be processed would be a region. Accordingly, we implemented the data structure shown in Figure 3.17 for convenient processing at the phase, subphase (defined in processing protocol) and region levels.

[Figure 3.17 diagram. Levels: Phase (G1), processed as a single consolidated entity; Subphase (G11, G12, G13); Region (R22, R23, R24, R25, R26), processed at a more atomic level and fed in from WinList. Processing options shown at each level: Consolidated, Gfit (Gaussian fit to data for minimizing the contribution of measurement/other errors), Additional.]

Figure 3.17. CytoSys data processing protocol.

3.7.4. Processing protocol

We describe the protocol for processing data fed into CytoSys using Figure 3.17. Since the whole analysis is in the cell cycle context, the data is first broadly classified phase-wise. Further classification consists of:

1. The classification of each phase into constituent subphases (in Figure 3.17 the phase G1 is composed of the subphases G11, G12 and G13).

2. The classification of each subphase into constituent regions (in Figure 3.17 subphase G11 is composed of the regions R22 and R23).

3. Each region is the data that is manually enclosed in a ‘gate’ or a box in WinList due to contiguity. The user is expected to know which regions constitute a subphase, and which subphases constitute a phase. Moreover, the most atomic unit of data in CytoSys is the region, as far as processing is concerned. So region definition should be done very carefully in the first place.
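The phase, subphase and region hierarchy can be pictured with the following Python sketch. It is purely illustrative: the dictionary layout and function names are hypothetical and do not reflect the actual CytoSys data structures or the format of its definition files. Only G1 containing G11, G12, G13 and G11 containing R22, R23 come from the Figure 3.17 example; the assignment of R24 to R26 is invented for the illustration.

```python
# Hypothetical in-memory form of the phase -> subphase -> region hierarchy.
HIERARCHY = {
    "G1": {
        "G11": ["R22", "R23"],   # from the Figure 3.17 example
        "G12": ["R24", "R25"],   # assumed for illustration
        "G13": ["R26"],          # assumed for illustration
    },
}


def label_region(region, hierarchy=HIERARCHY):
    """Return (phase, subphase) for a region name, or raise KeyError."""
    for phase, subphases in hierarchy.items():
        for subphase, regions in subphases.items():
            if region in regions:
                return phase, subphase
    raise KeyError(f"region {region!r} is not assigned to any subphase")


def append_labels(rows, region, hierarchy=HIERARCHY):
    """Append region, subphase and phase as extra columns to each data row."""
    phase, subphase = label_region(region, hierarchy)
    return [list(row) + [region, subphase, phase] for row in rows]
```

Keeping the region as the smallest unit means any processing choice (consolidated median, Gaussian fit, or region-by-region handling) can be selected by walking this hierarchy.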

3.7.5. File Structure in CytoSys

To facilitate this processing protocol, we implemented the file structure shown in Figure 3.18. Each data directory has five subdirectories: MULTI, SINGLE, RESULTS, BACKUP and WORK. The latter three directories are automatically created during CytoSys execution if not already there. MULTI contains the multicolor data as tab-delimited text files. It is expected that the user names each text file after the gate that was used to bin the data in WinList. For example, if the gate R22 was used to section off a certain subset of the data, then the corresponding text file should be named ‘R22.txt’. This naming convention must be closely adhered to.

[Figure 3.18 diagram. Data directory containing Dataset 1 and Dataset 2; each dataset contains MULTI, SINGLE, RESULTS, BACKUP and WORK, with a variable list and a phase definition file; SINGLE contains subfolders SINGLE 1, SINGLE 2, SINGLE 3 (Dataset 1) and SINGLE 1, SINGLE 2 (Dataset 2), each with its own variable list and phase definition. Variable list and phase definition files are not shown for Dataset 2 to avoid crowding.]

Figure 3.18. Folder structure in CytoSys.

The folder MULTI also contains, as shown in Figure 3.18, two text files: ‘Variable_list.txt’ and ‘phase_definition.txt’. ‘Variable_list.txt’ contains a list of all the multicolor variables. ‘phase_definition.txt’ contains the data structure shown in Figure 3.17 for this specific data set. More information on this is available in Appendix 8.4.

3.7.6. Salient Features

3.7.6.1. Gaussian fits to data

We now explore noise reduction in portions of the data using Gaussian curve fitting. For example, the data distribution of cells in G2 phase has both biological variation of Cyclin content and the inevitable measurement error. There is no direct way to distinguish the two. One way of resolving the problem would be to obtain statistical measures of the measurement error in the data. The subset of the data that has zero biological variability is the set of cells that are in the prophase of the cell cycle. Hence the data distribution of these cells would be solely due to measurement error.

The Cyclin A2 data is log normal, i.e. histogram of log data is normal. We took the standard deviation of the log data for the prophase cells and used it to derive weighted Gaussian fits to the log data for the G2 cells [54,45,90]. The data histogram of the prophase cells is shown in Figure 3.19. The inset shows the histogram of all the mitotic cells, which includes the metaphase and the prometaphase also. The prophase cells can be seen in the red box in the inset.

[Figure 3.19 plot: histogram of the Cyclin distribution of prophase cells, # cells vs. log10(Cyclin A2(1)) over roughly 3.8 to 4.2, 10 bins used for display. Inset: Cyclin distribution of the mitotic cells over 0 to 5, 50 bins used in display.]

Figure 3.19. Prophase cells histogram with mitotics as inset.

The objective here is to minimize the measurement error’s contribution to the data variability. The approach to accomplish this is discussed below.

3.7.6.2. Generic problem formulation

We use $\sigma_{pro}$ (standard deviation of the prophase cells) as the starting point for deriving Gaussian fits to the Cyclin data of the G2 cells. Specifically, we take the Cyclin A2 data histogram for the G2 cells and fit it with a series of weighted Gaussians. The performance of the fit is evaluated based on two criteria:

• Quantitative

    o Squared error (square of the difference between the area under the data histogram and the area under the sum of Gaussian fits)

• Qualitative

    o Visual fit of the individual Gaussians, as well as their sum, to the data histogram

    o Looking at the linearity of the proxy for the Cyclin time curve (mean values vs. cumulatively summed peaks)

Since our logic for assigning time depends on the number of cells at any specific point in the cell cycle, the peak of each Gaussian represents the period of time that is spent by the average cell with the corresponding mean level of Cyclin. However, if we want to count time from the beginning of G2, we would have to add the time values for all the previous Gaussian fits to that of the current fit. Thus the cumulatively summed weighted peak values act as a proxy for time.
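The running sum described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the peak height of each weighted Gaussian is assumed to be w/(σ√(2π)), as labeled in Figure 3.20, with a common σ for all Gaussians.

```python
import math


def peak_heights(weights, sigma):
    """Peak height of each weighted Gaussian: w_i / (sigma * sqrt(2*pi))."""
    return [w / (sigma * math.sqrt(2.0 * math.pi)) for w in weights]


def cumulative_time_proxy(weights, sigma):
    """Running sum of the peak heights, used as a proxy for elapsed time in G2."""
    total = 0.0
    proxy = []
    for peak in peak_heights(weights, sigma):
        total += peak
        proxy.append(total)
    return proxy
```

Plotting the Gaussian means against this proxy gives the curve whose linearity is inspected as one of the qualitative criteria.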

The problem of fitting weighted Gaussians to the log data of the G2 cells using $\sigma_{pro}$ is formulated as an optimization problem. The mathematical notation required in the formulation is presented here.

We start with some basic definitions. Let

$n_G$ = number of Gaussians used in the fit,
$n_{bins}$ = number of bins used in plotting the histogram of G2 data,
$a$ = minimum significant value of data on the “x” axis,
$b$ = maximum significant value of data on the “x” axis,
$\mu_i$ = mean of the $i$th Gaussian fit to the data, where $i = 1, 2, \ldots, n_G$,
$A_{ab}$ = area under the data histogram between points $a$ and $b$,
$P_i(x)$ = Probability Density Function (PDF) of the $i$th Gaussian fit to the data:

$$P_i(x) = \frac{w_i}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu_i)^2}{2\sigma^2}}, \quad i = 1, 2, \ldots, n_G \tag{3.9}$$

where $w_i$ = the weight of the $i$th Gaussian fit.

The cumulatively summed peaks of the Gaussians are given by:

$$\sum_{j=1}^{i} \frac{w_j}{\sigma\sqrt{2\pi}}, \quad i = 1, 2, \ldots, n_G \tag{3.10}$$

[Figure 3.20 plot: # cells (0 to 500) vs. log(Cyclin A2) (3.5 to 4.3), 61 bins in histogram. Three Gaussian peaks are labeled peak$_i$ = $w_i/(\sigma\sqrt{2\pi})$ for $i$ = 1, 2, 3, with means $\mu_1$, $\mu_2$, $\mu_3$ separated by the distance $d$, between the data limits $a$ and $b$.]

Figure 3.20: Example of Gaussian fit to Cyclin A2 (log scale) histogram for cells in G2.

We want to obtain the best weighted Gaussian fits to the data (shown in Figure 3.20 by the green histogram), in an effort to minimize the measurement error component of this data. This problem can be formulated as minimizing the mean squared error between the area under the data histogram (Aab) and the area under the sum of the weighted Gaussian fits (cyan curve in Figure 3.20).

The error simply quantifies the precision of our fit. The accuracy of the fit must also be gauged, and this can be done using the visual fit. Finally, the plot of the mean values vs. the cumulatively summed peak heights of the Gaussian fits helps us identify the most precise and accurate fit, which is likely to give us the best performance in our time profile.

The optimization of Gaussian fits can be done subject to different sets of constraints. The constraints are determined by what is biologically feasible, and while formulating the problem, we went through two specific constraint sets. The first constraint set did not rank order the weights using the matrix equation 3.15. While this gave low error fits, they were incorrect. This is because, to be able to approximate true biology using such Gaussian fits, we must use a Gaussian weight progression that mimics the behavior of cell populations as they move through the cell cycle. Consequently, the more evolved and biologically realistic constraint set was the second constraint set, based on the assignment of monotonic weight constraints using equation 3.15. A mathematical discussion of this follows.

3.7.6.3. Problem formulation: monotonic weight constraints

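The optimization of Gaussian fits can be formulated as shown below:

$$\min_{\varepsilon,\, d,\, w_i} \left( A_{ab} - \sum_{i=1}^{n_G} P_i(x) \right), \quad i = 1, 2, \ldots, n_G \tag{3.11a}$$

and

$$\min_{\varepsilon,\, d,\, w_i} \left( h_j - g_j \right), \quad i = 1, 2, \ldots, n_G; \; j = 1, 2, \ldots, n_{bins} \tag{3.11b}$$

where $g_j$ = $j$th element of the overall Gaussian time curve, which has been linearly interpolated to have $n_{bins}$ values, and $h_j$ = height (frequency) of the $j$th bin, subject to the constraints:

$$\begin{cases} 0 \le w_j \le w_{j,\max}, & j = 1, 2, \ldots, n_G \\ w_j \le w_{j+1}, & \text{for } j \le n_H - 1 \\ w_j \ge w_{j+1}, & \text{for } j \ge n_H \end{cases} \tag{3.12}$$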

where $n_H$ is the index of the Gaussian that is assigned the highest weight and is given by

$$n_H = \begin{cases} \dfrac{n_G}{2} + 1, & \text{for } n_G \text{ even} \\[2ex] \dfrac{n_G + 1}{2}, & \text{for } n_G \text{ odd} \end{cases} \tag{3.13}$$

i.e., when $n_G$ is even, the later of the two central Gaussians is assigned the highest weight, and when $n_G$ is odd, the central Gaussian is assigned the highest weight. The weights are represented in vector notation as:

$$w = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_G} \end{bmatrix}_{n_G \times 1}$$

The inequality constraints (3.12) can be written in matrix notation as:

$$A\gamma \le 0 \tag{3.14}$$

Here $\gamma$ is the vector of optimization variables:

$$\gamma = \begin{bmatrix} w \\ \varepsilon \\ d \end{bmatrix}_{(n_G + 2) \times 1}$$

For convenience of notation, we show the matrix equation for the weights only. Let $A_w$ denote the matrix of coefficients of the weights from the inequality constraint below:

$$A_w w \le 0 \tag{3.15}$$

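Here, $A_w$ has the following structure:

$$A_w = \begin{bmatrix} 1 & -1 & & & & \\ & \ddots & \ddots & & & \\ & & 1 & -1 & & \\ & & & -1 & 1 & \\ & & & & \ddots & \ddots \\ & & & & -1 & 1 \end{bmatrix}_{(n_G - 1) \times n_G}$$

where the upper block (entries $1, -1$) has dimension $(n_H - 1) \times n_H$, the lower block (entries $-1, 1$) has dimension $(n_G - n_H) \times (n_G - n_H + 1)$, and all remaining entries are 0.

$$-\frac{\sigma_{pro}}{3} \le \varepsilon \le \frac{\sigma_{pro}}{3}$$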
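To make the structure of the weight constraints concrete, the following Python sketch builds the rows of a matrix that encodes the ordering in (3.12) and the index rule in (3.13). It is an illustration under our reading of those constraints, not code from CytoSys; the function names are hypothetical.

```python
def highest_weight_index(n_g):
    """1-based index n_H of the Gaussian with the highest weight, as in (3.13)."""
    return n_g // 2 + 1 if n_g % 2 == 0 else (n_g + 1) // 2


def monotonic_weight_matrix(n_g):
    """Rows of A_w such that A_w w <= 0 encodes the weight ordering in (3.12)."""
    n_h = highest_weight_index(n_g)
    rows = []
    for j in range(1, n_g):          # row j couples w_j and w_{j+1}
        row = [0] * n_g
        if j <= n_h - 1:             # rising side: w_j - w_{j+1} <= 0
            row[j - 1], row[j] = 1, -1
        else:                        # falling side: w_{j+1} - w_j <= 0
            row[j - 1], row[j] = -1, 1
        rows.append(row)
    return rows


def satisfies(rows, weights):
    """Check A_w w <= 0 row by row."""
    return all(sum(a * w for a, w in zip(row, weights)) <= 0 for row in rows)
```

For example, with four Gaussians the highest weight falls on the third one, so weights that rise to the third Gaussian and then fall satisfy the constraint, while strictly decreasing weights do not.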

The remaining constraints, shown below, are identical to those in the generic problem formulation.

The mean of the first Gaussian ($\mu_1$) is a non-integer multiple ($\kappa$) of standard deviations of prophase ($\sigma_{pro}$) distant from the minimum significant data value ($a$).

$$\mu_1 = a + \kappa\,\sigma_{pro}$$

The distance between consecutive Gaussians is ‘d’, which is bounded as follows:

$$\sigma_{pro} \le d \le 2\sigma_{pro}$$

The standard deviation of all the Gaussians is given by:

$$\sigma = \sigma_{pro} + \varepsilon$$
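A candidate fit can be scored against a histogram with a short Python sketch such as the one below. It is an illustration under stated assumptions and not the CytoSys implementation: the function names are hypothetical, bins are assumed to have equal width, areas are approximated with a rectangle rule, the bin-wise fit error is taken as a sum of squared differences, and consecutive means are assumed to be spaced exactly d apart starting from the first mean.

```python
import math


def gaussian_means(a, kappa, d, sigma_pro, n_g):
    """Means: mu_1 = a + kappa * sigma_pro, then equally spaced by d."""
    return [a + kappa * sigma_pro + i * d for i in range(n_g)]


def model_curve(x, weights, means, sigma):
    """Sum of the weighted Gaussian PDFs of (3.9), evaluated at x."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    return sum(w / norm * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
               for w, mu in zip(weights, means))


def fit_errors(bin_centers, bin_heights, weights, kappa, d, eps, a, sigma_pro):
    """Return (squared area error, summed squared bin error) for one candidate fit."""
    sigma = sigma_pro + eps
    means = gaussian_means(a, kappa, d, sigma_pro, len(weights))
    width = bin_centers[1] - bin_centers[0]      # assumes equal-width bins
    fitted = [model_curve(x, weights, means, sigma) for x in bin_centers]
    area_data = sum(bin_heights) * width         # A_ab by the rectangle rule
    area_fit = sum(fitted) * width
    area_error = (area_data - area_fit) ** 2
    bin_error = sum((h - g) ** 2 for h, g in zip(bin_heights, fitted))
    return area_error, bin_error
```

Either a manual procedure or a numerical optimizer could use these two numbers to compare candidate parameter sets, after checking that the candidate respects the bounds on d and the weight ordering.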

This optimization problem can be implemented in 2 ways:

1. Manual fitting

The user ensures that all constraints are met via a GUI-based tool created for this purpose. The tradeoff that the user makes, and this is subjective, is between the area error (as defined in (3.11a)) and the fit error (based on the goodness of fit as defined in (3.11b)). In the manual fit the problem is more one of constraint satisfaction than optimization. Moreover, the fit error can be calculated in a more relaxed way (i.e. instead of taking the mean square error between every data bin and corresponding fit value, we can use every other value or every third value).

2. Optimization routine for fitting

The optimization problem was implemented using fmincon (a MATLAB constrained optimization routine). The routine fmincon attempts to find a constrained minimum of a scalar function of several variables starting at an initial estimate. This is generally referred to as constrained nonlinear optimization or nonlinear programming. An assumption is that the problem is convex.

The work presented in this thesis used the manual method. We had limited success using the fmincon routine, and it was more useful in determining starting values for our manual process. The optimization routine therein requires some more refinement.

3.7.6.4. Single color correction

Single color correction can be carried out in one of three ways:

1. Manually using Excel

2. Using Ezyfit

A free curve fitting tool available online (Frederic Moisy, University of Paris-Sud) was included to allow users to perform single color scaling in CytoSys.

3. Using MATLAB curvefit toolbox

A version of CytoSys is available for users with the full version of MATLAB that allows single color scaling using MATLAB’s inbuilt curvefit toolbox.

3.8. Future work

The methodology described in this chapter has scope for improvement in many areas, and is intended to be a first (and hopefully large) step towards understanding the process involved in obtaining dynamic expression data from a statically sampled dataset.

One of the improvements to be made to the system is describing the cell cycle as a continuum of states whose number can be user specified. Other improvements include speed of operation and feasibility with large datasets.

For our software CytoSys, the future data integration vision is shown below:

[Figure 3.21 diagram. Labels recoverable from the diagram: MODELING/ANALYSIS (first principles, phenomenological; ODE/PDE, hybrid/stochastic; MATLAB, VCELL, Gemstone, Copasi, .NET); DATA VISUALIZATION (cytometry, signaling; PRISM, KEGG, PANTHER, BIOCARTA, Photoshop, .NET; prototype: MATLAB, implementation: C); DATA SOURCES (cytometry: flow, laser scanning, imaging; electrophoresis: Western blotting; WinList, FlowJo, DeNovo; FCS).]

Figure 3.21. CytoSys integration scheme (Our long-term view of data flow through CytoSys).

3.9. Summary

We present a methodology for extracting the embedded dimension of time from statically sampled flow cytometry data. In the methodology, the raw flow cytometry data is pre-processed so as to remove certain standard errors, and then processed to obtain an imprecise time curve. As part of the processing, noise reduction is performed on certain portions of the time curve. This time curve is then rescaled using single color experiment measurements. The whole methodology post WinList can be executed using our software CytoSys.

4. CELL CYCLE MODEL

4.1. Overview

In this chapter we explain the cell cycle process, and the control system that regulates it. We choose the most appropriate model from a host of choices, and present its features. Calibration efforts on this model are also presented.

4.2. Chapter Organization

In Section 4.3, we explain the cell cycle process. The cell cycle control system is explained in Section 4.4. Section 4.5 reviews the different mathematical models of the cell cycle available in the literature. We continue, in the same section, to describe an existing model that we chose from the literature, and modified in an attempt to replicate published results. In Section 4.6 we present our calibration of the cell cycle model discussed in Section 4.5. Section 4.7 presents a simple cell population model that is intended to start work in this direction.

4.3. Cell cycle

As was explained in Chapter 1, our motivation for investigating the cell cycle process stems from its causal role in cell proliferation, and the fact that aberrance of the cell proliferation mechanism holds one of the keys to understanding cancer [22]. The role of cell cycle molecules in regulating proliferation is highlighted by the fact that a number of these molecules are found to be mutated or deregulated in numerous tumors [76].

The cell cycle is a tightly regulated sequence of events that culminates in a cell’s division into two daughter cells. The cell cycle is divided into four distinct phases or sections, two of which are principal, and two are “gap” phases which ensure proper conditions for the error free functioning of the principal phases.

The first principal phase is referred to as the DNA Synthesis phase (S phase) during which the cell’s genome and all other cellular components are doubled.

The second principal phase is referred to as Mitosis (M phase) during which the genome is halved and distributed symmetrically in the double-content cell, which then cleaves into two daughter cells in a process referred to as cytokinesis.

The gap phases that separate these principal phases are referred to as Gap 1 (G1) phase and Gap 2 (G2) phase. G1 separates M phase from S phase, while G2 separates S phase from M phase. A third gap phase, Gap 0 (G0), in which cells remain in a quiescent or resting state, occurs before what is called the restriction point (in G1) in eukaryotic cells, and is usually due to a lack of growth factors or nutrients.

The different cell cycle phases are illustrated in Figure 4.1.

Figure 4.1. Cell cycle phases. (figure reproduced from [106])

G0 (when present), G1, S and G2 are collectively referred to as Interphase (inter meaning between mitoses). The main activity in Interphase is the cell content duplication that occurs during the S-phase.

Mitosis itself consists of a rapid, orchestrated sequence of changes in the cell that ensure that the doubled cell content is symmetrically distributed between the two daughter cells. The first of these is prophase, during which the nuclear membrane disintegrates and the nucleolus disappears [107]. The nucleolus is a non-membrane-bound structure in the nucleus that produces ribosomes, which translocate out of the nucleus to the sites where they carry out protein synthesis. As prophase continues, the chromosomes condense and become visible. Mitotic spindles made of microtubules begin to form between the poles, and kinetochores (protein structures on the chromosomes where the spindles attach) begin to mature and attach to the spindles. During metaphase, the second sub-phase of mitosis, the chromosomes become attached to the mitotic spindle at the kinetochores and align along the metaphase plate at the equator of the cell.

During the next stage of mitosis, called anaphase, the kinetochore microtubules shorten and separate the chromosomes to opposite poles of the cell, while the polar microtubules and the cell elongate. This is followed by a stage called telophase, during which the chromosomes reach the poles of the cell and begin to disappear. The polar microtubules continue to elongate. Nuclear membranes form, nucleoli appear, and the chromosomes decondense.

The last stage of mitosis is cytokinesis, during which a cleavage furrow is formed at the center of the cell, and the cell divides. Sometimes mitosis is treated as consisting only of the stages that precede cytokinesis, and cytokinesis is treated as a phase itself (C-phase). In this scheme, mitosis is specifically defined as the nuclear division phase and cytokinesis is defined as the cytoplasmic division phase.

[Figure 4.2 panel labels: chromosomal condensation; microtubule elongation; chromosomal attachment to spindle; alignment of chromosomes on equator of cell; chromosomes start to separate to opposite poles of cell; chromosomes reach poles ready for division; cytokinesis.]

Figure 4.2. Mitosis [107].

The time a mammalian cell takes to progress through the cell cycle varies, mainly with cell type (assuming equal environmental conditions). The average cell cycle time is approximately 24 hours [76]. Variation in cell cycle time between cell types is mainly due to differences in the time spent between cytokinesis and the restriction point. The time a cell takes to pass from the start of S phase through M is remarkably constant between cells: approximately 6 hours for S phase, 4 hours for G2 and 1-2 hours for mitosis and cytokinesis [8].
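As a quick check on the durations quoted above, the time left over for G1 follows by subtraction. The sketch below is illustrative arithmetic only: the function name `g1_duration_hours` is hypothetical, and the 24-hour total is the approximate average quoted above rather than a measured value for any particular cell type.

```python
def g1_duration_hours(total=24.0, s_phase=6.0, g2=4.0, mitosis=(1.0, 2.0)):
    """Return the (min, max) hours left for G1 once S, G2 and M are accounted for."""
    m_low, m_high = mitosis
    shortest_rest = s_phase + g2 + m_low    # S + G2 + fastest mitosis/cytokinesis
    longest_rest = s_phase + g2 + m_high    # S + G2 + slowest mitosis/cytokinesis
    return total - longest_rest, total - shortest_rest


low, high = g1_duration_hours()
print(f"G1 takes roughly {low:.0f} to {high:.0f} hours of a 24-hour cycle")
```

With these numbers, roughly half of an average cycle is spent in G1, which is consistent with G1 being the phase whose length varies most between cell types.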

Disruption of the cell cycle mechanisms has serious consequences. Cell cycle timing is crucial in this regard and is determined by a set of chemical reactions and “checkpoint” controls in the form of feedback and feedforward mechanisms.

Checkpoints control critical, irreversible transitions such as the transition between the G1 phase and DNA synthesis (S phase) (G1/S checkpoint). Similarly, once a cell has replicated its chromosomes and entered G2 phase, it must pass a second checkpoint (G2/M checkpoint) before proceeding to mitosis (M phase). Having passed this threshold, the cell encounters an additional checkpoint that ensures proper alignment of the chromosomes on mitotic spindles (metaphase-anaphase checkpoint).

The general cell cycle scheme appears to be conserved from yeast to mammals [9,10]. Passage through the eukaryotic cell cycle is regulated by synthesis and destruction of cyclins that bind and activate Cyclin Dependent Kinases (CDKs).

CDK levels are more or less constant throughout the cell cycle. CDK activity, in turn, is regulated by three distinct mechanisms:

1. Cyclin availability

The kinase subunits are believed to be widely abundant throughout the cell cycle, but their activities are regulated through binding to tightly controlled levels of their cyclin partners. Cyclin concentrations are determined by the opposing forces of transcription and degradation: specific transcription factors and ubiquitin dependent proteolysis systems (e.g. the anaphase-promoting complex (APC)), which themselves are regulated by additional controls.

2. Regulation through phosphorylation

For example, CyclinB/Cdk1 can be inhibited through phosphorylation by hWee1 or activated by the phosphatase Cdc25C.

3. Via CDK inhibitors (CKIs)

Active cyclin-Cdk complexes can also be inactivated by binding CKIs such as Kip1. The levels of CKIs depend on their production rate, which is governed by regulated transcription factors, and their destruction rate (phosphorylated CKIs are rapidly ubiquitinated and degraded).

[Figure 4.3 diagram labels: P, Cyclin, CDK, CKI.]

Figure 4.3. CDK regulation can be done either via Cyclin availability, or through phosphorylation or through inhibitors (CKIs).

The predominant mode of control of CDK activity is cyclin level regulation. Cyclins are broadly classified into G1 cyclins, G1/S cyclins, S cyclins and M cyclins.

4.4. Cell cycle control system

As already mentioned, the cell-cycle control system is centered around the cyclin dependent kinases (CDKs). We aim to understand their precise role in different phases of the cell-cycle.

Figure 4.4 shows this involvement sequentially, using generic terminology. Based on their roles in the cell-cycle, the different CDKs and their cyclin partners are named G1, G1/S, S and M. For instance, cyclin D in vertebrates is a G1 cyclin that, according to established cell-cycle literature, plays a role in converting extracellular mitogenic signals into a trigger for entering (starting) the cell-cycle [76]. Similarly, cyclin E in vertebrates is a G1/S cyclin, whose role is mainly to ramp up during G1 and prepare the way for S-phase. Cyclin A plays the role of the S cyclin, and cyclin B is the M cyclin. The cyclins are key to their CDK partners being able to perform their duties, so in our discussion we use the term CDK activity to mean cyclin-CDK complex activity.

The cell cycle in multicellular eukaryotes is primarily controlled by two CDKs: CDK1 in M phase, and CDK2 in S phase. Animal cells also contain CDK4 and CDK6, which play a role in cell cycle entry in response to extracellular factors.

[Figure 4.4 flow diagram, in order: extracellular signals; G1 commences (G1/S and S cyclins increase; S cyclins are inhibited by APC and CKIs); Start checkpoint (CKIs/APC degrade; S cyclin-CDKs become active); S commences (S cyclin-CDKs trigger chromosome duplication; G1/S CDKs promote their own inactivation by triggering G1/S cyclin destruction; G1/S cyclins degrade); G2 commences (M cyclins accumulate, inhibited by phosphorylation); G2/M checkpoint (removal of inhibitory phosphorylation); M commences (APC increases); metaphase/anaphase checkpoint (destruction of sister chromatid cohesion and of S and M cyclins).]

Figure 4.4. Sequence of cell cycle control system initiated activities that constitute the cell cycle.

While cyclin binding is required for CDK activity, full CDK activity is achieved only when CDKs are phosphorylated by enzymes called CDK-activating kinases (CAKs). However, CAKs are usually maintained at a constantly high level throughout the cell cycle, so cyclins determine the rate limiting step in cell cycle control.

Extracellular signals trigger entry into the cell cycle. Expression of the G1/S and S cyclin genes is triggered and gives rise to G1/S CDK activity, while S CDK activity is inhibited by CDK inhibitors (CKIs) and APC (Anaphase Promoting Complex). When G1/S CDK activity reaches its peak, the Start checkpoint of the cell cycle is passed and the cell is committed to DNA synthesis (S phase). G1/S CDK activity also causes APC and CKI degradation, thereby allowing S CDK activity to commence. This triggers the onset of S phase. As S phase progresses, G1/S CDKs promote their own inactivation by triggering G1/S cyclin destruction. S phase ends and the G2 waiting period begins. At the onset of G2, M cyclins start to accumulate and form complexes with their CDK counterparts. There is no M CDK activity, however, because of inhibitory phosphorylation of the CDKs. At the onset of M phase, the inhibitory phosphorylation of the M CDKs is removed, and M CDK activity triggers the G2/M checkpoint. Early mitotic events such as spindle assembly lead to alignment of the duplicated sister chromatids on the mitotic spindle in metaphase. M CDK activity also causes an increase in APC, which in turn triggers the metaphase/anaphase transition checkpoint. APC now stimulates destruction of the proteins that hold the sister chromatids together, and destruction of S and M cyclins. This leads to an inactivation of all CDK activities in late mitosis, as well as an increase in CKIs. Inactivation of CDKs leads to spindle disassembly and M phase (and hence cell cycle) completion.

4.5. Modeling of the cell cycle overview

Nearly 40 mathematical models of the cell cycle control system have been published over the past 20 years. Most have focused on portions of the cell cycle:

1. minimal model of cdc2 activation [108],

2. cdc2/cyclin interactions [109],

3. cdc2/APC interactions [110],

4. general coupled phosphorylation-dephosphorylation cycle induced oscillations [111],

5. cyclin B regulation [112],

6. the role of mdm2 [113],

7. models of a hypothetical reversibly binding cdk inhibitor [114].

Some groups have focused on specific cell cycle phases/transitions such as:

a. the G1 phase [115-117]

b. the G1/S transition and its associated “restriction point” [4,17,19,118-121]

c. the G2/M transition [122-126]

Novak and Tyson’s cell cycle models are instances of attempts at assembling comprehensive models of the entire cell cycle. Such attempts are more advanced in the case of simpler organisms such as yeast. A plethora of yeast cell cycle models are available [20,12,48-53,127]. Other species’ models include frog egg models [124] and mammalian models (Novak06, [17]), as well as generic eukaryotic models [11,128].

Most cell cycle models are highly context dependent. The context determines both the technique used for model implementation and the model purpose. The large majority are ODE models based on mass action assumptions. Widely divergent model types are also available in the literature, however, such as:

• Stochastic [129]

• Time delay [130]

• Partial Differential Equation [131]

• Boolean logic [16]

[Figure 4.5 diagram labels: CycD:Cdk4:Kip1, CycE:Cdk2, CycA:Cdk2, CycB:Cdk1, Kip1.]

Figure 4.5. The Novak04 model components. Black cartoon reproduced from publication. Red names and arrows were added to indicate modeled reactions not originally included in the diagram (Weis MC qualifier).

Our group initially considered three models as the basis for implementing the mammalian cell cycle model. John Tyson and Bela Novak have modeled the cell cycle extensively, and their studies are among the best reputed and most referenced in the field. Their models are the most comprehensive and detailed published, being among the few that attempt to model the complete cell cycle. Their two most recent papers [19,14] were initially considered for implementation. We decided to adopt the Novak04 model [19] because it is specifically a mammalian cell cycle model, as opposed to the Csikasz-Nagy06 model [14], which, being generic, is necessarily non-specific in some areas. For instance, it models E2F as a generic Cyclin E/A transcription factor but does not model its regulator Rb. Additionally, although the mammalian implementation of this generic model is claimed to be adopted from Novak04, the two model structures differ widely.

A recent model by Haberichter et al. was also a contender for our model implementation exercise. The authors attempted to address several issues with the Novak04 model, but left out the S/G2 transition entirely, as well as the G2 and M phases. Hence it is not a complete cell cycle model either.

The dynamics as generated by simulation of the Novak04 model can be best understood by putting the model together with the dynamic time curves.

Very simply viewed, the core of this model, in keeping with our generic description in Figure 4.4, is CDK activity. With reference to Figure 4.5, growth factors initiate a switch-like mechanism (ERG/DRG interaction) that generates an on-pulse. This on-pulse catalyzes the activation of the CDK4-cyclin D complex, which in turn activates the CDK2-cyclin E complex. CDK2-cyclin E is a G1/S CDK-cyclin complex which, when it reaches peak activity, triggers the Start checkpoint. It also triggers APC and CKI degradation. In this model the CKI is Kip1.

This APC and CKI degradation causes S phase to commence, primarily by removing the inhibitory influences on the S CDK-cyclin complexes. In mammalian cells, CDK2-cyclin A is the S CDK-cyclin complex, and its activity commences.

Simultaneously the G1/S CDK promotes its own inactivity by stimulating the degradation of its partner cyclin. In other words CDK2 promotes the degradation of cyclin E, as S phase commences.

The cell proceeds through S phase, and then G2 begins. At the onset of G2, the M cyclins begin to accumulate and form complexes with their CDK partners. In this case the M cyclin is cyclin B, and its CDK partner is CDK1. CDK1-cyclin B complex activity is inhibited by phosphorylation, however, and cyclin B steadily accumulates until the onset of M phase, when the inhibitory phosphorylation is suddenly removed and the complex abruptly becomes active.

CDK1-cyclin B activity leads to an increase in APC, which in turn triggers the metaphase/anaphase transition checkpoint. APC now stimulates destruction of the proteins that hold the sister chromatids together, and destruction of S and M cyclins (cyclins A and B respectively). This leads to an inactivation of all CDK activities in late mitosis, as well as an increase in CKIs. Inactivation of CDKs leads to spindle disassembly and completion of M phase, and hence of the cell cycle.

4.6. Modified Tyson model calibration attempts

We present here our attempts at calibrating the modified Tyson cell cycle model presented in Section 4.5. Before attempting to calibrate this ODE model, we must ask whether such an exercise can be done with minimal effort. Systems theory provides us with a means to do so in the form of parameter sensitivity analysis. Sensitivity analysis seeks to find out how sensitive the outputs of our system are with respect to different parameters. Varying the most sensitive parameters allows us to cover more of the output space, and to see effects in our analysis faster and with less effort. This becomes increasingly important as the size and scope of the parameter estimation problem increase.
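The rank-ordering idea can be illustrated on a toy problem. The sketch below is not the Novak04 model and not necessarily the sensitivity method used in this work: `toy_output` is a hypothetical three-parameter stand-in, and the normalized sensitivities are estimated with central finite differences, which is one common choice among several.

```python
import math


def toy_output(p):
    """Hypothetical model output: x(T) for dx/dt = k_syn - k_deg*x, x(0) = 0,
    plus a weak additive leak term. A stand-in, not the Novak04 model."""
    T = 5.0
    x = p["k_syn"] / p["k_deg"] * (1.0 - math.exp(-p["k_deg"] * T))
    return x + 0.01 * p["k_leak"]


def normalized_sensitivities(model, params, rel_step=1e-4):
    """Estimate (p / y) * dy/dp for each parameter by central differences."""
    base = model(params)
    result = {}
    for name, value in params.items():
        h = rel_step * value                      # step proportional to the parameter
        up = dict(params)
        up[name] = value + h
        down = dict(params)
        down[name] = value - h
        dy_dp = (model(up) - model(down)) / (2.0 * h)
        result[name] = dy_dp * value / base       # dimensionless sensitivity
    return result


params = {"k_syn": 2.0, "k_deg": 1.0, "k_leak": 1.0}
sens = normalized_sensitivities(toy_output, params)
ranking = sorted(sens, key=lambda name: abs(sens[name]), reverse=True)
for name in ranking:
    print(f"{name}: {sens[name]:+.3f}")
```

Parameters at the top of such a ranking are the natural candidates to vary first during calibration, while those at the bottom can often be held fixed.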

Systems theory also provides us with another powerful analytical method called parameter identification theory. This theory is concerned with a question that seeks to bridge the gap between effects (the outputs) and the causes that lead to them (system states, system parameters, etc.). This question is: Is it possible that the same set of outputs can be generated for the system at hand, using two (or more) sets of parameters? If so, is one of these parameter sets the 'most correct'? This question is usually asked after our calibration analysis has been carried out.
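A minimal example makes the question concrete. In the hypothetical model below (an illustration only, unrelated to the Novak04 equations), the parameters `a` and `b` enter the output only through their product, so two different parameter sets reproduce the same output exactly and no amount of output data can tell them apart.

```python
import math


def toy_model(a, b, c, times):
    """Hypothetical output y(t) = a*b*exp(-c*t); a and b appear only as a product."""
    return [a * b * math.exp(-c * t) for t in times]


times = [0.5 * i for i in range(11)]
y_first = toy_model(2.0, 3.0, 0.5, times)    # a*b = 6
y_second = toy_model(6.0, 1.0, 0.5, times)   # a*b = 6 with different a and b

max_diff = max(abs(u - v) for u, v in zip(y_first, y_second))
print("largest difference between the two outputs:", max_diff)
```

In such a case only the combination `a*b` can be estimated from the outputs; deciding which individual values are 'most correct' requires additional measurements or prior knowledge.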

In this section, we show our attempts at:

• Replicating Tyson model outputs

• Calibrating the model with our cell cycle data

4.6.1. Replicating Tyson model outputs

The Tyson model of the mammalian cell cycle that we adapted has 18 differential equations and 76 parameters, and was solved using the MATLAB ode15s solver. When the Cdh1 level crosses a certain threshold value, the model triggers an exit from the ODE solution, to replicate the end of G2 and the onset of mitosis. At this point the cell mass is halved, and the simulation continues to completion.
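The stop, reset and resume pattern described above can be sketched on a one-variable toy system. Everything in this sketch is an assumption made for illustration: the state is the cell mass alone, growth is exponential, the trigger is the mass itself crossing a threshold (standing in for the Cdh1 trigger), and the integrator is a fixed-step explicit Euler scheme rather than a stiff solver such as ode15s.

```python
import math


def simulate_with_division(mu, m0=1.0, threshold=2.0, t_end=80.0, dt=0.01):
    """Integrate dm/dt = mu*m with Euler steps, halving m at each threshold crossing."""
    m = m0
    times, masses, division_times = [0.0], [m0], []
    n_steps = int(round(t_end / dt))
    for step in range(1, n_steps + 1):
        m += dt * mu * m              # one Euler step of the growth equation
        t = step * dt
        if m >= threshold:            # event detected: interrupt the integration
            division_times.append(t)
            m *= 0.5                  # the cell mass is halved at division
        times.append(t)
        masses.append(m)              # integration resumes from the reset state
    return times, masses, division_times


mu = math.log(2.0) / 24.0             # hypothetical rate: mass doubles every 24 h
times, masses, division_times = simulate_with_division(mu)
print("division times (h):", [round(t, 2) for t in division_times])
```

An event-based reset of this kind makes the trajectory discontinuous, which is one reason a model without an event-based trigger is an attractive alternative.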

It must be noted that to achieve the initial conditions that were published, we ran the model through an equilibration phase until model states reached the published initial condition values.

To re-implement Tyson's published model, we started out by coding the Novak04 model in MATLAB, adopting the published equations, parameters and simulation schemes. Since we could not match the published dynamics in this way, we contacted the authors for guidance, but received no response. We next contacted another group who had published a partial corroboration of the Novak04 model simulations [105]. They responded with their code and advice, and it became clear that their results matched what had been obtained so far post-equilibration (as explained in the preceding paragraph).

We then proceeded to run a parameter sensitivity analysis, which helped us rank-order the parameters in terms of how sensitive the model outputs were to the respective parameters. We chose the top three parameters and varied them, and were able to reproduce most of the published results. This is shown in Figure 4.6.

Figure 4.6. Tyson model refit to match published output curves.

Having spent a considerable amount of time working with experimental data on cell cycle proteins such as the cyclins, it was apparent to our group that, even though replicated, these dynamic time curves did not agree with what we saw in our data. So the next step we took was to attempt to refit the Novak04 model with our experimental data of cyclins measured for a leukemic cell line (K562 cells). More detailed treatment of this cell line and the work that was done in conjunction with dynamics generation is presented in Chapter 3.

4.6.2. Calibrating the model with our cell cycle data

The first step that was taken with our cell cycle data, in an attempt to fit the Novak04 model to it, was to scale it both in amplitude and in time to match the corresponding Novak04 model curve maxima and to accommodate the fact that the Novak04 time axis was in hours, while our time axis was in percentage (where 100% represented one cell cycle). Both the data and the model are qualitative in their nature, so this scaling did not introduce any additional uncertainty.

The only point of importance that was maintained in the analysis was the fact that the cyclin A and B data was generated such that both time curves could be compared to one another. Hence we made an effort to preserve this ratio in the scaling that was done. To reiterate, we scaled the data in time to match the Novak04 model units. We scaled the data in amplitude so that the maximum of the cyclin B matched that of the Novak04 model, but that of cyclin A was fixed according to the ratio determined by the data (cyclin A data/cyclin B data).
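One reading of this scaling is that a single amplitude factor, chosen from cyclin B, is applied to both curves. The sketch below follows that reading; the function name, the sample values, the 24-hour cycle length and the model maximum of 1.0 are all hypothetical placeholders rather than values from the K562 data or the Novak04 model.

```python
def rescale_cyclin_data(time_percent, cyclin_a, cyclin_b, cycle_hours, model_b_max):
    """Convert time from % of cycle to hours and scale both cyclins by one factor."""
    time_hours = [p / 100.0 * cycle_hours for p in time_percent]
    factor = model_b_max / max(cyclin_b)        # chosen so that max(B) matches the model
    scaled_a = [a * factor for a in cyclin_a]   # same factor, so the A/B ratio is kept
    scaled_b = [b * factor for b in cyclin_b]
    return time_hours, scaled_a, scaled_b


# Hypothetical measurements in arbitrary fluorescence units
time_percent = [0.0, 25.0, 50.0, 75.0, 100.0]
cyclin_a = [10.0, 40.0, 120.0, 200.0, 20.0]
cyclin_b = [5.0, 10.0, 60.0, 400.0, 30.0]

t_h, a_s, b_s = rescale_cyclin_data(
    time_percent, cyclin_a, cyclin_b, cycle_hours=24.0, model_b_max=1.0
)
print(t_h)
print(max(a_s) / max(b_s), max(cyclin_a) / max(cyclin_b))
```

Because both curves share one factor, any comparison between cyclin A and cyclin B that held in the raw data still holds after scaling.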

We also had to understand what our data meant when carrying out this exercise. Our data indicated that we had cyclin A2 and cyclin B1 concentrations. However, does this refer to free cyclins, to cyclin-CDK complexes, or to inactive cyclin-CDK-CKI trimers? Moreover, how do we account for the cyclin isoforms (cyclin A1 vs cyclin A2; cyclin B1 vs cyclin B2)? The latter question was resolved easily, since it was determined that cyclin A2 and cyclin B1 were the only isoforms that were of importance (statistically) in the mammalian cell cycle, and these were what we were measuring.

To answer the earlier question, about whether we are measuring cyclin monomers, dimers or trimers, we found that flow cytometry fluorescent probes do not distinguish between the three. So what we refer to as cyclin A is in fact cyclin A monomer + cyclin A-CDK dimer + cyclin A-CDK-Kip1 trimer. This led us to conclude that, in the Novak04 model, we had to compare our measured cyclin A2 with cyclinA + cyclinA-Kip1 from the model (CDK is not tagged, and the cyclin A term covers all the active cyclin A-CDK complexes as well). We also compare the measured cyclin B1 with the model’s cyclin B, since the model does not include any terms for the dimer or the trimer.

Before performing model calibration, we started out with the comparisons shown in Figure 4.7.

Figure 4.7. Plots of the K562 data versus current (refit) Novak04 model output.

We next proceeded through multiple parameter sensitivity analysis steps, and converged on the two following best fits of the Novak04 model to the K562 data.


Figure 4.8. Two best fit solutions for a K562 cyclin B maximum (and corresponding cyclin A ratio) estimated as an additional scaling parameter. In the top solutions the scaling parameter is optimized to 1.6; in the bottom ones it is optimized to 1.

Here we briefly consider these model calibration results and the related issue of parameter identification. Neither of the fits obtained is good enough. For cyclin B, there is no good fit using the parameter estimation methods that were applied. In this light, while we may declare that one or the other of these fits is the best, how can we truly know that it is?

The results for the cyclin A fit, in contrast, are encouraging, but there is plenty of scope for improvement. The parameter estimation was rigorous, but not exhaustive, so there is scope for more research in this direction. However, the investigator’s efforts may be better invested in gathering more experimental data on cell cycle proteins (e.g. cyclin E, Kip1 and so on). Testing the Novak04 model structure for integrity, and attempting to model it without an event based trigger, are also potentially intriguing areas for research.

4.7. Cell population model preview

So far, the modified Tyson model that was discussed is at the intracellular level. This is essentially a control system based on the CDKs and the adjunct proteins which regulate their activities. If we seek to incorporate this system into a cell population model, then one way to do this is to start thinking of the cell cycle control system as an input-output system.

In a nutshell, we first aim to calibrate and validate the modified Tyson cell cycle model (Chapter 4), and then calibrate and validate the upstream signaling model (Chapter 5). Then we aim to calibrate and validate the overall signaling model (upstream and downstream connected). The computational model thus generated represents a median cell (due to the data used to calibrate it). We can next develop a procedure for statistically representing the biological cell population using an N cell collective in silico. The key idea is that when the computational model of the N cell collective is simulated, the overall response for all state variables would be similar in spirit to that presented for cyclin A2 in Figure 3.13 (B). In essence the standard deviation at each point in the cell cycle should be the “same” between the biological data ($\sigma_{data}$) and the simulation of the N cell collective ($\sigma_{Ncollective}$), where the norm could be the L2 norm:

$$\min \int_0^T dt \, \left\| \sigma_{data} - \sigma_{Ncollective} \right\| \qquad (4.1)$$

One way of proceeding towards an implementation of Equation (4.1) is to recognize that certain model coefficients are fixed (they are either physical constants or simply the same over the cell population). Hence we can divide the model coefficients into $C = (C_{fixed}, C_{bio\_var})$, where $C_{fixed}$ is the fixed part and $C_{bio\_var}$ is the biologically varying part. The art of dividing this is based on science, literature, and careful examination of the characteristics of each coefficient in the individual cell mathematical model. The N cell collective algorithm is as follows.

STEP 1: Define a set D of points in the coefficient space, with $C_{bio\_var}$ of the median cell as the center point and a ‘reasonable’ range around each of the coefficients. For each point in this coefficient space D, simulate the computational model of the individual cell using a Monte-Carlo approach and collect the output time profiles.

STEP 2: If the output time profile of a simulated cell lies in the region of filtered measured responses represented by Figure 3.13 (B), then add the corresponding point to a ‘feasible set’ E, so that $E \subseteq D$. Otherwise discard the simulation. Repeat until we have a large number of ‘feasible points in the coefficient space’.

STEP 3: Through an exhaustive search, select F, a subset of the ‘feasible set’ E, such that the objective function in Equation (4.1) is below a tolerable error, i.e., $F \subseteq E \subseteq D$. The number of elements of the set F is then the number N of cells in silico; these cells, simulated with their corresponding coefficients, would give an output very close to the measured/filtered responses represented by Figure 3.13 (B).
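The first two steps, and the evaluation of the objective in Equation (4.1), can be sketched on a toy problem. The sketch makes several assumptions that are not part of the proposal above: the single-cell model is a hypothetical one-coefficient profile `c*sin(pi*t)`, the measured envelope and `sigma_data` are invented placeholders for Figure 3.13 (B), and the exhaustive subset search of STEP 3 is not implemented; the objective is simply evaluated on the whole feasible set.

```python
import math
import random
import statistics

TIMES = [i / 20.0 for i in range(21)]                  # normalized cell cycle time
SHAPE = [math.sin(math.pi * t) for t in TIMES]         # hypothetical profile shape


def simulate_cell(c):
    """Hypothetical single-cell model with one biologically varying coefficient c."""
    return [c * s for s in SHAPE]


def in_envelope(profile, lower, upper):
    """STEP 2 test: does the profile stay inside the measured envelope?"""
    return all(lo <= y <= hi for y, lo, hi in zip(profile, lower, upper))


def collective_objective(profiles, sigma_data):
    """Discrete L2 norm of sigma_data - sigma_Ncollective over the cycle."""
    dt = TIMES[1] - TIMES[0]
    total = 0.0
    for k in range(len(TIMES)):
        sigma_sim = statistics.pstdev([p[k] for p in profiles])
        total += (sigma_data[k] - sigma_sim) ** 2 * dt
    return math.sqrt(total)


# Placeholder "measurements": envelope of +/-20% around the median cell (c = 1)
lower = [0.8 * s for s in SHAPE]
upper = [1.2 * s for s in SHAPE]
sigma_data = [0.1 * s for s in SHAPE]

rng = random.Random(0)
D = [rng.uniform(0.5, 1.5) for _ in range(500)]        # STEP 1: sample around the median
E = [c for c in D if in_envelope(simulate_cell(c), lower, upper)]   # STEP 2: feasible set
objective = collective_objective([simulate_cell(c) for c in E], sigma_data)
print(len(D), len(E), round(objective, 4))
```

Even in this toy setting the cost is dominated by the number of single-cell simulations, and the subset search of STEP 3 grows combinatorially with the size of E.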

The CytoSys software (Chapter 3) can be enhanced to handle the algorithm above. The computational burden is expected to be severe and will require ingenious solutions including parallelization, cloud computing and native code use.

Hierarchical Model: Using the cell population data we can then create a population level phenomenological model using data from cell population doubling times, cell sizes, dose responses, and other data. Concepts from hierarchical systems theory can be used to relate this model to the model of the N cell collective.

Data preservation: Data (numerical data, mathematical models, experimental procedures and typical outputs) require archival in ‘open access’ databases. In a similar vein, models need to be standardized by, for example, conversion to SBML format.

5. UPSTREAM MODELING

5.1. Overview

In this chapter we develop a mathematical model of the Flt3 ligand/Stem cell factor (SCF) induced signaling pathway in the context of hematopoietic malignancies, particularly leukemia. Model calibration and validation is left as future work.

5.2. Chapter Organization

We start out with an initial discussion of leukemia and its different characteristics, and of what may be happening to cause it to manifest, in Section 5.3. This is followed by a detailed discussion of the biology of the Flt3 signaling pathway, which has been implicated by numerous studies in Acute Myeloid Leukemia (AML) (Section 5.4). In Section 5.5, we present the modeling itself (different parts of the ODE model we built for the Flt3 signaling pathway). Complex Systems Biology techniques are then applied to this pathway in order to better understand its functioning and its dominant relationships. This includes modularization of the pathway and the development of a signal flow diagram, as well as a partial calibration attempt to give quantitative rigor to the model. The chapter concludes with a list of tasks in progress.

5.3. Introduction

We are interested in exploring the role of the Flt3 signaling pathway in Acute Myeloid Leukemia (AML). Flt3 (FMS-like tyrosine kinase 3) is a receptor tyrosine kinase that has a crucial role in normal hematopoiesis [23]. There has been considerable recent literature that details the role of Flt3 in hematopoietic malignancies [23]. One way of approaching this is to understand the role of Flt3 signaling in normal hematopoiesis, and then try to zero in on what exactly may be going wrong in the signaling, leading to AML. However, to frame the problem in a larger context, we first discuss AML and the cellular processes whose aberrance might be its cause.

According to the American Cancer Society, 44,270 new cases of leukemia were predicted for 2008 based on earlier statistics and growth projections. Of these, 13,290 were expected to be AML, and 8,820 (66.37 %) of those with AML were expected to die. The high mortality rate indicates the severity of the problem. We chose AML because it is very suitable as a model system. Particularly in conjunction with flow cytometry, leukemia provides a convergence of disease system and measurement technique. Flow cytometry is discussed in detail in Chapter 3. What makes it attractive for measuring leukemic samples is that flow cytometry can resolve subpopulations of blood cells, and the growth of one subpopulation of blood cells at the expense of another is precisely how leukemia manifests.

The systemic nature of AML can be understood only by looking at the hematopoietic system and the multilevel regulation that is so essential to our biology. Hanahan and Weinberg’s seminal paper [22], on the different changes to normal biology that must be effected in order for cancer to elude the organism’s corrective action system, illustrates the complexity of the system at hand.

For the purposes of this discussion, we focus on the hematopoietic system and how its deregulation can occur. The mammalian hematopoietic system is a vastly distributed system whose origin lies in the bone marrow. The bone marrow gives rise to what are variously referred to as long term hematopoietic stem cells [132] or self-renewing pluripotent hematopoietic stem cells [23]. These stem cells undergo two distinct processes. The first is differentiation (see Figure 5.1), which is the change of a stem cell (with more potential for change and less commitment to specific functions) into a mature cell (with little or no potential for change and more or complete commitment to the specific functions that the mature cell is designed for). The second is cell proliferation, in which a cell (at any stage of differentiation) goes through the mitotic cell cycle and duplicates itself, possibly with certain changes. Normal hematopoiesis is a very well regulated orchestration of cell differentiation, cell proliferation and apoptosis (programmed cell death). AML is defined as the accumulation of blast cells (immature cells) in the bone marrow. Clearly, the very definition of AML implies that the above three processes are somehow affected in concert.


Figure 5.1. Hematopoietic hierarchies [23].

Figure 5.1 is explained below:

I. The hierarchy from [23]: Stirewalt and Radich. The maturation and differentiation of cells during normal hematopoiesis is shown, indicating how expression of FMS-like tyrosine kinase 3 (FLT3) is linked to this process (shown as +, –, +/– or ? (unknown)). FLT3 is mainly expressed by early myeloid and lymphoid progenitor cells, with some expression by the more mature monocytic lineage cells. Colony-forming units for the erythroid (CFU-E), megakaryocytic (CFU-MK), granulocytic–monocytic (CFU-GM), basophilic (CFU-B), granulocytic (CFU-G), monocytic (CFU-M), and dendritic (CFU-D) lineages are shown. NK cell, natural killer cell; RBC, red blood cell.

II. Two possible hierarchies from [132]: Rosenbauer and Tenen. a. Long-term (LT) and short-term (ST) hematopoietic stem cells (HSCs) provide long-term (more than 4 months) and short-term reconstitution in lethally irradiated mice. According to the model established by the Weissman group, ST-HSCs produce multipotential progenitors (MPPs), which have lost all self-renewal potential but are still able to generate all hematopoietic cell lineages. The common lymphoid progenitors (CLPs) are thought to give rise to T and B cells, although more recent data point towards a progenitor that is distinct from CLPs giving rise to T cells. The common myeloid progenitors (CMPs) give rise to granulocyte/macrophage progenitors (GMPs), megakaryocyte/erythroid progenitors (MEPs), and mast-cell and basophil progenitors. A recent publication has shown the existence of shared macrophage and dendritic-cell progenitors (MDPs) [133]. b. The view that CMPs are the shared origin of MEPs and GMPs has recently been challenged by Jacobsen and colleagues [134], who proposed that MEPs are the direct progeny of ST-HSCs, whereas all myeloid and lymphoid lineages are the progeny of lymphoid-primed multipotent progenitors (LMPPs).

The accumulation is due to uncontained proliferation of the blast cells.

The very fact that blast cells are able to participate in proliferation means that the cell differentiation process has been deregulated. Also, the fact that proliferation is uncontained means that the programming of cell death is aberrant. We can view the processes of cell differentiation, proliferation and apoptosis as directly affecting distinct subsystems of the whole. These subsystems are necessarily intermeshed and share many components, so they are not separate, but they are distinct subsystems (the certainty of their distinctness is based on their distinct functions).

(As an aside, it is interesting that the definitions of AML and CML (as well as other MPDs) include markedly different subsets of cells that proliferate. In AML, as mentioned earlier, it is immature blast cells that are overproduced. CML, on the other hand, is characterized by overproduction of mature (relatively well differentiated) cells [132].)

The blast cells that accumulate in AML would probably include, according to the Stirewalt and Radich hierarchy (Figure 5.1 (I)), committed stem cells, early myeloid progenitors, and colony forming units E, MK, GM and B, based on various definitions of AML.

We can draw a boundary on what we refer to as the whole system, based on how global and systemic we want our thinking to be. The boundary can be at the level of the hematopoietic system, or at the organism level, or we can expand it to include the environmental influences on the organism. This boundary depends on what questions we are interested in asking of the system, i.e. it is context based.

It should be noted that defining the problem as affecting three finely interwoven subsystems, where each of the subsystems is defined in relation to key cellular processes, is useful in starting the problem definition in a systemic manner. It is by no means complete. A complex plethora of changes brings about the deregulation that is needed to bypass all the checks and balances that otherwise serve so well to protect us against most forms of malignancy.


To understand exactly how this relates to what is being explored at the cell signaling level, we combine the top-down approach that was started above with a bottom-up approach that is defined in the Cell signaling section (Section 5.4), and try to link the two together. Again, it must be remembered that the terms ‘top-down’ and ‘bottom-up’ are context dependent.

5.3.1. Cancer stem cells

There is increasing evidence to show that malignancies are sustained by cancer stem cells, a tumor subpopulation that maintains the uncontrolled production of less malignant neoplastic daughter cells (blasts). Cancer stem cells seem to share with normal stem cells the important functions of self-renewal, differentiation and long-term survival. Hence it is believed that a similar set of genes controls both normal and tumor stem cells.

Cancer stem cells are of great clinical relevance, since their ‘stemness’ properties probably help them evade conventional anticancer therapies that are designed to target rapidly cycling and highly proliferating cancer blasts. This inability to eradicate cancer stem cells might be responsible for the disease relapses that are frequently observed in patients with cancer.

AML was the first cancer type for which evidence for the existence of a cancer stem cell was generated. A clear stem-cell-like hierarchy was identified in mouse xenograft models [132]. Depending on the transforming event, cancer stem cells arise either from normal hematopoietic stem cells or from committed progenitor cells. However, the overall molecular events involved in the formation of cancer stem cells are still poorly understood. With respect to the Flt3 signaling pathway, all that can be said is that the block of differentiation that is seen in AML is probably due to gene transcription downstream of the pathway, but not exclusive to it. Literature on the involvement of Flt3 signaling with important transcription factors such as PU.1 and c/EBP-alpha (both of which are intimately involved in the hematopoietic cell differentiation process) is still very incomplete.

Reference [6] presents evidence that Flt3 signaling is related to “pre-cancerous stem cells”. The paper discusses the theory that tumor cells themselves may be the progenitors for tumor vasculature, and shows that under the influence of cytokines including Flt3, pre-cancerous stem cells are involved in vasculogenesis (which is required for angiogenesis).

5.4. Cell signaling

5.4.1. Flt3 receptor

The Flt3 protein is a tyrosine kinase receptor which is involved in the differentiation, proliferation and apoptosis of hematopoietic cells. It is mainly expressed by early myeloid and lymphoid progenitor cells. Many cells of the hematopoietic system produce the Flt3 ligand (FL), which promotes the dimerization and the activation of the Flt3 receptor [23]. The activated receptor then activates the PI3K-Akt-mTOR and the Ras-Raf-MEK-ERK signaling cascades downstream [23,135,136].

A mutation of the Flt3 receptor, referred to as Internal Tandem Duplication (ITD), is the most common genetic lesion in AML [133-139]. In Flt3 ITD, 300-400 base pairs in the exons encoding the juxtamembrane domain of the Flt3 receptor are duplicated in tandem. The protein that results from this faulty genetic code leads to constitutive activation of the signaling pathway downstream of the receptor, and possibly to AML. The mutation was first described by Nakao et al. [137].

The structure of the Flt3 receptor and its activation mechanism are shown in Figure 5.2, in order to give a clearer picture of Flt3 biology. The Flt3 receptor has five immunoglobulin-like domains (denoted by E in Figure 5.2), one transmembrane domain (TM), one juxtamembrane domain (JM) and two tyrosine kinase domains (K) that are linked by a kinase insert (KI). (G represents a glycosyl group, which is not part of the receptor.)

Figure 5.2. Flt3 receptor structure and activation mechanism [23]

Most unstimulated Flt3 receptors exist as monomers in the plasma membrane. In this inactive state, the regulatory domains of the receptor (in this case Flt3’s JM) inhibit the activity of its catalytic domains (i.e. its tyrosine kinase domains, or Ks). After stimulation with the Flt3 ligand (FL), membrane-bound Flt3 quickly changes conformation, forming a homodimer and exposing phosphoryl acceptor sites in its tyrosine kinase domain [138]. Dimerization stabilizes this conformational change, which further enhances the activation of the receptor [139]. Phosphorylation of Flt3 occurs within 5-15 minutes of Flt3L binding [23]. The FL-Flt3 complex is then rapidly internalized (internalization starts within 5 minutes and reaches a maximum after 15 minutes) [138]. Degraded by-products of the FL-Flt3 complex are seen as early as 20 minutes after stimulation [138]. This shows how rapid the entire process of receptor activation and turnover is in response to FL.

It should be noted that a major difference between the activation mechanisms of Flt3R and EGFR (Epidermal Growth Factor Receptor) arises because EGF exists as a monomer in its native state, whereas FL exists as a dimer. Hence, while EGFR binds an EGF monomer [35,140] and the ligand-bound receptors then dimerize, the Flt3 receptor monomers and the FL dimer all come together in some way that the literature does not describe. The precise activation mechanism is unclear, but it is different from EGFR activation. It is not known whether the FL dimer first binds to one Flt3 receptor, somehow causing a change in conformation in the receptor and thereby attracting another unstimulated receptor in the vicinity, or whether both receptors come together with the FL dimer in one single step, or whether there is some other mechanism.

The precise steps involved in the production of the Flt3 ligand are also not known. The concentration of Flt3 ligand is very low in normal adult serum, but very high in cells in the neighborhood of early progenitor hematopoietic cells [23]. This indicates that a paracrine feedback loop may control Flt3 activation. Additionally, co-expression of both Flt3 and Flt3 ligand in hematopoietic cells indicates that an autocrine feedback loop may function in some cells. Knowledge of the biology of Flt3 receptor activation is very incomplete at this point. Autocrine/paracrine regulatory mechanisms play a role in the pathophysiology of hematologic malignancies [141], and should be studied in more detail for a better understanding.

The activated Flt3 receptor then activates two major pathways downstream- i.e., the Ras-Raf-MEK-ERK pathway, and the PI3K-Akt-mTOR pathway. Both pathways affect a whole host of downstream processes at the signaling level, and they also are implicated in many different cellular processes, including cell differentiation, proliferation and apoptosis.

5.4.2. Ras-Raf-MEK-ERK pathway

The Ras-Raf-MEK-ERK signaling cascade is activated by the phosphorylated Flt3 receptor. Flt3 interacts with the Shc coupling protein and activates it. Active Shc recruits the GTP (guanosine triphosphate) exchange complex Grb2/Sos, with components Grb2 (Growth factor receptor bound-2), an adaptor protein, and Sos (Son of Sevenless), a guanine nucleotide exchange factor. This results in the loading of the membrane-bound Ras with a GTP molecule. Ras-GTP then recruits Raf to the membrane and effects its phosphorylation [142]. Once Raf is active, the MAP kinase cascade has been set in motion (it consists of the three protein kinases Raf, MEK and ERK1/2, each of which phosphorylates the next one via its kinase activity). The SHIP molecule (SH2-domain-containing inositol phosphatase) has been shown to be a negative regulator of MAPK activation, since it binds competitively to phosphorylated Shc proteins and prevents Shc from associating with Grb2, for instance [143].
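The kinase-to-kinase hand-off described above can be illustrated with a deliberately small sketch. It assumes a single activation step per tier (the full model distinguishes singly and doubly phosphorylated MEK and ERK), lumps each phosphatase into one first-order constant, and uses made-up rate constants; the function name `simulate_cascade` and every number in it are illustrative assumptions, not values from the model.

```python
def simulate_cascade(raf_active, t_end=100.0, dt=0.01):
    """Forward-Euler integration of a toy Raf* -> MEK -> ERK cascade.

    Every number below is a hypothetical placeholder, not a fitted constant.
    """
    k_mek, k_erk = 0.5, 0.5   # activation rate constants (assumed)
    d_mek, d_erk = 0.1, 0.1   # lumped phosphatase rate constants (assumed)
    mek, mek_p = 1.0, 0.0     # inactive / active MEK (normalized totals)
    erk, erk_p = 1.0, 0.0     # inactive / active ERK (normalized totals)
    for _ in range(int(round(t_end / dt))):
        # Mass-action net fluxes: each active kinase phosphorylates the next tier.
        v_mek = k_mek * raf_active * mek - d_mek * mek_p
        v_erk = k_erk * mek_p * erk - d_erk * erk_p
        mek, mek_p = mek - dt * v_mek, mek_p + dt * v_mek
        erk, erk_p = erk - dt * v_erk, erk_p + dt * v_erk
    return mek_p, erk_p


if __name__ == "__main__":
    print(simulate_cascade(1.0))  # active MEK and ERK with Raf* present
    print(simulate_cascade(0.0))  # no Raf*: nothing downstream is activated
```

With no active Raf the downstream tiers stay at zero, which is the sense in which the cascade is "set in motion" only once Raf is active.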

Figure 5.3. MAPK core processes.

The growth factor ligand (in our case, a cytokine) binds to its transmembrane receptor and activates tyrosine kinases inherent in the receptor molecule. A series of protein-protein complexes form, involving Shc, Grb2, and SOS. Subsequent to this, GDP-Ras is converted to GTP-Ras, which activates Raf, MEK and then MAP kinase. MAP kinases phosphorylate a number of transcription factors. Adapted from Riken Bioresource Center DNA Bank (by Dr. Koji Nakade).

It is important to note that the literature on the activation of Ras by the activated Flt3 ligand-receptor tetramer does not mention a Shc-independent activation route. However, since the study of Flt3 signal transduction is still expanding, it was decided to model Flt3 activation of Ras with both the Shc-dependent (discussed earlier) and the Shc-independent pathways. Another reason for this is that if it is later discovered that there is no Shc-independent activation of Ras, then all the rate constants characterizing the Shc-independent piece of the pathway can be set to zero, and we would have a Shc-dependent-only model. So, both from a modeling perspective and from a biological uncertainty standpoint, this is a better modeling choice. One problem that can be foreseen with this approach, however, is that we would increase the sparseness of the system.
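The idea of switching off one route by zeroing its rate constants can be sketched as follows. This is only an illustration: the `Reaction` record, the species names, the two lumped reactions and all numerical values are hypothetical stand-ins, not entries from the actual reaction list.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Reaction:
    name: str
    reactants: tuple      # species consumed; mass-action order = len(reactants)
    rate_constant: float  # hypothetical value, not taken from the model
    route: str            # "shc_dependent" or "shc_independent"


def flux(reaction, conc):
    """Mass-action flux: rate constant times the product of reactant levels."""
    v = reaction.rate_constant
    for species in reaction.reactants:
        v *= conc[species]
    return v


def ras_loading_flux(reactions, conc):
    """Total Ras-GDP -> Ras-GTP flux summed over both routes."""
    return sum(flux(r, conc) for r in reactions)


def switch_off(reactions, route):
    """Return a copy in which every rate constant on `route` is zero."""
    return [replace(r, rate_constant=0.0) if r.route == route else r
            for r in reactions]


REACTIONS = [
    Reaction("Ras loading via Shc*-Grb2-Sos",
             ("R_Shc_Grb2_Sos", "Ras_GDP"), 0.2, "shc_dependent"),
    Reaction("Ras loading via Grb2-Sos",
             ("R_Grb2_Sos", "Ras_GDP"), 0.05, "shc_independent"),
]
CONC = {"R_Shc_Grb2_Sos": 1.0, "R_Grb2_Sos": 1.0, "Ras_GDP": 2.0}

if __name__ == "__main__":
    print(ras_loading_flux(REACTIONS, CONC))
    print(ras_loading_flux(switch_off(REACTIONS, "shc_independent"), CONC))
```

The reaction list keeps its length after the switch; only the constants change, which is why the structure of the model does not have to be rebuilt.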

There are several lines of enquiry that can be followed downstream of ERK. Specifically, we can study ERK transcription activity itself. Other candidates for study are the activation of the ELK transcription factor by ERK, the activation of BAD-BclxL by ERK, or the activation of STAT5 by ERK, but while these are important, an exhaustive look at the downstream effects of ERK is not feasible. To show the immediate impact on cell proliferation and the cell cycle specifically, it was decided that the interaction of ERK with the key downstream protein RSK would be investigated. RSK stands for Ribosomal S6 Kinase, which is a family of protein kinases involved in signal transduction. However, in the literature, RSK has come to mean, specifically, the 90 kDa subfamily of all RSKs (molecular weight of 90 kilodaltons). There is another, 70 kDa subfamily of RSKs that is commonly referred to as S6 kinase or S6K (with two mammalian homologues, S6K1 and S6K2). From this point on, we will mean the 90 kDa protein when we say RSK, and the 70 kDa protein when we say S6K. Interestingly, while the 90 kDa RSK is involved in signaling downstream of MAPK, the 70 kDa S6K1 is involved in signaling downstream of PI3K-Akt-mTOR.

Once ERK activates RSK, RSK in turn activates the CREB (cyclic AMP Response Element Binding) protein. CREB is a transcription factor which, among more than 5000 target genes, activates the gene for Cyclin A1 [144-146]. Increased Cyclin A1 means that the cell would have a more rapid transition from the G1 to the S phase of the cell cycle. So here we can see a direct chain of causation which starts with the activation of the Flt3 receptor, proceeds via the MAPK cascade activation to the activation of RSK and CREB, and finally leads to an increase in Cyclin A1, and hence faster exit from the G1 phase to the S phase. This can be linked to increased cell proliferation.

There are a few caveats when taking the above signaling chain into account. First, the literature has conflicting information on the effects of CREB on Cyclin A1. Cheng et al., 2007 [145] assert that CREB activation causes an increase in Cyclin A1, and that Cyclin A1 has a CREB binding site. Conkright et al., 2005 [147] assert that Cyclin A1 does not have a CREB binding site. However, Conkright et al. [147] also agree that CREB-overexpressing cells have increased levels of Cyclin A1. So the effect on the phenotype under both hypotheses is the same: faster progression through the cell cycle, and hence more rapid cell proliferation.

The second caveat deals with the statement in some references [146] that CREB* (activated CREB) transcription activity alone is insufficient to produce/induce AML, showing that there are many unknown factors involved.

5.4.3. PI3K-Akt signaling

The Phosphatidylinositol-3-Kinase (PI3K)/ Akt (Protein Kinase B or PKB)/ mammalian Target of Rapamycin (mTOR) signaling pathway plays a critical role in many functions that are elicited by extracellular stimuli, such as cell proliferation, cell growth (size), glucose metabolism, cell motility and angiogenesis [142]. It has been implicated in the pathogenesis and the progression of a wide variety of neoplasias [148].

PI3K is a family of lipid kinases that have been categorized into at least three distinct subtypes according to their substrate preference and their sequence homologies. We deal with only one of these three subtypes, referred to as Class IA. These are activated by growth factor receptors or, as in the case of Flt3, cytokine receptors, and have two subunits (p110, a catalytic subunit, and either p85 or p55, a regulatory subunit) [142]. With respect to Flt3 signaling, we are specifically interested in PI3K that is composed of p110 and p85.

Wild type Flt3 does not bind directly to the p85 subunit of PI3K, but instead forms complex associations with many proteins, including Grb2, GAB1, GAB2, SHP2, SHIP, CBL and CBLB, that ultimately act on p85 [23,143,149,150].

The PI3K-Akt pathway is relatively well characterized, due to its ubiquitous role in the cell. Akt is a serine/threonine protein kinase that has three closely related, highly conserved isoforms [142]. One of the domains of Akt interacts with the phosphorylated lipid products of PI3K, mainly PI(3,4,5)P3 (phosphatidylinositol 3,4,5-trisphosphate) and to a lesser extent PI(3,4)P2 (phosphatidylinositol 3,4-bisphosphate), synthesized at the plasma membrane [142,151]. Recruitment of Akt to the plasma membrane leads to a conformational change that facilitates its phosphorylation at its Threonine 308 site by PDK1 and at its Serine 473 site by PDK2, not necessarily in that order. PDK1 (3’-phosphoinositide-dependent protein kinase 1) is well characterized, but the identity of PDK2 is controversial, and it has recently been revealed to be mTORC2, or mTOR (mammalian Target of Rapamycin) complex 2. There is no guarantee that it is the only PDK2 (in fact it is probably not). This will be discussed in detail in the mTOR discussion, and it illustrates how the mTOR protein participates in the functioning of two complexes, one of which is upstream of Akt (mTORC2), while the other is downstream of Akt (mTORC1).

Phosphorylated Akt migrates to both the cytosol and the nucleus. So, in fact, phospho-Akt performs functions at the plasma membrane, in the cytosol (or cytoplasm) and in the nucleus. The relative contributions of Akt signaling in each location to the whole remain to be determined [142]. The modeling effort presented in this report treats Akt as a single entity, which functions in the cytosol.

The phosphatases PTEN and SHIP1 play key roles in PI3K-Akt signaling. PTEN (Phosphatase and Tensin Homolog) is a tumor suppressor lipid phosphatase, and when it was initially discovered, it seemed likely that PTEN mutations would account for most of the PI3K pathway deregulations observed in tumors. However, later analysis has shown that PTEN mutations account for only a fraction of the molecular changes occurring in the PI3K-Akt pathway [152].

PTEN is a negative regulator of PI3K signaling, and its role in that capacity has been explicitly modeled and will be presented in the comprehensive model. The negative regulation happens primarily by dephosphorylation of the PI(3,4,5)P3 molecule (phosphatidylinositol 3,4,5-trisphosphate) to PI(4,5)P2 (phosphatidylinositol 4,5-bisphosphate).

The phosphatase SHIP1 also dephosphorylates PI(3,4,5)P3, but removes the phosphate from the 5-position instead of the 3-position, creating PI(3,4)P2, which can still work to recruit Akt to the plasma membrane [153].
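The different consequences of the two phosphatases can be made concrete with a small closed-form sketch. It assumes, purely for illustration, that PTEN and SHIP1 each act on PI(3,4,5)P3 as an independent first-order route with a lumped rate constant and that no new PI(3,4,5)P3 is synthesized; the comprehensive model instead uses explicit enzyme-substrate complexes. The function names and the rate constants are hypothetical.

```python
import math


def lipid_pools(t, pip3_0=1.0, k_pten=0.3, k_ship1=0.1):
    """Return (PI(3,4,5)P3, PI(4,5)P2, PI(3,4)P2) at time t.

    PI(3,4,5)P3 decays through two parallel first-order routes
    (rate constants are assumed placeholder values).
    """
    k_total = k_pten + k_ship1
    if k_total == 0.0:
        return pip3_0, 0.0, 0.0
    remaining = pip3_0 * math.exp(-k_total * t)
    converted = pip3_0 - remaining
    pi45p2 = converted * k_pten / k_total    # PTEN removes the 3-phosphate
    pi34p2 = converted * k_ship1 / k_total   # SHIP1 removes the 5-phosphate
    return remaining, pi45p2, pi34p2


def akt_recruiting_lipid(t, **kwargs):
    """Lipid that can still recruit Akt: PI(3,4,5)P3 plus PI(3,4)P2."""
    pip3, _, pi34p2 = lipid_pools(t, **kwargs)
    return pip3 + pi34p2


if __name__ == "__main__":
    print(akt_recruiting_lipid(10.0, k_pten=0.4, k_ship1=0.0))  # PTEN only
    print(akt_recruiting_lipid(10.0, k_pten=0.0, k_ship1=0.4))  # SHIP1 only
```

Under these assumptions, PTEN alone depletes the Akt-recruiting pool, whereas SHIP1 alone only converts one recruiting lipid into another, in line with the distinction drawn above.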

SHIP mutations are common in AML, and both PTEN and SHIP1 play critical roles in leukemogenesis that are only beginning to be understood.

5.4.4. mTOR pathway

A large 250 kDa protein was identified as the target of the drug Rapamycin through seminal studies in yeast and in mammals in the 1990s. In mammals, this protein was named mammalian Target of Rapamycin (mTOR). mTOR is affected by both growth factors and nutrients, and has emerged as a critical effector in many cell signaling pathways that are commonly deregulated in human cancers [154].

Figure 5.4. mTOR’s role in cell signaling [154]

mTOR’s involvement in PI3K-Akt signaling happens via participation in two complexes, mTORC1 and mTORC2. mTORC2 is referred to as the Rapamycin-insensitive complex, and its existence explains why the drug Rapamycin is not effective in all cases. With reference to Figure 5.4 (above), we can see that mTORC1 consists of the subunits RAPTOR, mLST8, PRAS40 and mTOR [155-158]. RAPTOR positively regulates mTOR activity and functions as a scaffold for recruiting mTORC1 substrates [156,157,159]. PRAS40 negatively regulates mTOR activity [155,158]. The molecular function of mLST8 is still ambiguous. Reference [160] asserts that mLST8 is not important for mTORC1 integrity, since the phosphorylation of S6K1 or 4EBP1, both of which are downstream of mTORC1, is not impaired in cells without mLST8. However, the field of mTOR study is still in its infancy, and we cannot conclude at this time that mLST8 has no function whatsoever.

The discovery of the regulation of mTORC1 by TSC1-TSC2 (referred to as the tuberous sclerosis complex) was the first molecular link between mTOR and cancer [161]. mTORC1 combines with TSC1-TSC2, RHEB and activated Akt, yielding activated mTORC1 (mTORC1*) and activated RHEB (RHEB*). The precise mechanism of this reaction is again unknown. The literature is still investigating whether only Akt* (Akt active at Ser 473 site only) or only Akt** (Akt active at Ser 473 and Thr 308 sites) or both are involved in downstream activation.

mTORC1* now interacts with a whole host of proteins downstream, including a complex consisting of the proteins S6K1, eIF3 (eukaryotic translation Initiation Factor 3), 4EBP1 (eukaryotic translation initiation factor 4E binding protein 1) and eIF4E (eukaryotic translation Initiation Factor 4E). Exploring this interaction further, it is found that mTORC1* goes through a complicated process in which it binds again to GTP-loaded RHEB (RHEB-GTP), and then causes the release and activation (not necessarily in that order) of S6K1 and eIF4E. Both of these are very interesting downstream signaling events that can be investigated further. We focus for the present on the translation initiation that eIF4E leads to. eIF4E leads to the translation of Cyclin D1, among others [162].

A literature survey of Flt3 signal transduction, in an attempt to link the signaling to cellular processes, showed that the exploration of the biology was still very incomplete. Moreover, there were many transcription factors that were activated as a result of Flt3 activation (e.g. CREB, STAT5, Bcl2, mTOR, ERK, ELK, NFkB). Following just one of these, CREB, it was discovered that it had more than 5000 target genes. The ubiquity of the effects of the Flt3 signaling pathway then led to the question: would it be possible to follow at least one cause and effect link, starting from the Flt3 receptor, all the way through to the transcription factor, and then to at least one key gene that may provide a clue to Flt3’s involvement in cellular processes? The causal link between Flt3 signaling and the gene encoding Cyclin A1 was followed up, in an attempt to gain a partial understanding of how Flt3 signaling would impact the mitotic cell cycle. This gene is a target of the CREB transcription factor, and this link shows direct evidence of how Flt3 activation can lead to increased G1/S and G2/M transitions. Perhaps this will help us decipher why cells that have not been repaired in the gap phases are allowed to go through DNA synthesis and mitosis, and proliferate.

A number of signaling events had to be ignored in the interests of time, and because the system was becoming too complex to be modeled properly. Some instances, which were studied and probably play key roles, are:

• Transcription of the gene c-myc (this was noted as being particularly important in cell differentiation) [162].
• STAT5 involvement: STAT5a, but not STAT5b, is activated by the Flt3 receptor, and Western blot data was shown to prove this [23,163]. Transcription activity of the STAT5a dimer is not taken into account in our present modeling.
• Involvement of BAD-BclxL: RSK* activates BAD-BclxL ([157,164,165] and Alberts, Molecular Biology of the Cell, textbook [8]).
• ERKP and ERKPP activate the transcription factor ELK, and its transcription activity is not accounted for.
• ERK transcription activity is not accounted for.

These are some of the signaling events that were not modeled, based on feasibility and time constraints. Furthermore, an effort was made initially to link the Flt3 signaling pathway with the transcription factors PU.1, c/EBP-alpha, AML1 and RAR-alpha. However, literature and investigations on these links are almost completely absent.

5.5. Model building

5.5.1. Mathematical model of Flt3 signaling pathway

The mathematical model of the Flt3 signaling pathway was built from several building blocks. This is depicted in Figure 5.6. Schoeberl et al., 2002 [35] presented a very detailed mathematical model of MAPK signaling in response to activation by the EGF receptor. Certain changes need to be made to this model to allow for the fact that the Flt3 ligand is a dimer, unlike EGF; hence the Flt3 receptor activation mechanism is completely different from EGFR activation. However, for present purposes, it was noted that as long as the activated ligand-bound receptor complex has exactly the same dynamics, the downstream signaling will be the same regardless of how the receptor signaling is modeled. For the present, the Schoeberl model of EGFR is exactly replicated. The Schoeberl reaction scheme was taken and the entire pathway was modeled using first principles.

The PI3K-Akt model from Hatakeyama et al., 2003 [140] was adapted. The entire pathway was modeled using first principles. Certain portions of the pathway were modeled more explicitly than in Hatakeyama’s model (e.g. the de-activation of PI(3,4,5)P3 by the phosphatases PTEN and SHIP1).

Mass-action principles were used in modeling. Even the Shc activation and deactivation, which are modeled on the Michaelis Menten assumption (assuming that the enzyme concentration is much lower than the substrate concentration, and hence that the concentration of the substrate bound enzyme, and the unbound enzyme, changes much more slowly than that of the substrate or the product), are simplified into single rate constant reactions. The different assumptions that went into this modeling effort are shown in Figure 5.5.
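The text above does not state how the Michaelis-Menten steps were collapsed into single rate constant reactions. One common possibility, assumed here only for illustration, is the low-substrate limit, in which the Michaelis-Menten rate Vmax·S/(Km + S) approaches the first-order mass-action form (Vmax/Km)·S. The sketch below compares the two; the function names and the values of `v_max` and `k_m` are hypothetical.

```python
def michaelis_menten_rate(substrate, v_max, k_m):
    """Standard Michaelis-Menten rate law."""
    return v_max * substrate / (k_m + substrate)


def single_constant_rate(substrate, v_max, k_m):
    """First-order mass-action form with one lumped constant k = v_max / k_m."""
    return (v_max / k_m) * substrate


def relative_error(substrate, v_max=1.0, k_m=100.0):
    """Relative error of the single-constant form against Michaelis-Menten."""
    exact = michaelis_menten_rate(substrate, v_max, k_m)
    approx = single_constant_rate(substrate, v_max, k_m)
    return abs(approx - exact) / exact


if __name__ == "__main__":
    for s in (1.0, 10.0, 100.0):
        print(s, relative_error(s))
```

Algebraically the relative error equals S/Km, so under this assumption the single-constant form is accurate only while the substrate stays well below Km; whether that holds for the Shc steps is not established here.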

[Figure 5.5 lists the pathway modeling assumptions: across species; in vitro and in vivo; across different cell types; no isoforms unless mentioned.]

Figure 5.5. Pathway computational modeling assumptions.

The way the model was put together is encapsulated in Figure 5.6.

[Figure 5.6 block labels: Flt3 receptor; Schoeberl Ras-Raf-MEK-ERK; Hatakeyama adapted PI3K-Akt; interaction between PI3K and Ras; mTOR signaling; RSK; eIF4E; transcription factor CREB; nucleus.]

Figure 5.6. Model conception.

Schoeberl’s MAPK model from [35] was put together with Hatakeyama’s PI3K-Akt model [140]. The remaining portions of the model were done from the basic biology using mass-action principles.

However, to avoid additional confusing arrows, what has been depicted above does not show the receptor internalization that was modeled for both the MAPK and the PI3K-Akt-mTOR portions of the pathway. The internalization models are detailed in Figure 5.7.

[Figure 5.7 shows two internalization schemes for the (LR*)2-complex, one coated-pit-protein (Prot) dependent and one coated-pit-protein independent, each yielding the internalized (LRi*)2-complex; Proti denotes the internalized Prot.]

Figure 5.7. Biochemical reaction schemes of internalization, adapted from [35].

In Figure 5.7, Prot is the coated-pit protein, L is the ligand, R is the receptor and is internalized either through association with Prot or independent of it. An internalized entity’s name has an ‘i’ following it.

If the entire model were shown in a single diagram, it would be very hard to read. Hence we show the MAPK and PI3K-Akt-mTOR pieces of the pathway separately. For the present, the Schoeberl MAPK model [35] is taken as is. For our purposes, EGFR is replaced by Flt3R and EGF is replaced by Flt3L. Figure 5.8 shows the MAPK portion of the pathway, and Figure 5.9 shows the PI3K-Akt-mTOR portion of the pathway.

The complete Flt3 signaling model has 364 one-way biochemical reactions (reversible reactions were modeled as two irreversible reactions), 197 variables (and hence 197 nonlinear ODEs and 197 initial conditions) and 364 rate constants. Of these, 224 reactions and 94 variables correspond to the MAPK arm of the pathway, and the remaining 140 reactions and 103 variables correspond to the PI3K-Akt-mTOR arm of the pathway.
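How a list of one-way reactions turns into one ODE per variable can be sketched on a toy example. The two reactions below (a single reversible binding step written as two irreversible ones), their rate constants and the helper name `derivatives` are illustrative assumptions; only the reaction and variable counts at the end are taken from the text.

```python
# Each one-way reaction: (rate constant, reactants, products).
# The reversible binding L + R <-> LR is written as two irreversible reactions,
# so it contributes two reactions and two rate constants.
REACTIONS = [
    (0.1, ("L", "R"), ("LR",)),    # forward binding (hypothetical constant)
    (0.05, ("LR",), ("L", "R")),   # reverse unbinding (hypothetical constant)
]


def derivatives(state, reactions=REACTIONS):
    """Return dx/dt for every species, summing mass-action fluxes."""
    dxdt = {name: 0.0 for name in state}
    for k, reactants, products in reactions:
        flux = k
        for name in reactants:
            flux *= state[name]
        for name in reactants:
            dxdt[name] -= flux
        for name in products:
            dxdt[name] += flux
    return dxdt


# Bookkeeping from the text: the two arms add up to the full model.
MAPK = {"reactions": 224, "variables": 94}
PI3K = {"reactions": 140, "variables": 103}
TOTAL = {key: MAPK[key] + PI3K[key] for key in MAPK}

if __name__ == "__main__":
    print(derivatives({"L": 2.0, "R": 3.0, "LR": 1.0}))
    print(TOTAL)
```

Each species gets exactly one derivative entry, which is why the number of ODEs and initial conditions equals the number of variables, while the number of rate constants equals the number of one-way reactions.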

The state space representation of the model is as discussed in Chapter 2, Section 2.5.10.2.

A comprehensive version of the model reactions can be found in Appendix 8.1. Currently the internalized version of the Schoeberl reactions is not shown, and the Schoeberl reactions that are reversible are shown as single reversible reactions, although they were modeled as two irreversible reactions.

[Figure 5.8 is a reaction network diagram. Species are labeled with variable numbers (internalized counterparts in brackets) and reactions with v-numbers. The diagram runs from ligand-receptor binding (L, R, LR, LR*2) through the GAP, Shc, Grb2 and Sos complexes, Ras-GDP/Ras-GTP, Raf, MEK and ERK with their phosphatases, to RSK, CREB and Cyclin A1 (CycA1), with CREBn* in the nucleus.]

Figure 5.8. Schoeberl MAPK model [35] combined with ERKP and ERKPP activation of RSK and downstream (activation of CREB transcription factor, and transcription of the Cyclin A1 gene).

The Schoeberl notation has variable numbers in navy blue (with internalized variables shown in brackets), and reaction numbers shown in dark green, with ‘v’ preceding the number.

[Figure 5.9 is a reaction network diagram. It runs from the activated receptor complex (LR*2 with Grb2, SHP2, Shc, SHIP, GAB1, GAB2, CBL and CBLB) through PI3K, PI(3,4,5)P3, PTEN, SHIP1, PDK1, mTORC2 and Akt, to TSC1-TSC2, RHEB, mTORC1, S6K1, 4EBP1, eIF4E and Cyclin D1. S6* in the nucleus is marked as unmodeled, and the receptor step is marked as modeled differently.]

Figure 5.9. PI3K-Akt-mTOR pathway model. The receptor has to be changed to account for the fact that the Flt3 ligand is a dimer. The notation used for the non-Schoeberl portion of the pathway is light blue for the variable numbers (with internalized variables shown in brackets, and the variable associated with the coated-pit protein shown in orange font) and light green for reaction numbers preceded by a ‘v’. S6* (activated S6 transcription activity) is unmodeled. The role of eIF4E* in translation initiation of Cyclin D1 is shown in one step.

5.5.2. Modularization

Before analyzing the pathway further, the pathway is modularized based on biochemical contiguity, functionality and specific context (Appendix 8.2 shows the possible methods by which modularization can be performed).

Once more, to be able to visually depict the pathway, modularization is shown separately for the MAPK and the PI3K-Akt-mTOR portions, and the block diagram formed from the modules is assembled together.

[Figure 5.10: pathway diagram (image not reproduced). The MAPK pathway of Figure 5.8 is partitioned into the modules Receptor module 1, Shc/Ras, Ras, MAPK and RSK/CREB, with the nucleus marked.]

Figure 5.10. MAPK modules. Biochemical contiguity, functionality and the specific pathway context are combined to form modules. A color is associated with each module, and this is reflected in the block diagram shown in Figure 5.12.

The PI3K-Akt-mTOR modules are shown in Figure 5.11.

[Figure 5.11: pathway diagram (image not reproduced). The PI3K-Akt-mTOR pathway of Figure 5.9 is partitioned into the modules Receptor module 2, PI3K, Akt-mTORC2, Akt-mTORC1, eIF4E/S6 and Translation, with the nucleus marked.]

Figure 5.11. PI3K-Akt-mTOR modules.

The complete block diagram that is assembled from modules of the MAPK and the PI3K-Akt-mTOR pieces of the pathway is shown in Figure 5.12.

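[Figure 5.12: block diagram (image not reproduced). Input: L2. MAPK branch: Receptor module 1, Shc/Ras, Ras, MAPK and RSK/CREB modules, leading through the nucleus to Output 1 (Cyc A1); one link is marked ‘feedback? under study’. PI3K branch: Receptor module 2, PI3K, Akt-mTORC2, Akt-mTORC1, eIF4E/S6 and Translation modules, leading to Output 2 (Cyc D1). Signals labeled between modules include (LR*)2-GAP, (LR*)2-GAP-Shc*-Grb2-Sos-RasGDP, Ras-GTP, ERKP, ERKPP, CREB*, Cyc A1t, LRC1, LRC1-PI3K*, PIP3, Akt*-PI(3,4,5)P3, Akt**-PI(3,4,5)P3 and RHEB-GTP-mTORC1*.]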

Figure 5.12. Block diagram of entire Flt3 signaling pathway. Flt3 ligand dimer (L2) is the system input. Cyclin A1 and Cyclin D1 are the system outputs.

The different modules can be grouped into layers, which are defined by location and function. Layer 1 includes membrane-bound modules, Layer 2 includes modules in the cytoplasm, and Layer 3 includes the nucleus as well as the translation initiation machinery. It can be argued that although translation initiation takes place in the cytoplasm, the process being performed there is quite distinct from signal transduction, and so it belongs in Layer 3 with the nucleus and the transcription machinery.

[Figure 5.13: hierarchical block diagram (image not reproduced). It contains the same modules, signals, input (L2) and outputs (Cyc A1, Cyc D1) as Figure 5.12.]

Figure 5.13. Hierarchical representation of Flt3 signaling pathway modules.

5.5.3. Modeling automation

When dealing with a signaling pathway with a very large number of reactions, the programming effort has to be automated to some extent. The ODE modeling technique was improved and automated in several respects. Manual entry of reaction rates and of ODEs was eliminated. Currently, all reactions need to be assembled in a text file, where each discrete entity (variable) is named consistently throughout the file and enclosed in square brackets. With all reactions in this format, the code assigns variable numbers, and then calculates the reaction rates and the right-hand sides of the ODEs, both as vectors, which can then be used in the ODE model.
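
The automation described above can be illustrated with a minimal sketch. It is written in Python rather than in the language of the thesis code, and it is not the thesis implementation: the function names (parse_reactions, rates, rhs), the rate constants and the initial values are hypothetical and chosen only for illustration. It assumes the reaction format of Appendix A, with species in square brackets, ‘<-->’ for reversible and ‘-->’ for irreversible reactions, and mass-action kinetics.

```python
import re

# A species is any text enclosed in one pair of square brackets.
SPECIES = re.compile(r"\[([^\[\]]+)\]")


def parse_reactions(lines):
    """Assign variable numbers and return (index, reactions).

    index maps species name -> variable number (0-based, order of appearance).
    Each reaction is (reactant_ids, product_ids, reversible).
    """
    index = {}
    reactions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        reversible = "<-->" in line
        lhs, rhs_text = line.split("<-->" if reversible else "-->")
        sides = []
        for side in (lhs, rhs_text):
            ids = [index.setdefault(name, len(index))
                   for name in SPECIES.findall(side)]
            sides.append(ids)
        reactions.append((sides[0], sides[1], reversible))
    return index, reactions


def rates(x, reactions, kf, kr):
    """Mass-action net rate of every reaction, as a vector."""
    v = []
    for j, (reactants, products, reversible) in enumerate(reactions):
        forward = kf[j]
        for i in reactants:
            forward *= x[i]
        backward = 0.0
        if reversible:
            backward = kr[j]
            for i in products:
                backward *= x[i]
        v.append(forward - backward)
    return v


def rhs(x, reactions, kf, kr):
    """Right-hand sides of the ODEs, as a vector."""
    dxdt = [0.0] * len(x)
    for (reactants, products, _), vj in zip(reactions, rates(x, reactions, kf, kr)):
        for i in reactants:
            dxdt[i] -= vj
        for i in products:
            dxdt[i] += vj
    return dxdt


# First three reactions of the Schoeberl MAPK list in Appendix A.
text = """
[R]+[L] <--> [L-R]
[L-R]+[L-R] <--> [(L-R)2]
[(L-R)2] <--> [(L-R*)2]
"""
index, reactions = parse_reactions(text.splitlines())

# Hypothetical rate constants and initial values, for illustration only.
kf = [0.5, 1.0, 1.0]
kr = [0.1, 0.1, 0.1]
x0 = [0.0] * len(index)
x0[index["R"]] = 1.0
x0[index["L"]] = 2.0
print(index)
print(rhs(x0, reactions, kf, kr))
```

Because variable numbers are assigned from the names, a species that is spelled inconsistently in the text file silently becomes a new variable, which is why consistent naming throughout the file matters.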

5.5.4. Future work

Future work includes model calibration. The logic that was used in model calibration is shown in the following flow chart. Model calibration is in progress at this point, and while the peak values for Schoeberl MAPK are similar, the dynamics are not being replicated at this time.

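[Flow chart (image not reproduced). The checks are applied in sequence:

1. Is the reaction structure correct? If not (N), check the structure.

2. Are the ICs correct? If not (N), check the ICs.

3. Are the rate constants correct? If not (N), check the rate constants.

4. Any other errors? Rectify.]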

Figure 5.14. Model calibration logic.

The curves for the activated ligand-receptor complex and phospho-Shc obtained from the model with current values are shown in Figure 5.15 (A). The model simulations are currently NOT IN AGREEMENT WITH SCHOEBERL’S MODEL.

[Figure 5.15: plots (image not reproduced). Panels A and B show (LR*)2 and Total Shc-P in molecules/cell (scale ×10^4) against time in minutes, from 0 to 60 min.]

Figure 5.15. (A) Flt3 model MAPK simulations (B) Simulations from Schoeberl MAPK paper.

The different curves in the Schoeberl MAPK simulations are for different values of EGF input to the model. The MAPK portion of the Flt3 signaling model is clearly not in agreement with the Schoeberl MAPK model at this point, although the peak values of the curves are comparable. The reason why the curves in Figure 5.15 (A) keep increasing (as in the case of (EGF-EGFR*)2) or stay constant (as in the case of Total Phospho-Shc) is currently being investigated.

• Model calibration is in progress.

• Model validation

• Mapping ranges of parameters to a kernel of different time profiles, which is then mapped to different behavior categories

• Coordination analysis, or the search for a coordinator or a coordination process

o The coordinator may give us insight into the functioning of the network

o Practically, finding the coordinator can help point out to the experimenter which proteins along the pathway are part of the coordination process, and hence have higher significance, and which ones can be safely ignored. This reduces the number of molecules that the experimentalist would have to measure when faced with a daunting pathway or network.

Once all the entities in the pathway are organized hierarchically, this organization can be used to rank them in order of the value of the information that they provide.

6. CONCLUSION

This thesis presents a methodology to extract dynamic time curves from statically sampled flow cytometry data. The methodology is presented in the context of a hierarchical, complex systems approach to biology. Time profile data thus generated is used in the calibration of a model of the cell cycle control system.

6.1. Thesis review

Chapter 1 introduces the thesis, and deals primarily with some of the problems that need to be addressed in systems biology. This leads us to an understanding of the importance of system dynamics in the systems biology approach as proposed by us. Chapter 2 gives us the background for the systems approach. This includes discussions on modeling, and an in-depth discussion on the role of data in computational models. Included in this context are models of measurement processes (flow cytometry and Western Blotting). Chapter 3 introduces the upstream signaling pathway that actuates the cell cycle control system downstream. An extensive group of molecular reactions at the signal transduction level is converted into an ODE model and presented here. Model calibration/validation is pending due to time/data availability. Chapter 4 introduces a model of the cell cycle control system. Attempts at calibrating this model are also presented. Chapter 5 presents the methodology for extraction of dynamic information that is embedded in statically sampled cytometry data. The chapter also presents software that helps harness this methodology. Chapter 6 then attempts to integrate the different strands of thought that were explored in the previous Chapters, and focuses on some of the key points that must be explored to continue the work, as well as some of the key issues in complex systems biology that are of immediate relevance. These include a discussion on methods to understand and model the relationship between different hierarchical levels (cross-level causality). There is also a mathematical statement of the problem of understanding the causal links between parameter variation and output variation. This chapter (7) aims to provide a summary of the entire thesis, and then present some conclusions and future work.

First, we understand that combining a top-down systems approach with the prevalent bottom-up (reductionist) approach is necessary. We also realize that research can consist of conceptual advances gained through applying systems concepts to biology, of attempts to update these concepts, or of attempts to update the tools that make such application feasible. This thesis attempts the first and last kinds of research.

6.2. Future work

The future directions for the work presented in this thesis can be discussed under different categories as presented below.

6.2.1. Methodology

Immediate future work includes further development of the data dynamics extraction methodology. The methodology has been tested extensively against two flow cytometry datasets. However, exhaustive testing remains to be performed.

Moreover, the design of the methodology is generic enough that it is not wedded to the data gathering technique (flow cytometry in this instance). So long as a statistically significant sample of the data is available, the methodology is applicable.

6.2.2. Hierarchical modeling

Intracellular signal transduction pathways are only one piece of the puzzle.

The knowledge obtained by effectively modeling the pathway inside the cell must next be combined with the understanding that cells exist in populations, which in turn are organized into tissues. For example, once the intracellular signaling pathway inside blood cells has been modeled, how do we determine how these cells are organized in the circulatory system, and how do we combine this organization with the fact that they are at a certain level in the cell differentiation hierarchy? Moreover, how do we account for cell-cell heterogeneity in our modeling?

6.2.3. Data measurement modeling and error quantification

The systems scientist invariably uses some form of mathematical modeling to approximate biological reality at different levels. Of course mathematical models are only as good as available data (which makes such models computational). This leads us to another key problem in this context - data measurement. It would be valuable, in this regard, to model the data measurement process itself.

Computational models suffer from many issues that are related to data measurement. Data is scarcely available in computational biology models, and when it is, it is far removed from its source. We require absolute concentrations of certain proteins in calibrating most signaling pathway models. However, in most biological measurement techniques, obtaining such absolute measures is quite labor intensive. When it is done, several assumptions are made in the measurement process, and there is a certain level of error associated with the obtained data. Accurately quantifying this error and using it to model the data gathering process is a very interesting next step, and also vitally important to the growth of computational biology as a science. With this in mind, I am interested in modeling measurement processes (e.g. flow cytometry, Western Blotting, ELISA, etc.).

6.2.4. Future directions

Future directions include further development and testing of the methodology for time profile data extraction that is the main contribution of this thesis. While this methodology has been rigorously tested for two cell lines, it would be very educational to use it on many more data sets. Additionally the methodology is not limited to flow cytometry data, but can be used wherever we have a statically sampled dataset where a cell cycle time element is embedded in the data. Testing the methodology on other such data is also an area for future research.

There is also the need for exhaustive process modeling, and understanding where noise enters the measurement process, and being able to precisely quantify this noise.

Immediate future work also includes measurement of several proteins along the MAPK and the PI3K-Akt-mTOR pathways, and a full systems biology iterative cycle for this pathway (model calibration, validation, modularization, coordination analysis, and simulation of fresh biological scenarios).

There is also an urgent need to further understand the different types of cell population models. The work in this thesis restricts itself to models of intracellular signaling, but this is only one level in the hierarchy. So also, there is an urgent need to integrate the different levels in a multilevel hierarchical system, and to understand their coexistence and the flow of causality in them.

6.3. Additional thoughts

All of the flow data work involves cell populations. Isn't each of the distributions somehow an expression of a cell population model? If the distribution of a protein over a population of cells has a certain form, then what does this say about the interactions between the cells themselves? This is a tantalizing area for immediate future research.

One way forward could be in trying to answer a related question: What is the one variable or unique or non-unique combination of variables that serves as a proxy for the cell itself when it comes to the context in which we are interested in cell-cell interaction? Is it possible for such a proxy to exist, or is this a dead end?

Another issue worth pondering is the relationship between parameter variation and output variation. Can we theoretically determine, for mass-action systems at least, how the variation in a parameter is ‘translated’ into variation in some chosen combination of state variables that has been declared as output? Is it at least possible to do this numerically? Is it possible for us to say that any time profile that is pulled from the data distribution ‘cloud’ represented in Chapter 5, Figure 5.13(A) has specific bounds determined by that distribution? These are a few of the very interesting questions that need further investigation.
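
As a small illustration of the numerical version of this question, the following Python sketch perturbs one rate constant of a toy mass-action system, R + L <--> LR, and records how the chosen output, [LR] at a fixed time, responds. It is only a sketch under stated assumptions and not a result of this thesis: the system, the parameter values, the forward Euler integration and the function names are all hypothetical.

```python
def simulate(kf, kr=0.1, r0=1.0, l0=2.0, t_end=5.0, dt=0.001):
    """Integrate R + L <--> LR by forward Euler; return the output [LR] at t_end."""
    r, l, c = r0, l0, 0.0
    for _ in range(int(round(t_end / dt))):
        v = kf * r * l - kr * c  # net mass-action rate
        r -= dt * v
        l -= dt * v
        c += dt * v
    return c


def normalized_sensitivity(kf, rel_step=0.01):
    """Central-difference estimate of d(ln output) / d(ln kf)."""
    up = simulate(kf * (1.0 + rel_step))
    down = simulate(kf * (1.0 - rel_step))
    return (up - down) / (2.0 * rel_step * simulate(kf))


def output_range(kf_values):
    """Smallest and largest output over a sampled range of the parameter."""
    outputs = [simulate(kf) for kf in kf_values]
    return min(outputs), max(outputs)


# Hypothetical nominal value and sampled range of the forward rate constant.
print(normalized_sensitivity(0.5))
print(output_range([0.25, 0.5, 1.0]))
```

The same scan can in principle be repeated for any parameter and any combination of state variables declared as output. It yields local, sampled information only, and does not answer the theoretical question posed above.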

APPENDIX A. Model reactions

Mass action modeling, as shown in the example from Chapter 2, Section 2.6.10.1, is used here (Schoeberl model reactions of internalization not currently shown).

Schoeberl MAPK

[R]+[L] <--> [L-R]

[L-R]+[L-R] <--> [(L-R)2]

[(L-R)2] <--> [(L-R*)2]

[(L-R*)2-GAP-Grb2]+[Prot] <--> [(L-R*)2-GAP-Grb2-Prot]

[(L-R*)2-GAP-Grb2-Prot] --> [(L-Ri*)2-GAP-Grb2]+[Proti]

[R] <--> [Ri]

[(L-R*)2] --> [(L-Ri*)2]

[(L-R*)2]+[GAP] <--> [(L-R*)2-GAP]

[(L-R*)2-GAP] --> [(L-Ri*)2-GAP]

[Ri]+[Li] <--> [L-Ri]

[L-Ri]+[L-Ri] <--> [(L-Ri)2]

[(L-Ri)2] <--> [(L-Ri*)2]

Source --> [R]

[(L-Ri*)2]+ [GAP] <--> [(L-Ri*)2-GAP]

[Proti] --> [Prot]

[(L-R*)2-GAP]+[Grb2] <--> [(L-R*)2-GAP-Grb2]

[(L-R*)2-GAP-Grb2]+[Sos] <--> [(L-R*)2-GAP-Grb2-Sos]

[(L-R*)2-GAP-Grb2-Sos]+[Ras-GDP] <--> [(L-R*)2-GAP-Grb2-Sos-Ras-GDP]

[(L-R*)2-GAP-Grb2-Sos-Ras-GDP] <--> [(L-R*)2-GAP-Grb2-Sos]+[Ras-GTP]

[Ras-GTP*]+[(L-R*)2-GAP-Grb2-Sos] <--> [(L-R*)2-GAP-Grb2-Sos-Ras-GTP]

[(L-R*)2-GAP-Grb2-Sos-Ras-GTP] <--> [(L-R*)2-GAP-Grb2-Sos]+[Ras-GDP]

[(L-R*)2-GAP]+[Shc] <--> [(L-R*)2-GAP-Shc]

[(L-R*)2-GAP-Shc] <--> [(L-R*)2-GAP-Shc*]

[(L-R*)2-GAP-Shc*]+[Grb2] <--> [(L-R*)2-GAP-Shc*-Grb2]

[(L-R*)2-GAP-Shc*-Grb2]+[Sos] <--> [(L-R*)2-GAP-Shc*-Grb2-Sos]

[(L-R*)2-GAP-Shc*-Grb2-Sos]+[Ras-GDP] <--> [(L-R*)2-GAP-Shc*-Grb2-Sos-Ras-GDP]

[(L-R*)2-GAP-Shc*-Grb2-Sos-Ras-GDP] <--> [(L-R*)2-GAP-Shc*-Grb2-Sos]+[Ras-GTP]

[Raf]+[Ras-GTP] <--> [Raf-Ras-GTP]

[Raf-Ras-GTP] <--> [Raf*]+[Ras-GTP*]

[Ras-GTP*]+[(L-R*)2-GAP-Shc*-Grb2-Sos] <--> [(L-R*)2-GAP-Shc*-Grb2-Sos-Ras-GTP]

[(L-R*)2-GAP-Shc*-Grb2-Sos-Ras-GTP] <--> [(L-R*)2-GAP-Shc*-Grb2-Sos]+[Ras-GDP]

[(L-R*)2-GAP-Shc*-Grb2-Sos] <--> [(L-R*)2-GAP]+[Shc*-Grb2-Sos]

[Shc*-Grb2-Sos] <--> [Grb2-Sos]+[Shc*]

[(L-R*)2-GAP-Grb2-Sos] <--> [(L-R*)2-GAP]+[Grb2-Sos]

[Grb2-Sos] <--> [Grb2] +[Sos]

[Shc*] <--> [Shc]

[(L-R*)2-GAP-Shc*] <--> [(L-R*)2-GAP]+[Shc*]

[Shc*]+[Grb2] <--> [Shc*-Grb2]

[(L-R*)2-GAP-Shc*-Grb2] <--> [(L-R*)2-GAP]+[Shc*-Grb2]

[Shc*-Grb2]+[Sos] <--> [Shc*-Grb2-Sos]

[(L-R*)2-GAP-Shc*] + [Grb2-Sos] <--> [(L-R*)2-GAP-Shc*-Grb2-Sos]

[Raf*]+[Phosphatase1] <--> [Raf*-Phosphatase1]

[Raf*-Phosphatase1] --> [Raf]+[Phosphatase1]

[MEK] + [Raf*] <--> [MEK-Raf*]

[MEK-Raf*] --> [MEK-P] +[Raf*]

[MEK-P]+[Raf*] <--> [MEK-P-Raf*]

[MEK-P-Raf*] --> [MEK-PP] + [Raf*]

[MEK-PP]+[Phosphatase2] <--> [MEK-PP-Phosphatase2]

[MEK-PP-Phosphatase2] --> [MEK-P] + [Phosphatase2]

[MEK-P]+[Phosphatase2] <--> [MEK-P-Phosphatase2]

[MEK-P-Phosphatase2] --> [MEK]+[Phosphatase2]

[ERK]+[MEK-PP] <--> [ERK-MEK-PP]

[ERK-MEK-PP] --> [ERK-P]+[MEK-PP]

[ERK-P]+[MEK-PP] <--> [ERK-P-MEK-PP]

[ERK-P-MEK-PP] --> [ERK-PP]+[MEK-PP]

[ERK-PP]+[Phosphatase3] <--> [ERK-PP-Phosphatase3]

[ERK-PP-Phosphatase3] --> [ERK-P]+[Phosphatase3]

[ERK-P] + [Phosphatase3] <--> [ERK-P-Phosphatase3]

[ERK-P-Phosphatase3] --> [ERK]+[Phosphatase3]

[Ri] --> [Rideg]

[Li]--> [Lideg]

[(L-Ri*)2] --> [(L-Ri*)2deg]

PI3K-Akt-mTOR (complete list of reactions)

[(LR*)2]+[Grb2] --> [(LR*)2-Grb2]

[(LR*)2-Grb2] --> [(LR*)2]+[Grb2]

[(LR*)2-Grb2]+[SHP2] --> [(LR*)2-Grb2-SHP2]

[(LR*)2-Grb2-SHP2] --> [(LR*)2-Grb2]+[SHP2]

[(LR*)2-Grb2-SHP2] --> [(LR*)2-Grb2-SHP2*]

[(LR*)2-Grb2-SHP2]+[Shc] --> [(LR*)2-Grb2-SHP2-Shc]

[(LR*)2-Grb2-SHP2-Shc] --> [(LR*)2-Grb2-SHP2]+[Shc]

[(LR*)2-Grb2-SHP2-Shc] --> [(LR*)2-Grb2-SHP2]+[Shc*]

[(LR*)2-Grb2-SHP2*]+[Shc] --> [(LR*)2-Grb2-SHP2*-Shc]

[(LR*)2-Grb2-SHP2*-Shc] --> [(LR*)2-Grb2-SHP2*]+[Shc]

[(LR*)2-Grb2-SHP2*-Shc] --> [(LR*)2-Grb2-SHP2*-Shc*]

[SHIP*]+[(LR*)2-Grb2-SHP2*-Shc*] --> [(LR*)2-Grb2-SHP2*-Shc*-SHIP*]

[(LR*)2-Grb2-SHP2*-Shc*-SHIP*] --> [(LR*)2-Grb2-SHP2*-Shc*]+[SHIP*]

[SHIP*] --> [SHIP]

[SHIP] --> [SHIP*]

[(LR*)2-Grb2-SHP2*-Shc*]+[GAB1]+[GAB2]+[CBL]+[CBLB] --> [LRC1]

[LRC1] --> [(LR*)2-Grb2-SHP2*-Shc*]+[GAB1]+[GAB2]+[CBL]+[CBLB]

[LRC1]+[PI3K] --> [LRC1-PI3K]

[LRC1-PI3K] --> [LRC1]+[PI3K]

[LRC1-PI3K] --> [LRC1-PI3K*]

[PI]+[LRC1-PI3K*] --> [PI-LRC1-PI3K*]

[PI-LRC1-PI3K*] --> [PI]+[LRC1-PI3K*]

[PI-LRC1-PI3K*] --> [PI(3,4)P2-LRC1-PI3K*]

[PI(3,4)P2]+[LRC1-PI3K*] --> [PI(3,4)P2-LRC1-PI3K*]

[PI(3,4)P2-LRC1-PI3K*] --> [PI(3,4)P2]+[LRC1-PI3K*]

[PI(3,4)P2-LRC1-PI3K*] --> [PI(3,4,5)P3]+[LRC1-PI3K*]

[PI(3,4,5)P3]+[SHIP1] --> [PI(3,4,5)P3-SHIP1]

[PI(3,4,5)P3-SHIP1] --> [PI(3,4,5)P3]+[SHIP1]

[PI(3,4,5)P3-SHIP1] --> [PI(3,4)P2]+[SHIP1]

[PI(3,4,5)P3]+[PTEN] --> [PI(3,4,5)P3-PTEN]

[PI(3,4,5)P3-PTEN] --> [PI(3,4,5)P3]+[PTEN]

[PI(3,4,5)P3-PTEN] --> [PI(4,5)P2]+[PTEN]

[LRC1-PI3K*]+[Ras-GTP] --> [LRC1-PI3K*-Ras-GTP]

[LRC1-PI3K*-Ras-GTP] --> [LRC1-PI3K*]+[Ras-GTP]

[LRC1-PI3K*-Ras-GTP] --> [LRC1-PI3K*]+[Ras-GTP*]

[Ras-GTP*]+[mTORC2] --> [Ras-GTP*-mTORC2]

[Ras-GTP*-mTORC2] --> [Ras-GTP*]+[mTORC2]

[Ras-GTP*-mTORC2] --> [Ras-GTP*]+[mTORC2*]

[Akt]+[PI(3,4,5)P3] --> [Akt-PI(3,4,5)P3]

[Akt-PI(3,4,5)P3] --> [Akt]+[PI(3,4,5)P3]

[Akt-PI(3,4,5)P3]+[SHIP1] --> [Akt-PI(3,4,5)P3-SHIP1]

[Akt-PI(3,4,5)P3-SHIP1] --> [Akt-PI(3,4,5)P3]+[SHIP1]

[Akt-PI(3,4,5)P3-SHIP1] --> [Akt-PI(3,4)P2]+[SHIP1]

[Akt-PI(3,4,5)P3]+[PTEN] --> [Akt-PI(3,4,5)P3-PTEN]

[Akt-PI(3,4,5)P3-PTEN] --> [Akt-PI(3,4,5)P3]+[PTEN]

[Akt-PI(3,4,5)P3-PTEN] --> [Akt-PI(4,5)P2]+[PTEN]

[Akt-PI(3,4,5)P3]+[mTORC2*] --> [Akt-PI(3,4,5)P3-mTORC2*]

[Akt-PI(3,4,5)P3-mTORC2*] --> [Akt-PI(3,4,5)P3]+[mTORC2*]

[Akt-PI(3,4,5)P3-mTORC2*] --> [Akt*-PI(3,4,5)P3]+[mTORC2*]

[Akt*-PI(3,4,5)P3]+[PDK1*] --> [Akt*-PI(3,4,5)P3-PDK1*]

[Akt*-PI(3,4,5)P3-PDK1*] --> [Akt*-PI(3,4,5)P3]+[PDK1*]

[Akt*-PI(3,4,5)P3-PDK1*] --> [Akt**-PI(3,4,5)P3]+[PDK1*]

[Akt*-PI(3,4,5)P3]+[TSC1-TSC2]+[RHEB]+[mTORC1] --> [Akt*-PI(3,4,5)P3-TSC1-TSC2-RHEB-mTORC1]

[Akt*-PI(3,4,5)P3-TSC1-TSC2-RHEB-mTORC1] --> [Akt*-PI(3,4,5)P3]+[TSC1-TSC2]+[RHEB]+[mTORC1]

[Akt*-PI(3,4,5)P3-TSC1-TSC2-RHEB-mTORC1] --> [Akt*-PI(3,4,5)P3]+[TSC1-TSC2]+[RHEB*]+[mTORC1*]

[Akt**-PI(3,4,5)P3]+[TSC1-TSC2]+[RHEB]+[mTORC1] --> [Akt**-PI(3,4,5)P3-TSC1-TSC2-RHEB-mTORC1]

[Akt**-PI(3,4,5)P3-TSC1-TSC2-RHEB-mTORC1] --> [Akt**-PI(3,4,5)P3]+[TSC1-TSC2]+[RHEB]+[mTORC1]

[Akt**-PI(3,4,5)P3-TSC1-TSC2-RHEB-mTORC1] --> [Akt**-PI(3,4,5)P3]+[TSC1-TSC2]+[RHEB*]+[mTORC1*]

[RHEB*] --> [RHEB-GTP]

[RHEB-GTP] --> [RHEB*]

[RHEB-GTP]+[mTORC1*] --> [(RHEB-GTP)-mTORC1*]

[(RHEB-GTP)-mTORC1*] --> [RHEB-GTP]+[mTORC1*]

[(RHEB-GTP)-mTORC1*]+[S6K1-eIF3-4EBP1-eIF4E] --> [(RHEB-GTP)-mTORC1*-S6K1-eIF3-4EBP1-eIF4E]

[(RHEB-GTP)-mTORC1*-S6K1-eIF3-4EBP1-eIF4E] --> [(RHEB-GTP)-mTORC1*]+[S6K1-eIF3-4EBP1-eIF4E]

[(RHEB-GTP)-mTORC1*-S6K1-eIF3-4EBP1-eIF4E] --> [(RHEB-GTP)-mTORC1*-eIF3]+[S6K1*]+[4EBP1*-eIF4E]

[4EBP1*-eIF4E] --> [4EBP1*]+[eIF4E*]

[eIF4E*] --> [CycD1]

[S6K1*]+[S6] --> [S6K1*-S6]

[S6K1*-S6] --> [S6K1*]+[S6]

[S6K1*-S6] --> [S6K1*]+[S6*]

[(LR*)2-Grb2]+[Prot]-->[(LR*)2-Grb2-Prot]

[(LR*)2-Grb2-Prot]-->[(LR*)2-Grb2]+[Prot]

[(LR*)2-Grb2-SHP2]+[Prot]-->[(LR*)2-Grb2-SHP2-Prot]

[(LR*)2-Grb2-SHP2-Prot]-->[(LR*)2-Grb2-SHP2]+[Prot]

[(LR*)2-Grb2-SHP2*]+[Prot]-->[(LR*)2-Grb2-SHP2*-Prot]

[(LR*)2-Grb2-SHP2*-Prot]-->[(LR*)2-Grb2-SHP2*]+[Prot]

[(LR*)2-Grb2-SHP2-Shc]+[Prot]-->[(LR*)2-Grb2-SHP2-Shc-Prot]

[(LR*)2-Grb2-SHP2-Shc-Prot]-->[(LR*)2-Grb2-SHP2-Shc]+[Prot]

[(LR*)2-Grb2-SHP2*-Shc]+[Prot]-->[(LR*)2-Grb2-SHP2*-Shc-Prot]

[(LR*)2-Grb2-SHP2*-Shc-Prot]-->[(LR*)2-Grb2-SHP2*-Shc]+[Prot]

[(LR*)2-Grb2-SHP2*-Shc*]+[Prot]-->[(LR*)2-Grb2-SHP2*-Shc*-Prot]

[(LR*)2-Grb2-SHP2*-Shc*-Prot]-->[(LR*)2-Grb2-SHP2*-Shc*]+[Prot]

[(LR*)2-Grb2-SHP2*-Shc*-SHIP*]+[Prot]-->[(LR*)2-Grb2-SHP2*-Shc*-SHIP*-Prot]

[(LR*)2-Grb2-SHP2*-Shc*-SHIP*-Prot]-->[(LR*)2-Grb2-SHP2*-Shc*-SHIP*]+[Prot]

[LRC1]+[Prot]-->[LRC1-Prot]

[LRC1-Prot]-->[LRC1]+[Prot]

[LRC1-PI3K]+[Prot]-->[LRC1-PI3K-Prot]

[LRC1-PI3K-Prot]-->[LRC1-PI3K]+[Prot]

[LRC1-PI3K*]+[Prot]-->[LRC1-PI3K*-Prot]

[LRC1-PI3K*-Prot]-->[LRC1-PI3K*]+[Prot]

[LRC1-PI3K*-Ras-GTP]+[Prot]-->[LRC1-PI3K*-Ras-GTP-Prot]

[LRC1-PI3K*-Ras-GTP-Prot]-->[LRC1-PI3K*-Ras-GTP]+[Prot]

[LRC1-PI3K*-Ras-GTP*]+[Prot]-->[LRC1-PI3K*-Ras-GTP*-Prot]

[LRC1-PI3K*-Ras-GTP*-Prot]-->[LRC1-PI3K*-Ras-GTP*]+[Prot]

[PI-LRC1-PI3K*]+[Prot]-->[PI-LRC1-PI3K*-Prot]

[PI-LRC1-PI3K*-Prot]-->[PI-LRC1-PI3K*]+[Prot]

[PI(3,4)P2-LRC1-PI3K*]+[Prot]-->[PI(3,4)P2-LRC1-PI3K*-Prot]

[PI(3,4)P2-LRC1-PI3K*-Prot]-->[PI(3,4)P2-LRC1-PI3K*]+[Prot]

[(LR*)2-Grb2]-->[(LRi*)2-Grb2]

[(LRi*)2-Grb2]-->[(LR*)2-Grb2]

[(LR*)2-Grb2-SHP2]-->[(LRi*)2-Grb2-SHP2]

[(LRi*)2-Grb2-SHP2]-->[(LR*)2-Grb2-SHP2]

[(LR*)2-Grb2-SHP2*]-->[(LRi*)2-Grb2-SHP2*]

[(LRi*)2-Grb2-SHP2*]-->[(LR*)2-Grb2-SHP2*]

[(LR*)2-Grb2-SHP2-Shc]-->[(LRi*)2-Grb2-SHP2-Shc]

[(LRi*)2-Grb2-SHP2-Shc]-->[(LR*)2-Grb2-SHP2-Shc]

[(LR*)2-Grb2-SHP2*-Shc]-->[(LRi*)2-Grb2-SHP2*-Shc]

[(LRi*)2-Grb2-SHP2*-Shc]-->[(LR*)2-Grb2-SHP2*-Shc]

[(LR*)2-Grb2-SHP2*-Shc*]-->[(LRi*)2-Grb2-SHP2*-Shc*]

[(LRi*)2-Grb2-SHP2*-Shc*]-->[(LR*)2-Grb2-SHP2*-Shc*]

[(LR*)2-Grb2-SHP2*-Shc*-SHIP*]-->[(LRi*)2-Grb2-SHP2*-Shc*-SHIP*]

[(LRi*)2-Grb2-SHP2*-Shc*-SHIP*]-->[(LR*)2-Grb2-SHP2*-Shc*-SHIP*]

[LRC1]-->[LRC1i]

[LRC1i]-->[LRC1]

[LRC1-PI3K]-->[LRC1i-PI3K]

[LRC1i-PI3K]-->[LRC1-PI3K]

[LRC1-PI3K*]-->[LRC1i-PI3K*]

[LRC1i-PI3K*]-->[LRC1-PI3K*]

[LRC1-PI3K*-Ras-GTP]-->[LRC1i-PI3K*-Ras-GTP]

[LRC1i-PI3K*-Ras-GTP]-->[LRC1-PI3K*-Ras-GTP]

[LRC1-PI3K*-Ras-GTP*]-->[LRC1i-PI3K*-Ras-GTP*]

[LRC1i-PI3K*-Ras-GTP*]-->[LRC1-PI3K*-Ras-GTP*]

[PI-LRC1-PI3K*]-->[PI-LRC1i-PI3K*]

[PI-LRC1i-PI3K*]-->[PI-LRC1-PI3K*]

[PI(3,4)P2-LRC1-PI3K*]-->[PI(3,4)P2-LRC1i-PI3K*]

[PI(3,4)P2-LRC1i-PI3K*]-->[PI(3,4)P2-LRC1-PI3K*]

APPENDIX B. Modularization methods

For modularization methods see [29].

APPENDIX C. Single color correction

Fit 1

[Fit 1 plots (image not reproduced): single color intensity (vertical axis) against multicolor intensity (horizontal axis), with linear fits. Panel A2: y = 0.036x + 32.838, R² = 0.9962. Panel B1: y = 0.0559x - 23.574, R² = 0.9959.]

Fit 2

NOTE: Fit 1 was done using Excel and Fit 2 was done using the MATLAB Curvefit toolbox.
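
For readers who want to reproduce a straight-line fit of this kind outside Excel or MATLAB, a minimal least-squares sketch in Python is given below. It is not the procedure used for Fit 1 or Fit 2. The function name linear_fit is hypothetical, and the data points are not measurements: they are generated to lie exactly on one of the lines reported under Fit 1 (y = 0.036x + 32.838), purely so that the recovered coefficients can be checked.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, with R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Illustrative points placed exactly on the reported line (not measured data).
multi_color = [0.0, 2000.0, 4000.0, 6000.0, 8000.0]
single_color = [0.036 * x + 32.838 for x in multi_color]

slope, intercept, r_squared = linear_fit(multi_color, single_color)
print(slope, intercept, r_squared)
```

With real single color and multicolor measurements in place of the illustrative points, R squared falls below 1 and indicates how well a straight line describes the relationship.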

APPENDIX D. CytoSys

Instructions

1. Drag the CytoSys folder onto your desktop or wherever you want to save it.

2. Start MATLAB and navigate to the CytoSys folder using the ‘browse for folder’ button (top right-hand corner of the window; look for ‘…’).

3. Type efmenu INSTALL in the command window.

4. Type CytoSys in the command window (or double-click on CytoSys.fig) and the CytoSys GUI will pop up.

5. Select the file type. For the first run with a particular data set, select ‘Text file’. For faster processing of subsequent runs of that data set, select ‘Matlab file’. Processing may take as long as 5-10 minutes for .txt.

6. Click on ‘Assign Data Dir’ and a dialog appears that allows you to select the Data Set directory. (It is assumed that the Data directory that was selected in the previous step has the structure discussed in Fig…)

7. Click on ‘Load Multi Color’ and the Multi Color files will be loaded.

8. Click on ‘Load Single Color’ and the Single Color files will be loaded.

9. Once the loading is complete, you can select which time curves you want to generate, Multi or Single, and the results should start showing up as MATLAB figure files. Where Gaussian processing is required, click the following buttons in sequence: Initialize, Process Now and Done. Make sure you select the Manual Gfit radio button on top for the processing mode. (If this radio button is already selected, still make sure you select another radio button and then reselect Manual Gfit.)

10. Interpolated data can also be generated. If you do not enter the number of time points for interpolation, the program uses the default of 5000 points, and figures of the interpolated data populate the screen.

11. In the Single color correction section, clicking on the Fit button causes single color vs multicolor plots to be generated.

12. Now the EzyFit menu should show up on each figure. If it does not show up on a figure, click on that figure, and then type efmenu in the command window.

13. Move the cursor onto ‘Show Fit’ and a drop-down menu will appear from which you can select the type of fit you want. Perform this for ALL the single color vs multicolor plots. Once done, click on Apply and the time curves with single color correction will be displayed.

14. The multicolor curves that have undergone single color correction are used for generation of synthetic data. Clicking Generate in the Synthetic data generation section produces the synthetic data.

15. All data matrices generated as results are saved as .mat files in the ‘Results’ directory when the ‘Backup Results’ button is clicked.

16. All data matrices generated as results are saved as .mat files in the ‘Backup’ directory when the user uses the ‘Done’ button to close CytoSys and chooses ‘Yes’ when prompted. Note: Using the X button to close the window does not save results and should be avoided.

17. Currently the saving capabilities are limited. If you want to reload the workspace from the immediately previous dataset, you can restart CytoSys, select .mat file and proceed through all the steps. This doesn’t take much time. However, if you want to reload the workspace from a dataset that you accessed before the previous dataset, you only have the text files to work with at this point. The previous workspace feature is under development.

Precautions & Notes

1. DO NOT CHOOSE 2 GAUSSIANS IN THE GAUSSIAN PROCESSOR GUI.

Currently this does not work.

2. The multicolor variables which have single color counterparts MUST BE included first in the Variable list. We are in the process of providing greater flexibility.

3. It is recommended that after you create a phase_definition.txt using the GUI- you go in and examine the phase definition file. The same applies to

Variable_list.txt. If you are comfortable editing the text files directly, this is the most error proof way. However make sure you DO NOT EDIT ANY OF THE

DELIMITERS.

4. If your dataset does not have single color, do not click on any of the single color buttons.

5. If the GUI has been run for Dataset 1, and you then close it and restart CytoSys, choosing the ‘.mat’ option for loading will load the files for Dataset 1. However, you will have to go through the steps in sequence again. This usually doesn’t take very long.


6. If the GUI has been run for Dataset 1, and you close it and restart CytoSys but now want to use it for Dataset 2, choose the ‘.txt’ option instead and run through all the steps.

7. Currently, when working with a specific dataset, do not close CytoSys until you have finished everything you want to do. Closing the GUI saves results in the Backup folder. If you want to save results in the Results folder, you have to click ‘Save’ in the ‘Save to Results’ section.
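As an illustration of precaution 3, the following Python sketch checks that a hand-edited line still carries the same delimiters as the line it replaces. It is not part of CytoSys. The function names are hypothetical, and treating brackets and braces as delimiters alongside ; , . and : is an assumption. The check applies only when the number of subphases, regions and variables is unchanged and when no entry contains a period or comma of its own.

```python
# Illustrative helper, not part of CytoSys: compare the delimiter "skeleton"
# of a phase definition line before and after a manual edit.

# Assumption: brackets and braces are treated as structural delimiters too.
DELIMITERS = set(";,.:[]{}")


def delimiter_skeleton(line: str) -> str:
    """Return only the delimiter characters of a line, in their original order."""
    return "".join(ch for ch in line if ch in DELIMITERS)


def delimiters_unchanged(original: str, edited: str) -> bool:
    """True when the edit changed entries only and left every delimiter in place."""
    return delimiter_skeleton(original.strip()) == delimiter_skeleton(edited.strip())


# Example line taken from the phase definition file section of this manual.
original = (
    "G1;[G11,G12];{[R22,R23].[R24,R25]};"
    "[0]:{@.@}:{@.@.@.@},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},"
    "[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0}"
)

good_edit = original.replace("R22", "R30")    # only an entry is changed
bad_edit = original.replace(";[0]", ",[0]")   # a semicolon is turned into a comma

print(delimiters_unchanged(original, good_edit))  # True
print(delimiters_unchanged(original, bad_edit))   # False
```

Running such a check on a copy of the file before loading it into CytoSys is one way to catch an accidental delimiter change early.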

Phase definition file

The format of the phase_definition.txt file is as follows. When you open the phase definition file, you will see 4 lines of text, each of which looks like the following (a line may show up on multiple lines if ‘word wrap’ is checked in the Format menu of Notepad):

G1;[G11,G12];{[R22,R23].[R24,R25]};[0]:{@.@}:{@.@.@.@},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0}

Note that the delimiters above are very specific, and when making changes to the file, only the entries should be changed. DO NOT MAKE CHANGES TO ANY OF THE DELIMITERS. The delimiters are explained below:

Semicolon (;)

Used to separate the 4 main portions of the data structure, i.e.:

1. Phase: G1

2. Subphase: [G11,G12]

3. Region: {[R22,R23].[R24,R25]}

4. Processing: [0]:{@.@}:{@.@.@.@},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0}

Comma (,)

Used to separate the subphases within a phase, the regions within a subphase, or the variables to be processed within processing, i.e.:

1. Subphases within a phase: [G11,G12]

2. Regions within a subphase: [R22,R23] and [R24,R25]

3. Variables within processing: [0]:{@.@}:{@.@.@.@},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0}

Period (.)

Used to separate the definitions of the subphases, the individual subphase-wise processing modes, and the individual region-wise processing modes, i.e.:

1. Definitions of subphase: {[R22,R23].[R24,R25]}

2. Individual subphase-wise processing modes: {@.@} and {2.2}

3. Individual region-wise processing modes: {@.@.@.@} and {0.0.0.0}

Colon (:)

Used to separate phase-wise processing from subphase-wise processing, and subphase-wise processing from region-wise processing, e.g.:

[0]:{@.@}:{@.@.@.@}

The other phase definition files have the same format: phase_definition1.txt (corresponding to the single color experiment for Cyclin A2), phase_definition2.txt (corresponding to the single color experiment for Cyclin B1) and phase_definition3.txt (corresponding to the single color experiment for Cyclin B2).
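The delimiter hierarchy described above can be sketched as a small Python parser. This is an illustration only, not the code CytoSys uses. The function name and the dictionary keys are hypothetical, and the processing mode values (0, 2, @) are kept as plain strings because their meaning is not defined in this section.

```python
# Illustrative parser for one line of phase_definition.txt, not part of CytoSys.

def parse_phase_definition(line: str) -> dict:
    """Split one phase definition line along its delimiter hierarchy."""
    # Semicolons separate the 4 main portions of the data structure.
    phase, subphases, regions, processing = line.strip().split(";")

    # Commas separate the subphases within a phase.
    subphase_names = subphases.strip("[]").split(",")

    # Periods separate the subphase definitions; commas separate the
    # regions within each subphase.
    region_groups = [
        group.strip("[]").split(",")
        for group in regions.strip("{}").split(".")
    ]

    # Commas separate the variables within processing; colons separate the
    # phase-wise, subphase-wise and region-wise parts; periods separate the
    # individual modes inside the subphase-wise and region-wise parts.
    variables = []
    for entry in processing.split(","):
        phase_mode, subphase_modes, region_modes = entry.split(":")
        variables.append({
            "phase": phase_mode.strip("[]"),
            "subphases": subphase_modes.strip("{}").split("."),
            "regions": region_modes.strip("{}").split("."),
        })

    return {
        "phase": phase,
        "subphases": subphase_names,
        "region_groups": region_groups,
        "variables": variables,
    }


line = (
    "G1;[G11,G12];{[R22,R23].[R24,R25]};"
    "[0]:{@.@}:{@.@.@.@},[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0},"
    "[2]:{2.2}:{0.0.0.0},[2]:{2.2}:{0.0.0.0}"
)
parsed = parse_phase_definition(line)
print(parsed["phase"], parsed["subphases"], parsed["region_groups"])
print(len(parsed["variables"]), parsed["variables"][0])
```

In the example line, each processing entry holds one phase-wise mode, two subphase-wise modes (one per subphase) and four region-wise modes (one per region), which matches the two subphases and four regions defined earlier in the same line.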


Variable list file

This file has as many lines of text as there are variables, and each line has the following format:

CycA2,9,SINGLE1,3

Here CycA2 gives the name notation for Cyclin A2, whether it is in the multicolor experiment or the single color experiment.

9 gives the column number of Cyclin A2 in the multicolor data file.

SINGLE1 gives the name of the directory in which the single color experiment for Cyclin A2 is stored.

3 gives the column number of Cyclin A2 in the single color data file.

The delimiter used is comma (,).

The other Variable list files have the same format: Variable_list1.txt (Cyclin A2 single color), Variable_list2.txt (Cyclin B1 single color) and Variable_list3.txt (Cyclin B2 single color).
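A line of the Variable list file can be read with a few lines of Python, sketched below. This is an illustration under stated assumptions and not CytoSys code: the class name, field names and function name are hypothetical, and the column numbers are kept exactly as written in the file without any index conversion.

```python
# Illustrative reader for one line of Variable_list.txt, not part of CytoSys.
from typing import NamedTuple


class VariableEntry(NamedTuple):
    name: str                 # name notation of the variable, e.g. CycA2
    multicolor_column: int    # column number in the multicolor data file
    single_color_dir: str     # directory holding the single color experiment
    single_color_column: int  # column number in the single color data file


def parse_variable_line(line: str) -> VariableEntry:
    """Split one comma-delimited line into its four fields."""
    fields = line.strip().split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated fields, got {len(fields)}")
    name, multi_col, single_dir, single_col = fields
    return VariableEntry(name, int(multi_col), single_dir, int(single_col))


entry = parse_variable_line("CycA2,9,SINGLE1,3")
print(entry.name, entry.multicolor_column, entry.single_color_dir, entry.single_color_column)
```

A reader of this kind rejects a line with a missing or extra comma, which is the most likely result of an accidental delimiter edit.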


BIBLIOGRAPHY

1 Sreenath, Sree N, Cho, Kwang-hyun, and Wellstead, Peter. Modeling the dynamics of signalling pathways. In Systems Biology: Essays in Biochemistry. Portland Press, Portland, 2008. 2 Morgan, David O. The cell cycle: Principles of control. Oxford University Press, 2007. 3 Nurse, P. A long twentieth century of the cell cycle and beyond. Cell, 100 (2000), 71--8. 4 Aguda BD, Tang Y. The Kinetic Origins of the Restriction Point in the Mammalian Cell Cycle. Cell Proliferation, 32 (1999), 321-335. 5 Bussell, Katrin. The dynamics of the cycle. Nature Reviews Molecular Cell Biology, 6, 190 (2005). 6 Darzynkiewicz, Zbigniew, Crissman, Harry, and Jacobberger, James W. Cytometry of the cell cycle: cycling through history. Cytometry. Part A, 58 (2004), 21--32. 7 Humphrey, Tim and Brooks, Gavin. Cell cycle control: mechanisms and protocols. Humana Press Inc., 2005. 8 Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. “MOLECULAR BIOLOGY OF THE CELL. Garland science, 2002. 9 Klipp, E and H. Systems Biology in Practice: Concepts, Implementation and Application. Wiley VCH, 2005. 10 Malumbres, Marcos and Barbacid, Mariano. Cell cycle, CDKs and cancer: a changing paradigm. Nature reviews. Cancer, 9 (2009), 153--66. 11 Qu Z, MacLellan WR, Weiss JN. Dynamics of the Cell Cycle: Checkpoints, Sizers, and Timers (2003), 3600-3611. 12 Chen, KC, Calzone, L, Csikasz-Nagy, A, Cross, FR, Novak, B, and Tyson, JJ. Integrative Analysis of Cell Cycle Control in Budding Yeast. Molecular Biology of the Cell, 15, 8 (2004), 3841-3862. 13 Christopher, Renee A, Yoshioka, Naohisa, Dhiman, Anjali, Miller, Robert, Haberichter, Thomas, and Ma, Britta. A systems biology dynamical model of mammalian G 1 cell cycle progression. Molecular Systems Biology (2007), 1--8.

220

14 Csikasz-Nagy A, Battogtokh D, Chen KC, Novak B and Tyson JJ. Analysis of a Generic Model of Eukaryotic Cell-Cycle Regulation. Biophysical Journal, 90, 12 (2006), 4361-4379. 15 Csikasz-Nagy, Attila. Computational systems biology of the cell cycle. Briefings in bioinformatics, 10 (2009), 424--34. 16 Faure A, Naldi A, Chaouiya C, Thieffry D. Dynamical Analysis of a Generic Boolean Model for the Control of the Mammalian Cell Cycle. Bioinformatics, 22, 14 (2006), e124-e131. 17 Haberichter, Thomas, Madge, Britta, Christopher, Renee A et al. A systems biology dynamical model of mammalian G 1 cell cycle progression. Molecular Systems Biology, 3 (2007), 1--8. 18 Ingolia, Nicholas T and Murray, Andrew W. The ups and downs of modeling the cell cycle. Current biology, 14 (2004), R771--7. 19 Novak, B and Tyson, J J. A model for restriction point control of the mammalian cell cycle. Journal of Theoretical Biology, 230 (2004), 563--579. 20 Novak, B and Tyson, JJ. Quantitative Analysis of a Molecular Model of Mitotic Control in Fission Yeast. (1995), 283-305. 21 Weston, Andrea D and Hood, Leroy. Systems Biology , Proteomics , and the Future of Health Care : Toward Predictive , Preventative , and Personalized Medicine Introduction : Paradigm Changes in Health Care. Journal of Proteome Research (2004), 179--196. 22 Hanahan D, Weinberg RA. The Hallmarks of Cancer. Cell, 100 (2000), 57--70. 23 Stirewalt, D L and Radich, J P. The role of FLT3 in haematopoietic malignancies. Nature Reviews. Cancer, 3 (2003), 650--665. 24 Barlogie, B, Raber, M N, Schumann, J et al. Flow cytometry in clinical cancer research. Cancer Research, 43 (1983), 3982--97. 25 Jacobberger, J W, Fogleman, D, and Lehman, M. Analysis of Intracellular Antigens by Flow Cytometry. Cytometry, 7 (1986), 356-- 364. 26 Kaleem, Zahid, Crawford, Eric, Pathan, M H et al. Flow Cytometric Analysis of Acute Leukemias Diagnostic Utility and Critical Analysis of Data MATERIAL AND METHODS. Archives of Pathology, 356-- 364.

221

27 Fell, D. Understanding the Control of Metabolism. Portland Press (1997), 21--32. 28 Savageau, M A. Biochemical : A Study of Function and Design in Molecular Biology. Addison-Wesley (1976), 261--78. 29 Soebiyanto, RP. A complex systems biology approach to understanding signaling pathways in cancer, PhD Thesis. Case Western Reserve University, Cleveland, Ohio, 2008. 30 Sreenath, S N, Soebiyanto, Radina P, Mesarovic, M D, and Wolkenhauer, O. Coordination of crosstalk between MAPK-PKC pathways: an exploratory study. IET Systems Biology, 1 (2007), 33-- 40. 31 Guldberg, C M and Waage, P. Uber die chemische Affinit'at. Prakt. Chem., 19 (1879), 21--32. 32 Aldridge, Bree B, Burke, John M, Lauffenburger, Douglas a, and Sorger, Peter K. Physicochemical modelling of cell signalling pathways. Nature Cell Biology, 8 (Nov. 0, 2006), 1195--203. 33 Saucerman, Jeffrey J and McCulloch, Andrew D. Mechanistic systems models of cell signaling networks: a case study of myocyte adrenergic regulation. Progress in biophysics and molecular biology, 85 (2004), 261--78. 34 Farina, Marcello, Findeisen, Rolf, Bullinger, Eric, Bittanti, Sergio, Allgower, Frank, and Wellstead, Peter. Results Towards Identifiability Properties of Biochemical Reaction Networks. In Proceedings of the 45th IEEE Conference on Decision and Control ( 2006), IEEE, 2104- -2109. 35 Schoeberl, B. et al, "Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors". Nature Biotechnology, Vol., 20 (2002), 6400--6411. 36 Janes KA, Albeck JG, Gaudet S, Sorger PK, Lauffenburger DA, Yaffe MB. A Systems Model of Signaling Identifies a Molecular Basis Set for Cytokine-Induced Apoptosis (2005), 1646-1653. 37 Bhalla, U S and Iyengar, R. Emergent properties of networks of biological signaling pathways. Science, 283, 5400 (1999), 381--387. 38 Chen WW, Schoeberl B, Jasper PJ, Niepel M, Nielsen UB, Lauffenburger DA, Sorger PK. 
Input-output behavior of ErbB signaling pathways as revealed by a mass action model trained against dynamic data. Molecular Systems Biology, 5 (2009), 2391-- 19. 222

39 Soebiyanto, Radina P, Sreenath, Sree N, Qu, Cheng-Kui, Loparo, Kenneth a, and Bunting, Kevin D. Complex systems biology approach to understanding coordination of JAK-STAT signaling. Bio Systems, 90 (2007), 830--42. 40 Hendriks, B S, Griffiths, G J, Benson, R et al. Decreased internalisation of ErbB1 mutants in lung cancer is linked with a mechanism conferring sensitivity to gefitinib. Engineering and Technology, 457--466. 41 Hengl, S, Kreutz, C, Timmer, J, and Maiwald, T. Data-based identifiability analysis of non-linear dynamical models. Bioinformatics (Oxford, England), 23 (2007), 2612--8. 42 Sigal, Alex, Milo, Ron, Cohen, Ariel et al. Variability and memory of protein levels in human cells. Nature, 444 (2006), 28--31. 43 Wolkenhauer, O. Systems Biology: Dynamic Pathway Modeling. Unpublished draft, 2010. 44 Givan AL. Flow Cytometry: An Introduction. In Flow Cytometry Protocols. Humana Press, 2004. 45 Jacobberger JW, Sramkoski RM, Wormsley SB, Bolton WE. Estimation of Kinetic Cell-Cycle-Related Gene Expression in G1 and G2 Phases From Immunofluorescence Flow Cytometry Data. Cytometry, 35 (1999), 2612--8. 46 Schilling, M, Maiwald, T, Bohl, S, Kollman, M, Kreutz, C, Timmer, J, and Klingmuller, U. Computational processing and error reduction strategies for standardized quantitative data in biological networks. FEBS J., 272 (2005), 6400--6411. 47 Lee, Jamie A, Spidlen, Josef, Boyce, Keith et al. NIH Public Access. Cytometry, 73 (2009), 926--930. 48 Novak, B and Tyson, JJ. Modeling the Control of DNA Replication in Fission Yeast. (1997), 9147-9152. 49 Novak, B, Csikasz-Nagy, A, Gyorffy, B, Chen, K, and Tyson, JJ. Mathematical Model of the Fission yeast Cell Cycle with Checkpoint controls at the G1/S, G2/M and Metaphase/Anaphase Transitions. (1998), 185-200. 50 Chen, KC, Csikasz-Nagy, A, Gyorffy, B, Val, J, Novak, B, and Tyson, JJ. Kinetic Analysis of a Molecular Model of the Budding Yeast Cell Cycle. Molecular Biology of the Cell, 11 (2000), 369-391.

223

51 Novak, B, Pataki, Z, Ciliberto, A, and Tyson, JJ. Mathematical Model of the Cell Division Cycle of Fission Yeast. (2001), 277-286. 52 Ciliberto, A, Novak, B, and Tyson, JJ. Mathematical Model of the Morphogenesis Checkpoint in Budding Yeast. Journal of Cell Biology, 163, 6 (2003), 1243-1254. 53 Sveiczer, A, Tyson, JJ, and Novak, B. Modelling the Fission Yeast Cell Cycle (2004), 298-307. 54 Avva J, Weis MC, Soebiyanto RP, Jacobberger JW, Sreenath SN. CytoSys: A Tool for Extracting Cell Cycle-Related Expression Dynamics from Static Data. In Kalyuzhny AE, ed., Signal Transduction Immunohistochemistry: Methods and Protocols. Humana Press, 2011. 55 Heidebrecht, F, Heidebrecht, a, Schulz, I, Behrens, S-E, and Bader, a. Improved semiquantitative Western blot technique with increased quantification range. Journal of immunological methods, 345 (2009), 40--8. 56 Hood, L. Systems Biology and New Technologies Enable Predictice and Preventative Medicine (2004). 57 Jones PA, Baylin SB. The fundamental role of epigenetic events in cancer. Nature Reviews Genetics, 3 (2002), 415-428. 58 Mesarovic M. Systems theory and biology—view of a theoretician. In Mesarovic M, ed., Systems Theory and Biology. Springer-Verlag, New York, 1968. 59 Wolkenhauer, O., Ullah, M., Kolch, W., and Cho, K.-H. Modeling and simulation of intracellular dynamics: choosing an appropriate framework. IEEE Trans Nanobioscience , 3, 3 (2004), 200-7. 60 Mesarovic M. Systems theory and biology. Springer-Verlag, New York, 1968. 61 Rosen R. A means towards a new holism. Science, 161, 3836 (1968), 34-35. 62 Wolkenhauer, O and Mesarovic, M. Feedback dynamics and cell function: Why systems biology is called Systems Biology. Mol. BioSyst., 1 (2005), 14--16. 63 Kitano H. Systems Biology : A Brief Overview. Science, 1662 (2009), 356--364. 64 Kitano, H. Foundations of Systems Biology. MIT Press, 2001.

224

65 Barabasi, Albert-Laszlo and Oltvai, Zoltan N. Network biology: understanding the cell's functional organization. Nature Reviews Genetics, 5 (2004), 101--13. 66 Deckard, A. Preliminary studies on the in silico evolution of biochemical networks. ChemBiochem, 5 (2004), 1423--1431. 67 Sauro, H M. Quantitative analysis of signaling networks. Prog in Biophys and Mol Biol, 86 (2004), 5--43. 68 Deutsch, A. Mathematical and Theoretical Biology: A European Perspective. Science Careers (2004). 69 Mesarovic, M D, Sreenath, S N, and Keene, J D. Search for organising principles: understanding in systems biology. IEE Syst Biol (Stevenage, 1 (2004), 119--27. 70 Nature. Omics Gateway. NPG. 2010. 71 Mesarovic, MD, Macko, D, and Takahara, Y. Theory hierarchical multilevel systems. Academic Press, 1970. 72 Mesarovic MD, Takahara Y. General systems theory: Mathematical foundations. Mathematics in Science and Engineering, Academic Press, New York, 1975. 73 Asthagiri AR, Lauffenburger DA. A computational study of feedback effects on signal dynamics in a mitogen-activated protein kinase (MAPK) pathway model. Biotechnology progress, 17 (2001), 227-- 39. 74 Niepel M, Spencer SL, Sorger PK. Non-genetic cell-to-cell variability and the consequences for pharmacology. Current Opinion in Chemical Biology, 13, 5-6 (2009), 556-561. 75 Altschuler SJ, Wu LF. Cellular Heterogeneity: Do Differences Make a Difference? Cell, 141, 4 (2010), 559-563. 76 Harper JV, Brooks G. The Mammalian Cell Cycle. In CELL CYCLE CONTROL: Methods in Molecular Biology. Humana Press, 2005. 77 Amonlirdviman K, Khare NA, Tree DR, Chen W-S, Axelrod JD, Tomlin CJ. Mathematical modeling of planar cell polarity to understand domineering nonautonomy. Science, 307, 5708 (2005), 423--6. 78 Csikasz-Nagy A, Battogtokh D, Chen KC, Novak B and Tyson JJ. Analysis of a Generic Model of Eukaryotic Cell-Cycle Regulation. Biophysical Journal, 90, 12 (2006), 4361-4379.

225

79 van Riel, NAW. Parameter estimation in models combining signal transduction and metabolic pathways: The dependent input approach. IET Systems Biology (2006). 80 Ellner SP, Guckenheimer J. Dynamic models in Biology. Princeton University Press, 2006. 81 Crampin, E J. Mathematical and computational techniques to deduce complex biochemical reaction mechanisms. Prog in Biophys and Mol Biol, 86 (2004), 77--112. 82 Keener, J P. Spatial modeling In Computational Cell Biology. Springer-Verlag, 2002. 83 Birtwistle MR, Hatakeyama M, Yumoto N, Ogunnaike BA,Hoek JB, Kholodenko BN. Ligand-Dependent Responses of the ErbB Signaling Network: Experimental and Modeling Analyses. Molecular Systems Biology, 3, 144 (2007). 84 Ihekwaba AEC, Broomhead D. Sensitivity Analysis of Parameters Controlling Oscillatory Signalling in the NFkB Pathway: The Roles of IKK and IkBa. Systems Biology, 1 (0 0, 2004), 93--103. 85 Zwolak JW, Tyson J. Globally Optimized Parameters for a Model of Mitotic Control in Frog Egg Extracts. Systems Biology, IEE Proceedings, 152 (2005), 81--92. 86 Panning TD, Watson L. Deterministic Parallel Global Parameter Estimation for a Model of the Budding Yeast Cell Cycle. Journal of Global Optimization, 40 (2008), 719--738. 87 Rodriguez-Fernandez M, Mendes P. A Hybrid Approach For Efficient and Robust Parameter Estimation in Biochemical Pathways. Biosystems, 83 (2006), 28--265. 88 Balsa-Canto E, Peifer M. Hybrid Optimization Method with General Switching Strategy for Parameter Estimation. BMC Systems Biology, 2, 26 (2008). 89 Voit, EO. Computational Analysis of Biochemical Systems. Cambridge University Press, 2000. 90 JacobbergerJW, Frisa PS. Cell Cycle-Related Cyclin B1 Quantification. PLoS One (2009). 91 Kreutz C, Bartolome Rodriguez MM, Maiwald T, Seidl M, Blum HE. An error model for protein quantification. Bioinformatics, 23, 20 (2007), 2747–2753.

226

92 Melamed MR, Lindmo T, Mendelsohn ML. Flow Cytometry and Sorting. Wiley-Liss, 1990. 93 Shapiro HM. Practical Flow Cytometry. Wiley-Liss, 2003. Online at http://www.coulterflow.com. 94 Watson JV. Introduction to Flow Cytometry. Cambridge Press, 1991. 95 Darzynkiewicz Z, Robinson JP, Crissman HA. Methods in Cell Biology: Cytometry. Academic Press, 2001. 96 Darzynkiewicz Z, Robinson JP, Crissman HA. Methods in Cell Biology: Cytometry. Academic Press, 2001. 97 Gray JW, Darzynkiewicz Z. Techniques in Cell Cycle Analysis. Humana Press, 1987. 98 Cubeddu R, Comelli D, D'Andrea C, Taroni P, Valentini G. Time resolved fluorescence imaging in biology and medicine. Journal of Physics D: Applied Physics, 35 (2002), R61. 99 Schilling M, Maiwald T, Bohl S, Kollmann M, Kreutz C, Timmer J, Klingmuller U. Computational processing and error reduction strategies for standardized quantitative data in biological networks. FEBS J., 272, 24 (2005), 6400-6411. 100 Jacobberger JW, Frisa PS, Sramkoski RM, Stefan T, Shults KE, Soni DV. A New Biomarker for Mitotic Cells. Cytometry Part A, 73A (2008), 5-15. 101 Jacobberger JW. Intracellular antigen staining. Quantitative immunofluorescence Methods, 2 (1991), 207-218. 102 Jacobberger JW. Flow cytometric analysis of intracellular protein epitopes. In Stewart C, Nicholson K, ed., Immunophenotyping. Wiley-Liss, Inc, 2000. 103 Jacobberger JW. Stoichiometry of immunocytochemical staining reactions. In Methods in cell biology: Cytometry, Third Edition. Academic Press, 2001. 104 Jacobberger JW, Sramkoski RM, Stefan T. Multiparameter cell cycle analysis. In Hawley TS, Hawley RG, ed., Methods in Molecular Biology: Flow Cytometry Protocols. Springer (Humana Press), In Press.

227

105 Nayak S, Salim S, Luan D, Zai M, Varner JD. A Test of Highly Optimized Tolerance Reveals Fragile Cell-Cycle Mechanisms are Molecular Targets in Clinical Cancer Trials. PLoS One, 3, 4 (2008), e2016. 106 Tyson JJ, Novak B. Temporal Organization of the Cell Cycle (2008), R759-R768. 107 McKinley M, O'Loughlin V. Human Anatomy. McGraw-Hill Science/Engineering/Math, 2007. 108 A, Goldbeter. A Minimal Cascade Model for the Mitotic Oscillator Involving Cyclin and Cdc2 Kinase. PNAS, 88 (1991), 9107-9111. 109 Tyson JJ. Modeling the Cell Division Cycle: Cdc2 and Cyclin Interactions (1991), 7328-7332. 110 Pomerening JR, Kim SY, Ferrell Jr. JE. Systems-Level Dissection of the Cell-Cycle Oscillator: Bypassing Produces Damped Oscillations (2005), 565-578. 111 Gonze D, Goldbeter A. A Model for a Network of Phosphorylation- Dephosphorylation Cycles Displaying the Dyanics of Dominoes and Clocks. Journal of Theoretical Biology (2001), 167-186. 112 Tyson JJ, Novak B. Regulation of the Eukaryotic Cell Cycle: Molecular Antagonism, Hysteresis and Irreversible Transitions (2001), 249-263. 113 Obeyesekere MN, Tecarro E, Lozano G. Model predictions of MDM2 Mediated Cell Regulation (2004), 655-661. 114 Gardner TS, Dolnik M, Collins JJ. A Theory for Controlling Cell Cycle Dynamics Using a Reversibly Binding Inhibitor. (1998), 14190- 14195. 115 Obeyesekere MN, Herbert JR, Zimmerman SO. A Model of the G1 Phase of the Cell Cycle Incorporating CyclinE/Cdk2 Complex and Retinoblastoma Protein. (1995), 1199-1205. 116 Obeyesekere MN, Zimmerman SO, Tecarro ES, Auchmuty G. A Model of Cell Cycle Behavior Dominated by Kinetics of a Pathway Stimulated by Growth Factors (1999), 917-934. 117 Bai S, Goodrich D, Thron Cd, Tecarro E, Obeyesekere M. Theoretical and Experimental Evidence for hysteresis in Cell Proliferation. Cell Cycle, 2, 1 (2003), 46-52.

228

118 KW, Kohn. Functional Capabilities of Molecular Network Components Controlling the Mammalian G1/S Cell Cycle Phase Transition (1998), 1065-1075. 119 Hatzimanikatis V, Lee KH, Bailey JE. A Mathematical Description of Regulation of the G1-S Transition of the Mammalian Cell Cycle (1999), 631-637. 120 Qu Z, Weiss JN, MacLellan WR. Regulation of the Mammalian Cell Cycle: A Model of the G1-to-S Transition (2003), C349-C367. 121 Swat M, Kel A, Herzel H. Bifurcation Analysis of the Regulatory Modules of the Mammalian G(1)/S Transition (2004), 1506-1511. 122 Aguda BD. A Qualitative Analysis of the Kinetics of the G2 DNA Damage Checkpoint System. PNAS, 96 (1999), 11352-11357. 123 Aguda BD. Instabilities in Phosphorylation-dephosphorylation Cascades and Cell Cycle Checkpoints. Oncogene, 18 (1999), 2846- 2851. 124 Novak B, Tyson JJ. Numerical Analysis of a Comprehensive Model of M-phase Control in Xenopus Oocyte Extracts and Intact Embryos (1993), 1153-1168. 125 Thron CD. Mathematical Analysis of a Model of the Mitotic Clock (1991), 122-123. 126 Thron CD. Bistable Biochemical Switching and Control of the Events of the Cell Cycle (1997), 317-325. 127 Sveiczer A, Csikasz-Nagy A, Gyorffy B, Tyson JJ, Novak B. Modeling the Fission Yeast Cell Cycle: Quantized Cycle Times in Wee1/Cdc25Δ Mutant Cells (2000), 7865-7870. 128 Qu Z, Weiss JN, MacLellan WR. Coordination of Cell Growth and Cell Division: A Mathematical Modeling Study (2004), 4199-4207. 129 Steuer R. Effects of Stochasticity in Models of the Cell Cycle: From Quantized Cell Cycle Times to Noise-Induced Oscillations (2004), 293-301. 130 Srividhya J, Gopinathan MS. A Simple Time Delay Model for Eukaryotic Cell Cycle (2006), 617-627. 131 Yang L, Han Z, MacLellan WR, Weiss JN, Qu Z. Linking Cell Division to Cell Growth in a Spatiotemporal Model of the Cell Cycle (2006), 120-133.

229

132 Rosenbauer, F and Tenen, D G. Transcription factors in myeloid development: balancing differentiation with transformation. Nature Reviews Cancer, Vol., 7 (2007). 133 Fogg DK, et al. A clonogenic bone marrow progenitor specific for macrophages and dendritic cells. Science, 311 (2006). 134 Adolfsson J, Månsson R, Buza-Vidas N, Hultquist A, Liuba K, Jensen CT, Bryder D, Yang L, Borge O-J, Thoren LAM, Anderson K, Sitnicka E, Sasaki Y, Sigvardsson M, Jacobsen SEW. Identification of Flt3+ lympho-myeloid stem cells lacking erythro-megakaryocytic potential a revised road map for adult blood lineage commitment. Cell, 121, 2 (2005), 295-306. 135 Dosil M, Wang S, Lemischka IR. Mitogenic signalling and substrate specificity of the Flk2/Flt3 receptor tyrosine kinase in fibroblasts and interleukin 3-dependent hematopoietic cells. Mol. Cell. Biol, 13, 10 (1993), 6572--6585. 136 Rottapel R, Turck CW, Casteran N, Liu X, Birnbaum D, Pawson T, Dubreuil P. Substrate specificities and identification of a putative binding site for PI3K in the carboxy tail of the murine Flt3 receptor tyrosine kinase.”, Oncogene, 9, 1994, 1755-65. 137 Nakao M, Yokota S, Iwai T, Kaneko H, Horiike S, Kashima K, Sonoda Y, Fujimoto T, Misawa S. Internal tandem duplication of the flt3 gene found in acute myeloid leukemia. Leukemia, 10 (1996), 1911–1918. 138 Turner AM, Lin NL, Issarachai S, Lyman SD, Broudy VC. FLT3 Receptor Expression on the Surface of Normal and Malignant Human Hematopoietic Cells. 139 Weiss A, Schlessinger J. Switching signals on or off by receptor dimerization (1998), 277-80. 140 Hatakeyama M, Kimura S, Naka T, Kawasaki T, Yumoto N, Ichikawa M, Kim J-H, Saito K, Saeki M, Shirouzu M, Yokoyama S, Konagaya A. A computational model on the modulation of mitogen-activated protein kinase (MAPK) and Akt pathways in heregulin-induced ErbB signaling (2003), 451–463. 141 Janowska-Wieczorek A, Majka M, Ratajczak J, Ratajczak MZ. Autocrine/Paracrine Mechanisms in Human Hematopoiesis (2001), 99-107.

230

142 Martelli AM, Evangelisti C, Chiarini F, Blalock WL, Papa V, Fala F. The Phosphatidylinositol 3-Kinase/Akt/Mammalian target of rapamycin signaling network as a new target for acute myelogenous leukemia therapy, 309-330. 143 Zhang S, Broxmeyer HE. p85 subunit of PI3 kinase does not bind to human Flt3 receptor, but associates with SHP2, SHIP, and a tyrosine-phosphorylated 100-kDa protein in Flt3 ligand-stimulated hematopoietic cells (1999), 440-5. 144 Shankar DB, Cheng JC, Sakamoto KM. Role of Cyclic AMP Response Element Binding Protein in Human Leukemias (2005), 1819-24. 145 Cheng JC, Kinjo K, Judelson DR, Chang J, Wu WS, Schmid I, Shankar DB, Kasahara N, Stripecke R, Bhatia R, Landaw EM, Sakamoto KM. CREB is a critical regulator of normal hematopoiesis and leukemogenesis. BLOOD, 111, 3 (2007), 1182–1192. 146 Siu YT, Jin DY. CREB – a real culprit in oncogenesis (2007), 3224- 32. 147 Conkright, Michael D and Montminy, Marc. CREB : the unindicted cancer co-conspirator. Trends in Cell Biology, 15, 9 (2005), 457-459. 148 Martelli AM, Nyåkern M, Tabellini G, Bortul R, Tazzari PL, Evangelisti C, Cocco L. Phosphoinositide 3-kinase/Akt signaling pathway and its therapeutical implications for human acute myeloid leukemia (2006), 911-28. 149 Zhang S, Mantel C, Broxmeyer HE. Flt3 signaling involves tyrosyl phosphorylation of SHP-2 and SHIP and their association with Grb2 and Shc in Baf3/Flt3 cells (1999), 372-80. 150 Zhang S, Broxmeyer HE. Flt3 ligand induces tyrosine phosphorylation of gab1 and gab2 and their association with shp-2, grb2, and PI3 kinase (2000), 195-9. 151 McCubrey JA, Steelman LS, Franklin RA, Abrams SL, Chappell WH, Wong EW, Lehmann BD, Terrian DM, Basecke J, Stivala F, Libra M, Evangelisti C, Martelli AM. Targeting the RAF/MEK/ERK, PI3K/AKT and P53 pathways in hematopoietic drug resistance (2007), 64-103. 152 Cully M, You H, Levine AJ, Mak TW. Beyond PTEN mutations: the PI3K pathway as an integrator of multiple inputs during tumorigenesis. 
Nature Reviews Cancer, 6 (2006), 184-192.

231

153 Steelman LS, Abrams SL, Whelan J, Bertrand FE, Ludwig DE, Bäsecke J, Libra M, Stivala F, Milella M, Tafuri A, Lunghi P, Bonati A, Martelli AM, McCubrey JA. Contributions of the Raf/MEK/ERK, PI3K/PTEN/Akt/mTOR and Jak/STAT pathways to leukemia (2008), 686-707. 154 Guertin DA, Sabatini DM. Review Defining the Role of mTOR in Cancer. Cancer Cell, 12, 1 (2007), 9-22. 155 Haar EV, Lee SI, Bandhakavi S, Griffin TJ, Kim DH. Insulin signaling to mTOR mediated by the Akt/PKB substrate PRAS40 (2007), 316- 23. 156 Hara K, Maruki Y, Long X, Yoshino K, Oshiro N, Hidayat S, Tokunaga C, Avruch J, Yonezawa K. Raptor, a binding partner of target of rapamycin (TOR) mediates TOR Action (2002), 177-89. 157 Kim DH, Sarbassov DD, Ali SM, King JE, Latek RR, Erdjument- Bromage H, Tempst P, Sabatini DM. mTOR Interacts with Raptor to Form a Nutrient-Sensitive Complex that Signals to the Cell Growth Machinery (2002), 163-75. 158 Sancak Y, Thoreen CC, Peterson TR, Lindquist RA, Kang SA, Spooner E, Carr SA, Sabatini DM. PRAS40 is an Insulin-Regulated Inhibitor of the mTORC1 Protein Kinase (2007), 903-15. 159 Schalm SS, Fingar DC, Sabatini DM, Blenis J. TOS Motif-Mediated Raptor Binding Regulates 4E-BP1 Multisite Phosphorylation and Function (2003), 797-806. 160 Yang Q, Guan K-L. Expanding mTOR signaling (2007), 666–681. 161 Crino PB, Nathanson KL, Henske EP. The Tuberous Sclerosis Complex. N Engl J Med, 355, 13 (2006), 1345-56. 162 Mamane Y, Petroulakis E, Rong L, Yoshida K, Ler LW, Sonenberg N. eIF4E – from translation to transformation (2004), 3172-9. 163 Zhang S, Fukuda S, Lee Y, Hangoc G, Cooper S, Spolski R, Leonard WJ, Broxmeyer HE. Essential Role of Signal Transducer and Activator of Transcription (Stat)5a but Not Stat5b for Flt3- Dependent Signaling (2000), 719–728. 164 Yang X, Liu L, Sternberg D, Tang L, Galinsky I, DeAngelo D, Stone R. 
The FLT3 Internal Tandem Duplication Mutation Prevents Apoptosis in Interleukin-3-Deprived BaF3 Cells Due to Protein Kinase A and Ribosomal S6 Kinase 1-Mediated BAD Phosphorylation at Serine 112 (2005), 7338-47.

232

165 Zeng Z, Samudio IJ, Zhang W, Estrov Z, Pelicano H, Harris D, Frolova O, Hail Jr. N, Chen W, Kornblau SM, Huang P, Lu Y, Mills GB, Andreeff M, Konopleva M. Simultaneous Inhibition of PDK1/AKT and Fms-Like Tyrosine Kinase 3 Signaling by a Small-Molecule KP372-1 Induces Mitochondrial Dysfunction and Apoptosis in Acute Myelogenous Leukemia”, Cancer Research, 66, 2006., 3737. 166 Network biology: understanding the cell's functional organization. Nature reviews. Genetics, 5 (Feb. 0, 2004), 101--13. 167 Wolkenhauer, Olaf and Shibata, Darryl. LINKING STEMNESS TO TISSUE FATES REVEALS (2010), 1--32. 168 Weinberg, Steven. LINKING STEMNESS TO TISSUE FATE REVEALS (2010), 1--18. 169 Sadava, D, Heller, HC, Hillis, DM, and Berenbaum, M. Life: The Science of Biology (2009). 170 Morris, Melody K, Saez-Rodriguez, Julio, Sorger, Peter K, and Lauffenburger, Douglas A. Logic-based models for the analysis of cell signaling networks. Biochemistry, 49 (2010), 3216--24. 171 Hubbard, Stevan R. JUXTAMEMBRANE AUTOINHIBITION IN RECEPTOR TYROSINE KINASES. Group, 5 (0 0, 2004), 464--470. 172 Hornberg, Jorrit J, Binder, Bernd, Bruggeman, Frank J, Schoeberl, Birgit, Heinrich, Reinhart, and Westerhoff, Hans V. Control of MAPK signalling : from complexity to what really matters. Oncogene (0 0, 2005), 5533--5542. 173 Holz, M K, Ballif, B A, Gygi, S P, and Blenis, J. mTOR and S6K1 mediate assembly of the translation preinitiation complex through dynamic protein interchange and ordered phosphorylation events. Cell, 123 (0 0, 2005), 569--580. 174 Henske, Elizabeth P. The Tuberous Sclerosis Complex. Clinical Trials (2008), 1345--1356. 175 Hay, Nissim and Sonenberg, Nahum. Upstream and downstream of mTOR Upstream and downstream of mTOR. Genes & Development (2004), 1926--1945. 176 Hannah, A L. Kinases as Drug Discovery Targets in Hematologic Malignancies. Current (2005), 625--642.

233

177 Guan, Lingjie, Song, Kyung, Pysz, Marybeth A et al. Protein Kinase C-mediated Down-regulation of Cyclin D1 Involves Activation of the Translational Repressor 4E-BP1 via a Phosphoinositide 3-Kinase / Akt-independent , Protein Phosphatase 2A-dependent Mechanism in Intestinal Epithelial Cells *. Journal of Biological Chemistry, 282 (2007), 14213--14225. 178 Gschwind, Andreas, Fischer, Oliver M, and Ullrich, Axel. FOCUS ON TARGETED THERAPIES kinases : targets for cancer therapy. Physiology, 4 (2004), 1--10. 179 Goutsias, John and Kim, Seungchan. A Nonlinear Discrete Dynamical Model for Transcriptional Regulation : Construction and Properties. Biophysical Journal, 86 (2004), 1922--1945. 180 Gonfloni, Stefania, Weijland, Albert, Kretzschmar, Jana, and Superti- furga, Giulio. letters Crosstalk between the catalytic and regulatory domains allows bidirectional regulation of Src. America (2000), 281-- 286. 181 Giles, Francis J and Albitar, Maher. Mammalian Target of Rapamycin as a Therapeutic Target in Leukemia. Current (2005), 653--661. 182 Ghoshal, Sampa, Baumann, Heinz, and Wetzler, Meir. Epigenetic regulation of signal transducer and activator of transcription 3 in acute myeloid leukemia. Leukemia (2008), 1--10. 183 Geer, Peter V, Hunter, Tony, and Lindberg, Richard A. RECEPTOR PROTEIN-TYROSINE KINASES AND THEIR SIGNAL TRANSDUCTION PATHWAYS. Molecular Biology (1994), 251--337. 184 Fujioka, A, Terai, K, Itoh, R E et al. Dynamics of the Ras/ERK MAPK cascade as monitored by fluorescent probes. J. Biol. Chem, 281 (2006), 8917--8926. 185 Fro, Morten. Role and regulation of 90 kDa ribosomal S6 kinase ( RSK ) in signal transduction. Molecular and Cellular Endocrinology, 151 (1999), 65--77. 186 Friedman, Alan D. NIH Public Access. Blood Cells, 39 (2007), 340-- 343. 187 Friday, Bret B and Adjei, Alex A. Advances in T argeting the Ras / Raf / MEK / Erk Mitogen-Activated Protein Kinase Cascade with MEK Inhibitors for Cancer Therapy. Clinical Cancer Research, 14 (2008), 342--346.

188 Follo, Matilde Y, Mongiorgi, Sara, Bosi, Costanza et al. The Akt/Mammalian Target of Rapamycin Signal Transduction Pathway Is Activated in High-Risk Myelodysplastic Syndromes and Influences Cell Survival and Proliferation. Cancer Research, 1 (2007), 4287--4294.
189 Foley, Catherine and Mackey, Michael C. Mathematical Biology Dynamic hematological disease: a review (2008), 458--466.
190 Etten, R A. Aberrant cytokine signaling in leukemia. Oncogene (2007), 6738--6749.
191 Ekberg J, Holm C, Jalili S, Richter J, Anagnostaki L, Landberg G, Persson JL. Expression of cyclin A1 and cell cycle proteins in hematopoietic cells and acute myeloid leukemia and links to patient outcome. European Journal of Haematology, 75, 2 (2005), 106--115.
192 Eichhorn, M E, Kleespies, A, and Angele, M K. Angiogenesis in cancer: molecular mechanisms, clinical impact. Langenbecks Archive Of Surgery (2007), 371--379.
193 Du, Keyong and Tsichlis, Philip N. Regulation of the Akt kinase by interacting proteins. Oncogene (2005), 7401--7409.
194 Dohner, Konstanze and Dohner, Hartmut. Acute Myeloid Leukemia Implication of the Molecular Characterization of Acute Myeloid Leukemia. Hematology, 93, 7, 976--982.
195 Doepfner, Kathrin T, Boller, Danielle, and Arcaro, Alexandre. Targeting receptor tyrosine kinase signaling in acute myeloid leukemia. Critical Reviews in Oncology/Hematology, 63 (2007), 215--230.
196 Dikic, I, Szymkiewicz, I, and Soubeyran, P. Cbl signaling networks in the regulation of cell function. Cellular and Molecular Life Sciences, 60 (2003), 1805--1827.
197 Corradetti, M N and Guan, K-L. Upstream of the mammalian target of rapamycin: do all roads pass through mTOR? Oncogene (2006), 6347--6360.
198 Corey, Seth J, Minden, Mark D, Barber, Dwayne L, Kantarjian, Hagop, Wang, Jean C, and Schimmer, Aaron D. Myelodysplastic syndromes: the complexity of stem-cell diseases. Cancer, 7 (2007), 118--129.

199 Choudhary, Chunaram, Brandts, Christian, Schwable, Joachim et al. Activation mechanisms of STAT5 by oncogenic Flt3-ITD. Blood (2008), 370--374.
200 Cheong, Raymond and Levchenko, Andre. Wires in the soup: quantitative models of cell signaling. Trends in Cell Biology, 18, 3 (2008), 112--118.
201 Chase, Andrew and Cross, Nicholas C. Signal transduction therapy in haematological malignancies: identification and targeting of tyrosine kinases. Clinical Science, 111 (2006), 233--249.
202 Chang, F, Steelman, L S, Lee, J T et al. MOLECULAR TARGETS FOR THERAPY (MTT) Signal transduction mediated by the Ras/Raf/MEK/ERK pathway from cytokine receptors to transcription factors: potential targeting for therapeutic intervention. Leukemia, 17 (2003), 1263--1293.
203 Cells, Ligand-stimulated H, Zhang, Shuli, and Broxmeyer, Hal E. p85 Subunit of PI3 Kinase Does Not Bind to Human Flt3 Receptor, but Associates with SHP2, SHIP, and a Tyrosine-Phosphorylated 100-kDa Protein in Flt3 ligand-stimulated hematopoietic cells. Biochemical and Biophysical Research Communications, 254, 2 (1999), 440--445.
204 Carrera, Ana C. TOR Signaling in Mammals. Journal of Cell Science, 117 (2004), 4615--4616.
205 Carpenter, Graham. Nuclear localization and possible functions of receptor tyrosine kinases. Current Opinion in Cell Biology, 15, 2 (2003), 143--148.
206 Cai, Long, Friedman, Nir, and Xie, X S. Stochastic protein expression in individual cells at the single molecule level. Nature, 440 (2006), 233--240.
207 Cabrita GJ, Ferreira BS, da Silva CL, Gonçalves R, Almeida-Porada G, Cabral JM. Hematopoietic stem cells: from the bone to the bioreactor. Trends in biotechnology, 21, 5 (2003), 233--240.
208 Bordignon, Claudio. Stem-cell therapies for blood diseases. Nature, 441 (2006), 5889--5892.
209 Blechman, Janna M, Lev, Sima, Barg, Jacob, Eisenstein, Miriam, Vaks, Baruch, Vogel, Zvi, and Yarden, Y. The Fourth Immunoglobulin Domain of the Stem Cell Factor Receptor Couples Ligand Binding to Signal Transduction. Cell, 80, 1 (1995), 103--113.

210 Bhatia, Mickie. Hematopoiesis from Human Embryonic Stem Cells. Annals Of The New York Academy Of Sciences, 1106 (2007), 219--222.
211 Benedetti, Arrigo D and Graff, Jeremy R. eIF-4E expression and its role in malignancies and metastases. Oncogene, 23, 18 (2004), 3189--3199.
212 Banga, Julio R. Optimization in computational systems biology. Optimization, 7, 47 (2008), 1--7.
213 Baker, S J, Rane, S G, and Reddy, E P. Hematopoietic cytokine receptor signaling. Oncogene, 26 (2007), 6724--6737.
214 Appelbaum, Frederick R, Rowe, Jacob M, Radich, Jerald, and Dick, John E. Acute Myeloid Leukemia. Hematology, 1 (2001), 76--81.
215 Advani, Anjali S. FLT3 and Acute Myelogenous Leukemia: Biology, Clinical Significance and Therapeutic Applications. Clinical Trials, 11, 26 (2005), 3449--3457.
216 Abbott, Alison. Cancer: The root of the problem. Nature, 442 (2006), 742--744.
217 Irish JM, Kotecha N, Nolan GP. Mapping normal and cancer cell signalling networks: towards single-cell proteomics. Group, 6 (2006), 146--155.
218 Iwasaki, Hiromi, Mizuno, Shin-ichi, Arinobu, Yojiro et al. The order of expression of transcription factors directs hierarchical specification of hematopoietic lineages. Genes & Development (2006), 464--470.
219 Oltvai, Zoltan N and Barabasi, Albert-Laszlo. Life's Complexity Pyramid. October, 298, 5594 (2002), 763--764.
220 Wolkenhauer, O, Sreenath, S N, Wellstead, P, Ullah, M, and Cho, K.
221 Wolkenhauer, O, Ullah, M, Wellstead, P, and Cho, K H. The dynamic systems approach to control and regulation of intracellular networks. FEBS Lett., 579 (2005), 1846--1853.
222 Siendones E, Barbarroja N, Torres LA, Buendia P, Velasco F, Dorado G, Torres A, Lopez-Pedrera C. Inhibition of Flt3-activating mutations does not prevent constitutive activation of ERK/Akt/STAT pathways in some AML cells: a possible cause for the limited effectiveness of monotherapy with small-molecule inhibitors. Hematological Oncology, 25, 1 (2007), 30--37.

223 Shen R, Ye Y, Chen L, Yan Q, Barsky SH, Gao J-X. Precancerous Stem Cells Can Serve As Tumor Vasculogenic Progenitors (2008).
224 Kiyoi H, Towatari M, Yokota S, Hamaguchi M, Ohno R, Saito H, Naoe T. Internal tandem duplication of the FLT3 gene is a novel modality of elongation mutation which causes constitutive activation of the product. Leukemia, 12 (1998), 1333--1337.
225 Kiyoi H, Naoe T, Yokota S, Nakao M, Minami S, Kuriyama K, Takeshita A, Saito K, Hasegawa S, Shimodaira S, Tamura J, Shimazaki C, Matsue K, Kobayashi H, Arima N, Suzuki R, Morishita H, Saito H, Ueda R, Ohno R. Internal tandem duplication of FLT3 associated with leukocytosis in acute promyelocytic leukemia (1997), 1447--1452.
226 Stirewalt DL, Kopecky KJ, Meshinchi S, Appelbaum FR, Slovak ML, Willman CL, Radich JP. FLT3, RAS and TP53 mutations in elderly patients with acute myeloid leukemia (2001), 3589--3595.
227 Meshinchi S, Woods WG, Stirewalt DL, Sweetser DA, Buckley JD, Tjoa TK, Bernstein ID, Radich JP. Prevalence and prognostic significance of FLT3 internal tandem duplication in pediatric acute myeloid leukemia (2001), 89--94.
228 Schnittger S, Schoch C, Dugas M, Kern W, Staib P, Wuchter C, Löffler H, Sauerland CM, Serve H, Büchner T, Haferlach T, Hiddemann W. Analysis of FLT3 length mutations in 1003 patients with acute myeloid leukemia: correlation to cytogenetics, FAB subtype, and prognosis in the AMLCG study and usefulness as a marker for the detection of minimal residual disease (2002), 59--66.
229 Thiede C, Steudel C, Mohr B, Schaich M, Schäkel U, Platzbecker U, Wermke M, Bornhäuser M, Ritter M, Neubauer A, Ehninger G, Illmer T. Analysis of FLT3-activating mutations in 979 patients with acute myelogenous leukemia: association with FAB subtypes and identification of subgroups with poor prognosis (2002), 4326--4335.
230 Xu F, Taki T, Yang HW, Hanada R, Hongo T, Ohnishi H, Kobayashi M, Bessho F, Yanagisawa M, Hayashi Y. Tandem duplication of the FLT3 gene is found in acute lymphoblastic leukaemia as well as acute myeloid leukaemia but not in myelodysplastic syndrome or juvenile chronic myelogenous leukaemia in children (1999), 155--162.

231 Kiyoi H, Naoe T, Nakano Y, Yokota S, Minami S, Miyawaki S, Asou N, Kuriyama K, Jinnai I, Shimazaki C, Akiyama H, Saito K, Oh H, Motoji T, Omoto E, Saito H, Ohno R. Prognostic implication of FLT3 and N-RAS gene mutations in acute myeloid leukemia (1999), 3074--3080.
232 Kottaridis PD, Gale RE, Frew ME, Harrison G, Langabeer SE, Belton AA, Walker H, Wheatley K, Bowen DT, Burnett AK, Goldstone AH, Linch DC. The presence of a FLT3 internal tandem duplication in patients with acute myeloid leukemia (AML) adds important prognostic information to cytogenetic risk group and response to the first cycle of chemotherapy: analysis of 854 patients from the United King (2001), 1752--1759.
233 Manning BD, Cantley LC. AKT/PKB signaling: navigating downstream (2007), 1261--1274.
234 Abu-Duhier FM, Goodeve AC, Wilson GA, Gari MA, Peake IR, Rees DC, Vandenberghe EA, Winship PR, Reilly JT. FLT3 internal tandem duplication mutations in adult acute myeloid leukaemia define a high-risk group. Br. J. Haematol., 111 (2000), 190--195.
235 Asnaghi L, Bruno P, Priulla M, Nicolin A. mTOR: a protein kinase switching between life and death. Journal of Antibiotics (English ed.), 50, 6 (2004), 545--549.
236 Baselga J. Targeting Tyrosine Kinases in Cancer. Science, 312, 5777 (2006), 6724--6737.
237 Akashi K. Cartography of Hematopoietic Stem Cell Commitment Dependent upon a Reporter for Transcription. Ann N Y Acad Sci., 1106 (2007), 76--81.
238 Biermann, S, Uhrmacher, A M, and Schumann, H. Supporting Multi-Level Models in Systems Biology by Visual Methods. In Proceedings of the 18th European Simulation Multiconference (Magdeburg, Germany, 2004), 1--32.
239 Blenis, John. Signal transduction via the MAP kinases: Proceed at your own RSK. Review Literature And Arts Of The Americas, 90 (1993), 5889--5892.
240 Brandts, Christian H, Berdel, Wolfgang E, and Serve, Hubert. Oncogenic Signaling in Acute Myeloid Leukemia. Curr Drug Targets, 8, 2 (2007), 237--246.

241 Butcher, Eugene C, Berg, Ellen L, and Kunkel, Eric J. Systems biology in drug discovery. Nature Biotechnology, 22 (2004), 1253--1259.
242 Cooper, S. Rethinking synchronization of mammalian cells for cell cycle analysis. Cellular and Molecular Life Sciences, 60 (2003), 1--8.
243 Ferrajoli A, Faderl S, Ravandi F, Estrov Z. The JAK-STAT Pathway: A Therapeutic Target in Hematological Malignancies. Cancer (2006), 671--679.
244 Ferreira R, Ohneda K, Yamamoto M, Philipsen S. GATA1 Function, a Paradigm for Transcription Factors in Hematopoiesis. Molecular and Cellular Biology, 25 (2005), 1215--1227.
245 Fingar DC, Blenis J. Target of rapamycin (TOR): an integrator of nutrient and growth factor signals and coordinator of cell growth and cell cycle progression. Oncogene (2004), 3151--3171.
246 Fitzgerald JB, Schoeberl B, Nielsen UB, Sorger PK. Systems biology and combination therapy in the quest for clinical efficacy. Nature Chemical Biology, 2 (2006), 458--466.
247 Houslay MD. PERSPECTIVE A RSK(y) Relationship with Promiscuous PKA. Sci STKE, 349 (2006), pe32.
248 Shukla, S. Nuclear Factor-κB/p65 (Rel A) is constitutively active in human prostate adenocarcinoma and correlates with disease progression. Neoplasia, 6 (2004).
249 Jolliffe IT. Principal Component Analysis. Springer, 2002.
250 Dijkstra T. Some comments on maximum likelihood and partial least squares methods. Journal of Econometrics, 22 (1983), 67--90.
251 Geladi P, Kowalski B. Partial least-squares regression: A tutorial. Analytica Chimica Acta, 185 (1986), 1--17.
252 The Cell as a Machine, NSF Workshop Report. (Arlington VA 2007).
253 Sreenath SN, Mesarovic MD, Soebiyanto RP, Wolkenhauer O, Loparo KA. Coordination Principles in Complex Systems Biology. In IEEE Transactions in Systems Biology ().
254 Aldridge, Bree B, Burke, John M, Lauffenburger, Douglas A, and Sorger, Peter K. Physicochemical modelling of cell signalling pathways. Nature cell biology, 8 (2006), 1195--1203.

255 Ciliberto A, Csikasz-Nagy A, Novak B, Westerhoff HV, Snoep JL, Conradie R, Bruggeman FJ. Restriction point control of the mammalian cell cycle via the cyclin e/cdk2:p27 complex. FEBS Journal, 277, 2 (2010).
256 Thron CD, Tecarro E, Obeyesekere M, Bai S, Goodrich D. Theoretical and experimental evidence for hysteresis in cell proliferation. Cell Cycle, 2, 1 (2003), 46--52.
257 Arooz T, Yam CH, Siu WY, Lau A, Li KK, Poon RY. On the concentrations of cyclins and cyclin-dependent kinases in extracts of cultured human cells. Biochemistry, 39, 31 (2000), 9494--9501.
258 Satyanarayana A, Kaldis P. Mammalian cell-cycle regulation: several Cdks, numerous cyclins and diverse compensatory mechanisms. Oncogene, 28, 33 (2009), 2925--2939.
259 Pagano M, Pepperkok R, Verde F, Ansorge W, and Draetta G. Cyclin A is required at two points in the human cell cycle. EMBO J., 11, 3 (1992), 961--971.
260 Soni DV, Sramkoski RM, Lam M, Stefan T, Jacobberger JW. Cyclin B1 is rate limiting but not essential for mitotic entry and progression in mammalian somatic cells. Cell Cycle, 7 (2008), 1285--1300.
261 Soni DV. Studies on regulation of mitotic transition by Cyclin B1/CDK1. Cleveland, OH, 2005.
262 Chow S, Hedley D, Grom P, Magari R, Jacobberger JW, Shankey TV. Whole blood fixation and permeabilization protocol with red blood cell lysis for flow cytometry of intracellular phosphorylated epitopes in leukocyte subpopulations. Cytometry A, 6 (2005).
263 Cross FR, Archambault V, Miller M, Klovstad M. Testing a Mathematical Model of the Yeast Cell Cycle. Molecular Biology of the Cell, 13, 1 (2002), 52--70.
264 Ciliberto A, Tyson JJ. Mathematical Model for Early Development of the Sea Urchin Embryo. Bulletin of Mathematical Biology, 62 (2000), 37--59.
265 Calzone L, Thieffry D, Tyson JJ. Dynamical Modeling of Syncytial Mitotic Cycles in Drosophila Embryos. Molecular Systems Biology, 3 (2007), 131.
266 Marquardt D. An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM J Appl Math, 11 (1963), 431--441.

267 Schmidt H, Jirstrand M. Systems Biology Toolbox for MATLAB: A Computational Platform for Research in Systems Biology. Bioinformatics, 22, 4 (2005), 514--515. Available at (www.sbtoolbox2.org).
268 Hoops S, Sahle S, Gauges R, Lee C, Pahle J, Simus N, Singhal M, Xu L, Mendez P, Kummer U. COPASI – a COmplex PAthway SImulator. Bioinformatics, 22 (2006), 3067--3074.
269 Hucka M, Finney A, Sauro H, et al. The Systems Biology Markup Language (SBML): A medium for representation and exchange of biochemical network models. Bioinformatics, 4 (2003), 524--531.
270 Funahashi A, Tanimura N, Morohashi M, Kitano H. CellDesigner: a process diagram editor for gene-regulatory and biochemical networks. BIOSILICO, 1 (2003), 159--162.
271 Albeck J, Burke J, Spencer S, Lauffenburger DA, Sorger PK. Modeling a Snap-Action, Variable-Delay Switch Controlling Extrinsic Cell Death. PLoS Biology, 6, 12 (2008), e299.
272 Frisa PS, Jacobberger JW. Cytometry of chromatin bound Mcm6 and PCNA identifies two states in G1 that are separated functionally by the G1 restriction point. BMC Cell Biology, 11 (2010), 26.
273 Jacobberger JW, Sramkoski RM, Frisa PS, Ye PP, Gottlieb MA, Hedley DW, Shankey TV, Smith BL, Paniagua M, Goolsby CL. Immunoreactivity of Stat5 phosphorylated on tyrosine as a cell-based measure of Bcr/Abl kinase activity. Cytometry Part A, 54, 2 (2003), 75--88.
274 Novak B, Tyson JJ. Regulation of the eukaryotic cell cycle: Molecular antagonism, hysteresis and irreversible transitions. Journal of Theoretical Biology, 210, 2 (2001), 249--263.
275 Savageau MA. Biochemical Systems Analysis: A Study of Function and Design in Molecular Biology. Addison-Wesley, 1976.
276 Soni DV, Jacobberger JW. Inhibition of cdk1 by alsterpaullone and thioflavopiridol correlates with increased transit time from mid G2 through prophase. Cell Cycle, 3 (2004), 349--357.
277 Steuer R. Effects of Stochasticity in Models of the Cell Cycle: From Quantized Cell Cycle Times to Noise-Induced Oscillations. Journal of Theoretical Biology, 228 (2004), 293--301.
278 Thron CD. Bistable Biochemical Switching and Control of the Events of the Cell Cycle. Oncogene, 15 (1997), 317--325.

279 Thron CD. Mathematical Analysis of a Model of the Mitotic Clock. Science, 254 (1991), 122--123.
280 Kim KA, Spencer SL, Albeck JG, Burke JM, Sorger PK, Gaudet S, Kim H. Systematic calibration of a cell signaling network model. BMC Bioinformatics (2010), 356--364.
281 Kitano H. Computational systems biology. Nature, 420 (2002), 356--364.
