MASARYK UNIVERSITY FACULTY OF INFORMATICS

COMPREHENSIVE MODELING PLATFORM FOR PHOTOSYNTHETIC ORGANISMS

THESIS

MATEJ KLEMENT

2012

Contents

1 Introduction
  1.1 Objectives

2 State of the art
  2.1 Data Exchange formats
    2.1.1 SBML
    2.1.2 CellML
    2.1.3 BioPAX
    2.1.4 PSI-MI
    2.1.5 SBGN
    2.1.6 Format of Matlab
    2.1.7 Format Octave
  2.2 Data Exchange and modeling tools
    2.2.1 Biomodels.net
    2.2.2 CellML.org
    2.2.3 Copasi
    2.2.4 Vcell
    2.2.5 E-cell
    2.2.6 ProMot
    2.2.7 PaxTools
    2.2.8 Matlab
    2.2.9 Octave
    2.2.10 Scilab
    2.2.11 BioUML
  2.3 Annotation ontology
    2.3.1 Gene Ontology
    2.3.2 KEGG
    2.3.3 SBO
  2.4 Photosynthesis modeling

3 Aims
  3.1 Theoretical aims
  3.2 Practical aims
  3.3 Methodology
  3.4 Progression schedule
  3.5 Expected Outputs

4 Results
  4.1 Design and specification
    4.1.1 Ontology tree
    4.1.2 Model structure
    4.1.3 Connecting ontology and model
    4.1.4 Annotation database
    4.1.5 Ontology and model annotation
  4.2 Implemented system
  4.3 Conclusion

5 Publications

Chapter 1

Introduction

In recent decades, rapid progress in microchip technology has given rise to a large number of computer-driven sciences. One of them is systems biology, a new field of biology that aims at a system-level understanding of biological systems[14]. Biology originally studied biological systems as a whole and made remarkable progress in this area, but more recently it has focused on identifying genes and the functions of their products, which are the components of these systems. The next major task is to understand, at the system level, the components revealed by molecular biology. Systems biology was established to pursue this long-term task.

While systems biology covers all aspects of analyzing the behavior of system models, computational systems biology addresses only a narrower part of this research: it aims at a system-level understanding of biological systems by analyzing biological data with computational techniques[16]. The recent enormous advances in sequencing projects and microarrays have moved this field forward, providing more powerful tools and knowledge for discovering relations and behavior in the data. With systems biology in mind, new sophisticated computational methods are being developed to analyze the data generated by these technologies in a systematic way, deciphering the complex and networked biological processes and phenomena that take place in cells, tissues and organisms. Thanks to recent developments in information technology, cheap and accessible computing power, global networks and databases have become widely available for mathematical modeling and simulation of complex biological systems. Simulation and modeling combine different system analysis tools, such as discrete mathematics, stochastics, differential equations and complex system simulation, with model-database integration architectures.
The creation and testing of quantitative models that unravel the hierarchical and non-linear character of cellular systems will be feasible through the cooperation of theoretical and experimental biologists with system analysts, computer scientists, mathematicians, engineers and physicists. These long-term efforts demand comprehensive tools for sharing knowledge and data among the participating parties.

Large teams of scientists around the world can now cooperate, and the field is therefore moving away from extremely reduced models and analyses. As a result, large amounts of simulated data are becoming available for thousands of components such as mRNA or proteins. Connecting these simulated data yields compact blocks of cellular machinery in action, and dynamic models describing these processes can be built from such blocks. These comprehensive models explicitly represent a large number of biochemical reactions at a relatively high level of detail. However, such dynamic models present another challenge, which originates from the transcription of non-linear systems into models: the estimation of numerical parameters. This problem can be solved in an inverse fashion, where simulated data are compared with experiments by sophisticated software that searches for local and global minima in a multidimensional space.

The last decade was fruitful for systems biology and for the formats, languages and tools that handle them. Thanks to this development, many tools were created and are still in use today. These tools are mostly of a general nature; there are no tools dedicated to photosynthesis, its modeling and research. Although photosynthesis research could lead to renewable fuels or artificial oxygen production, current biology concentrates mainly on DNA, mRNA, proteins and similar topics. Another drawback is that, because of the character of several of its reactions, photosynthesis also belongs to another field, physics. This was the reason for establishing the CyanoTeam project, which aims to solve problems of photosynthesis. The project is carried out in cooperation with the PSI company and the Global Change Research Centre AS CR, v.v.i.

The second chapter describes the current state of systems biology, more precisely the ways of handling biological models, the tools that handle these models and the annotations that integrate them into a broader context. The third chapter deals with the aims and objectives to be reached and the steps necessary to reach them. The fourth chapter presents the current results and state of the work together with a description of the implementation. The fifth chapter lists the publications achieved so far.

1.1 Objectives

The main objective of this work is to create a tool and a methodology that provide the parties participating in photosynthesis research with a place for exchanging and maintaining dynamic models and knowledge about this process. Nedbal et al. created a concept called Comprehensive Modeling Space[18], which describes the fundamental ideas of specifying photosynthesis models. This work should propose solutions for the Comprehensive Modeling Space concept and introduce a methodology, called Comprehensive Modeling Platform, that provides a set of rules for the correct encoding of modular models and for data composition and suggests a naming convention for individual model components. It should also produce a practical output in the form of an implemented application covering visualization, sharing, exploration, maintenance, annotation and dynamic analysis of models on a generally available platform running in the web environment. The models should support both top-down and bottom-up modeling strategies. Moreover, the solution should support communication with commonly available tools and formats. The main benefit of the whole concept should be its domain-specific focus, which should make it possible to describe and understand the given area of interest better than general tools and approaches do.

Chapter 2

State of the art

In the last decade, systems biology has improved considerably[14] thanks to overall progress in information technology, which allows information about the examined processes to be shared better and faster. Because systems biology is a new science, research has concentrated primarily on areas of common interest, mostly DNA, gene profiling and protein-protein interactions. Computational systems biology is concerned with a subgroup of the problems addressed by systems biology and puts the stress primarily on systematic data analysis, which has been made possible by the improvement of new technologies.

It is necessary to be able to exchange this newly discovered knowledge among specialists, and several languages for systems biology were created to share knowledge and models with this intention in mind. Systems biology models are mostly dynamic, meaning that they describe the dynamics of the modeled system in time, and they mainly aim at the population development of the examined process. The best-known and most common languages are SBML[7] and CellML[21], both of which are based on XML[8]. The advantage of these formats is that they keep the structure and lucidity of models, which is necessary for cooperation among teams and for passing on knowledge. There are other, similar formats, such as BioPAX[5], PSI-MI[11] or SBGN[25], which are aimed at a narrower scope. BioPAX was developed for exchanging information about biological pathways and PSI-MI for exchanging proteomics data, while SBGN differs from the previous two in being dedicated to the graphical representation of biological processes. More general formats such as Matlab[10] or Octave[6] can be used to capture the mathematical essence of a biological problem, expressed as a set of differential equations.
Other alternatives among modeling languages are Stochastic Pi Calculus[26], BlenX[4], Kappa Calculus[3] or BioAmbients[28]. However, these languages and formats are aimed at stochastic modeling, which is not appropriate for the problems posed by CMS.

Along with this variety of representation languages, tools that use these languages and formats have also been developed. They can be split into several categories. The first category consists of database tools, mostly web-based, for example Biomodels.net[20], Cellml.org[21] or VCell[23] (its web presentation only). These tools are mostly web portals that publish biological models. They also support other functions such as simulation, export to several formats or display of the reaction network. The second category comprises tools devoted to the modification and development of models, for example Copasi[29], VCell[23], E-Cell[32], ProMot[22], PaxTools[5] and BioUML[15]. The basic functionality of these tools includes model development, visualization, simulation with graphical output, parameter scans, comparison with experiments and others. The third category consists of simulation tools whose main task is numerical simulation, such as Matlab[10], Scilab[31], Octave[6] and others. These applications are mainly devoted to the numerical solution of differential equations. However, there are Matlab plug-ins called SimBiology and SBToolBox[30] which improve its support for systems biology, add support for some of the mentioned languages and provide stochastic simulation; with them, Matlab can also be included in the second category.

A different set of issues must be solved when models are to be assigned, in a universal way, to the larger units to which they belong, such as cells or organisms. This kind of problem is addressed by the MIRIAM standard[24] (Minimum Information Requested In the Annotation of biochemical Models), which identifies the information sufficient for integrating a model, or a part of it, into an ontology tree. Several public databases can be used for annotation against a systems ontology. Among the best known are GO (Gene Ontology)[2], KEGG (Kyoto Encyclopedia of Genes and Genomes)[13] and SBO (Systems Biology Ontology)[19]. Models enriched with annotations according to the MIRIAM standard are easily integrated into a larger context.

Photosynthesis, an elementary vital process that generates oxygen, is still not fully explored. Despite this, it receives little attention in systems biology.
A search through several public databases of biological models returns very few models of photosynthesis, and most of the time none at all. This may be caused by the complexity of the process, which extends over several time scales, or by the fact that it involves not only systems biology but also physics and chemistry because of the nature of some participating reactions (e.g. the capture of a photon in photosystem II and its conversion into an electron).

The inability to describe all models of photosynthesis led Nedbal et al. to propose the concept of CMS (Comprehensive Modeling Space), which should facilitate storing these models and thus provide an environment for exchanging models among researchers exploring photosynthesis. CMS is a set of rules and suggestions on how to create a system for information exchange. Its foundation should be a specification describing how to store dynamic models whose components can carry annotations that describe the relation between the real-world entity and the model. It must also be possible to link models to the ontology structure of photosynthesis. Another important part of the system should be an implementation that facilitates the visualization of dynamic models, with stress on maintenance and sharing. Besides displaying the model hierarchy, visualization includes the integration of models into the real-life system, with the ability to simulate them and compare them with experiments.
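The two structural requirements named above, components that carry annotations and a link from the model to an ontology tree, can be pictured with a small Python sketch. All class names, the qualifier is_part_of and the three-node ontology are hypothetical and serve only as an illustration; they are not the CMS specification.

```python
from dataclasses import dataclass, field


@dataclass
class OntologyNode:
    """One node of a (hypothetical) ontology tree of photosynthesis."""
    name: str
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

    def path_to(self, name, prefix=()):
        """Return the path from this node down to the node called `name`, or None."""
        path = prefix + (self.name,)
        if self.name == name:
            return path
        for child in self.children:
            found = child.path_to(name, path)
            if found:
                return found
        return None


@dataclass
class Component:
    """A model component with annotations (qualifier -> ontology node name)."""
    identifier: str
    annotations: dict = field(default_factory=dict)


def locate(component, ontology, qualifier="is_part_of"):
    """Resolve an annotation of a component to its place in the ontology tree."""
    return ontology.path_to(component.annotations[qualifier])


# Illustrative ontology fragment and one annotated component.
root = OntologyNode("photosynthesis")
light = root.add(OntologyNode("light reactions"))
light.add(OntologyNode("photosystem II"))
p680 = Component("P680", {"is_part_of": "photosystem II"})
```

Calling locate(p680, root) walks the tree and returns the chain of ontology nodes that places the component in its biological context, which is the kind of integration the annotations are meant to enable.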

2.1 Data Exchange formats

This section describes in detail the current state of selected formats in the area of systems biology. These formats and standards were chosen for their suitability for solving informatics problems and for their maturity, which rule-based languages have not yet reached. Rule-based languages were therefore omitted, even though they could solve part of the CMS issues. Another advantage of the chosen standards is that they are well known and widely used in the community, so a solution compatible with them will be accepted more easily. The described formats are:

• SBML

• CellML

• BioPAX

• PSI-MI

• SBGN

• Matlab

• Octave

2.1.1 SBML

SBML[7] (Systems Biology Markup Language) is a standard format for the exchange of computational models of biochemical networks. It is developed in cooperation with the community of modelers so that it meets their requirements as closely as possible. SBML originated from the lack of uniformity among researchers who used different tools with different model formats, which made the exchange of data between these tools almost impossible. SBML is developed in levels, each of which adds new functions to the already defined language. Thanks to this approach, tools that depend on one level of the language are not affected by later updates. In Level 1, mathematical formulae were expressed as text strings; in Level 2 these were replaced by expressions in the MathML[33] format, which allows more complicated equations than the previous level. Level 2 also added the possibility of defining metadata for a model, which improved the lucidity of models. Level 3 will add the possibility of using models as components in the development of another model. It should also add the ability to describe the states and interactions of model components by rules instead of enumerating all possible combinations. Additional features should include 2D and 3D geometry, model layers and others.
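As a rough illustration of how such a model is laid out, the following Python sketch parses a hand-written SBML Level 2 fragment with the standard library. The model and its identifiers (toy_model, A, B, conversion) and the helper summarize are assumptions made for this example and are not taken from any published model; the fragment has not been validated against the SBML schema, and the kinetic law, which Level 2 would encode in MathML, is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative SBML Level 2 fragment (hypothetical toy model).
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level2/version4" level="2" version="4">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="cell" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="A" compartment="cell" initialConcentration="1.0"/>
      <species id="B" compartment="cell" initialConcentration="0.0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="conversion" reversible="false">
        <listOfReactants><speciesReference species="A"/></listOfReactants>
        <listOfProducts><speciesReference species="B"/></listOfProducts>
      </reaction>
    </listOfReactions>
  </model>
</sbml>
"""

# Namespace of SBML Level 2 Version 4, needed for element lookups.
NS = {"sbml": "http://www.sbml.org/sbml/level2/version4"}


def summarize(sbml_text):
    """Return the species ids and a mapping reaction id -> (reactants, products)."""
    root = ET.fromstring(sbml_text)
    model = root.find("sbml:model", NS)
    species = [s.get("id") for s in model.findall("sbml:listOfSpecies/sbml:species", NS)]
    reactions = {}
    for reaction in model.findall("sbml:listOfReactions/sbml:reaction", NS):
        reactants = [
            ref.get("species")
            for ref in reaction.findall("sbml:listOfReactants/sbml:speciesReference", NS)
        ]
        products = [
            ref.get("species")
            for ref in reaction.findall("sbml:listOfProducts/sbml:speciesReference", NS)
        ]
        reactions[reaction.get("id")] = (reactants, products)
    return species, reactions
```

The point of the sketch is that the reaction network stays explicit in the file: species, compartments and reactions are separate, named elements that a tool can read without interpreting equations.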

Conclusion about SBML

Unfortunately, Level 3 is still in the design stage, and despite the massive improvement over Level 2 it does not provide a wide enough variety of functions for defining CMP models. One missing function, for example, is the possibility to define a membrane structure. Another drawback is the impossibility of rule-based modeling, although this will be added in Level 3.

2.1.2 CellML

Like SBML[7], CellML[21] is one of the most widely used formats in systems biology. Its main aim is the storage and exchange of computer-based mathematical models of the function of biological cells, although the language can be used for modeling in any other area as well. It is primarily designed for formulating models with real numbers and ODE[9] or DAE[17] equations. The model structure is based on a modular architecture, so a new model can be created by combining several existing models or by using them as parts of the new model. This is a major advantage over SBML Level 2, which does not support this feature. Like SBML, CellML does not support rule-based models, and this feature is not even under discussion.

Conclusion about CellML

Although CellML[21] is more general than SBML, it does not cover all the requirements of the CMS concept. Like SBML, it does not support volumeless membrane structures. It also lacks support for rule-based models, which are necessary for some photosynthesis models.

2.1.3 BioPAX

The acronym BioPAX[5] stands for Biological Pathway Exchange, which also states the aim of the language. BioPAX is a language for the integration, exchange, visualization and analysis of biological pathway data. It mainly targets the exchange of data between groups that hold pathway data, and thus reduces the complexity of exchanging data between different formats. BioPAX provides a standard with well-defined semantics for pathway representation, which makes communication between tools using this format more effective. Its ontology is based on the Web Ontology Language (OWL)[1], with a view to the automatic evaluation and integration of the information contained in biological pathways. It is mainly used to address the semantic heterogeneity of data sources.

Conclusion about BioPAX

BioPAX[5] is a powerful language for connecting and sharing data on biological pathways. Unfortunately, this specific focus is not appropriate for exchanging biological models of photosynthesis. The basic processes can be described in this language, but like the previous languages it has limitations concerning membrane structures and rule-based models.

2.1.4 PSI-MI

Another XML-based[8] language is PSI-MI[11] (Proteomics Standards Initiative Molecular Interaction). Its purpose is much more specific: it focuses on exchanging information about interactions between molecules. Its advantage over the previous language is the possibility of binding directly to the OBO[12] format for identifying ontologies. Besides the XML representation, PSI-MI also offers a plain-text format and the possibility of storing experimental data. PSI-MI was designed and developed by a consortium of molecular interaction data providers with members from both academia and industry, and it is backed by the universities of Bielefeld, Bordeaux, Cambridge and others. One of its major drawbacks is that it does not represent reaction pathways or reaction networks, and as a result it cannot store ODE[9] models.

Conclusion about PSI-MI

PSI-MI[11] is well supported by the scientific community in the field of proteomics, but it is also focused on this field, so it cannot store the structure of biological models built around reaction pathways. It can store information about the experiments conducted with the transferred data, which is a big advantage over all the previously mentioned languages.

2.1.5 SBGN

Whereas the other languages represent the structure of models in a biological fashion, SBGN[25] (Systems Biology Graphical Notation) is aimed at the graphical recording of cellular and biological processes. SBGN defines a comprehensive set of symbols with precise meanings, together with detailed descriptions of how they are incorporated into larger entities. The language consists of three parts: Process Description, Entity Relationship and Activity Flow. Process Description defines the temporal course of biochemical interactions in a network and can be used to display all molecular entities, which may occur repeatedly within a diagram. Entity Relationship shows the relationships in which an entity is involved, regardless of temporal aspects; relationships can be understood as rules describing the influence of nodes on other relationships. Activity Flow depicts the flow of information between the biochemical entities in a network without information about the state transitions of entities, and it is especially suitable for representing the effects of genetic and environmental perturbations. The specification does not define a direct representation in XML or another format, but there is an XML-based implementation called SBGN-ML which fills this gap in the language specification.

Conclusion about SBGN

SBGN[25] differs from all the previous languages in focusing mainly on graphical representation. Biological models described by CMP cannot be stored in this language, but SBGN has some interesting properties for sharing and storing models which should be included in the implementation of CMP.

2.1.6 Format of Matlab

The Matlab[10] format can be used for sharing biological models built on differential equations. Such models do not represent the relationships among entities and their reaction pathways directly but only through differential equations, because they are written with numerical simulation in mind. Models in this format are therefore hard to edit because of the poor lucidity and the complexity of the notation.
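The loss of structure described above can be illustrated with a small sketch, written here in Python rather than Matlab. The toy network (A -> B -> C), the rate constants k1 and k2 and all function names are assumptions made for the example. The stoichiometric matrix N still records which reaction changes which species, but once it is multiplied with the rate vector only a list of derivatives remains, and that list is all an equation-based format stores. Integration uses the explicit Euler method for simplicity, not a production-quality solver.

```python
# Hypothetical toy network: A -> B (rate k1*[A]) and B -> C (rate k2*[B]).
SPECIES = ["A", "B", "C"]

# Stoichiometric matrix: rows are species, columns are reactions.
N = [
    [-1, 0],
    [1, -1],
    [0, 1],
]


def rates(x, k1=1.0, k2=0.5):
    """Mass-action rates of the two reactions for concentrations x."""
    a, b, _ = x
    return [k1 * a, k2 * b]


def derivatives(x):
    """Right-hand side dx/dt = N * v(x); the network structure is no longer visible here."""
    v = rates(x)
    return [sum(N[i][j] * v[j] for j in range(len(v))) for i in range(len(x))]


def simulate(x0, t_end=5.0, dt=0.001):
    """Integrate the ODE system with the explicit Euler method."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        dx = derivatives(x)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x
```

For example, simulate([1.0, 0.0, 0.0]) returns the concentrations of A, B and C at the end of the time course; recovering the reactions from the function derivatives alone would require reading the equations back, which is why such models are hard to edit.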

Conclusion about Format of Matlab

This format is not suitable for the needs of CMP but can be used for the part representing models intended for simulation.

2.1.7 Format Octave

Like the Matlab format, the format of the Octave[6] application is not designed directly for sharing biological models in terms of CMP. Models built on differential equations do not represent relationships or reaction pathways but are intended for numerical simulation. As with the Matlab format, models in the Octave format are hard to maintain because of the poor lucidity and the complexity of the notation.

Conclusion about format of Octave

The Octave[6] format is not suitable for the needs of CMP but could be used for the part representing models intended for simulation.

Conclusion to languages and formats

All the mentioned formats fulfill the needs of CMP only partially and therefore cannot be used directly. However, because they are well known in the scientific community, it would be appropriate to maintain compatibility with at least some of them. The best known and best suited are SBML[7], SBGN[25] and Octave[6] (or Matlab), because of their general usability in research and in the sharing of models.

2.2 Data Exchange and modeling tools

Because of the large number of formats dedicated to the exchange of biological and cellular models or molecular interactions, a large number of tools have been created that provide simulation, management, development and sharing of models and comparison of experimental data with a model. Each tool covers a different part of the features required by CMP, and the tools can be divided into three groups: model databases, modeling tools and simulation tools. Given this range of functions, every tool is described with respect to the characteristics that meet the requirements of CMP. The described tools are:

• Biomodels.net

• Cellml.org

• Copasi

• VCell (web platform and desktop tool)

• E-Cell

• ProMot

• Scilab

• PaxTools

• Matlab

• Octave

• BioUML

2.2.1 Biomodels.net

Biomodels.net[20] is probably the largest and best-known database of peer-reviewed, published computational models, mainly mathematical models from the field of systems biology. The database takes the form of a web portal that allows biologists to store, search and retrieve mathematical models. It also facilitates the generation of sub-models and online simulation and offers conversion between different formats. Biomodels.net contains over 350 curated models and nearly 400 other models. In addition, every model can be displayed online with detailed information such as the author of the model, the related publications, its inclusion in various databases and the model entities and the reactions among them. The list of static parameters of the model and, where available, the curator's evaluation can be viewed as well. Models can be simulated in several ways, with a choice of the number of steps, the simulated time or even the values of simulated parameters, and the simulated values are then shown in a chart. A reaction network of the model can be displayed for easier understanding. Annotations with links to annotation databases are available for all reactions and entities.

Conclusion to Biomodels.net

Although the Biomodels.net[20] portal offers its users a wide variety of functions, it lacks some functions required for the implementation of CMP. Models cannot be bound directly to the ontology tree; this is only possible through annotations. There is also no better visualization of the model, because a general reaction network is not sufficient for non-experts in the field.

2.2.2 CellML.org

Like Biomodels.net[20], CellML.org[21] is a database of mathematical models of biological systems. It contains over 500 models available to regular users, who can display them and search them by keyword or category. Further features become available after registration and login: the creation, management and publishing of a "workspace", which allows further work with a model through a Mercurial client.

Conclusion about CellML.org

Despite its focus being similar to that of Biomodels.net[20], CellML.org is only a simple database without any additional features that would simplify the researcher's work or explain the models to the general public.

2.2.3 Copasi

Copasi[29] belongs to the second category. After installation on a local machine, it provides the ability to create and manage biological models. A model can contain compartments, species and global parameters, for which reactions can be defined using predefined functions or custom functions. It is also possible to display the differential equations and the stoichiometric matrix of the model. The behavior of the model can then be examined by various analysis methods, including steady-state analysis, stoichiometric analysis, time-course simulation and more. Several further operations can be performed on a model, such as a parameter scan, model optimization, parameter estimation and sensitivity analysis. The processed model can be displayed in a report or an interactive graph. Models can be shared by exporting them to the SBML[7] format or by generating ODEs[9].
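The idea behind a parameter scan combined with parameter estimation can be sketched in a few lines of Python. The sketch is not Copasi's implementation: it assumes a one-parameter decay model A(t) = A0 * exp(-k t), uses synthetic "measurements" generated from k = 0.7 instead of real experimental data, and replaces the optimization methods of a real tool with an exhaustive grid search. All names are hypothetical.

```python
import math


def model(t, k, a0=1.0):
    """Simulated concentration of A at time t for decay constant k."""
    return a0 * math.exp(-k * t)


def sum_of_squares(k, data):
    """Squared distance between simulation and measurements for a given k."""
    return sum((model(t, k) - y) ** 2 for t, y in data)


def estimate_k(data, k_min=0.0, k_max=2.0, steps=2000):
    """Scan k over a regular grid and return the value with the smallest error."""
    candidates = [k_min + (k_max - k_min) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda k: sum_of_squares(k, data))


# Synthetic data generated from k = 0.7; NOT real experimental measurements.
DATA = [(t, model(t, 0.7)) for t in (0.0, 0.5, 1.0, 2.0, 4.0)]
```

Calling estimate_k(DATA) scans the grid and returns the candidate that reproduces the data best. With several parameters the grid grows exponentially, which is why dedicated tools rely on local and global optimization methods instead.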

Conclusion about Copasi

Copasi[29] is a user-friendly tool with many features for developing and managing biological models. Nevertheless, it lacks features required by CMP, including visualization of the model and of its reaction pathways. It also lacks annotation of individual components and support for membranous structures.

2.2.4 Vcell

VCell[23] is a unique computational environment for modeling and simulation of cell biology. It was designed for a wide range of scientists, from experimentally to theoretically oriented ones, and consists of two parts: a web portal, which belongs to the first category of tools, and a desktop application for model development, which belongs to the second. The VCell web portal contains a database of models published on the site. Models can be published directly from the desktop application, but unfortunately they can also be accessed only from this application. In addition to cooperating with its own database of models, the application allows models to be downloaded directly from the Biomodels.net database[20]. Two types of models are supported: biological models and mathematical models. Although the application is not very user friendly, it can define models with support for membranes, containing compartments, species and their reactions. VCell also contains a visualization tool for representing the structure of the reaction network. The created models can be simulated, with the simulation running on a remote server, and a parameter scan can be performed in a similar way. In addition to standard time-series simulation, stochastic simulation is available, which gives more options for model analysis.
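To show how a stochastic simulation differs from a deterministic time course, the following Python sketch applies Gillespie's direct method to a single hypothetical reaction A -> B. It is not stated here which algorithm VCell uses, so the sketch only illustrates the general idea; the function name, the rate constant and the molecule counts are assumptions made for the example.

```python
import random


def gillespie(a0=100, k=1.0, t_end=10.0, seed=0):
    """Simulate A -> B with propensity k * A; return a list of (time, A, B)."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    t, a, b = 0.0, a0, 0
    trajectory = [(t, a, b)]
    while a > 0:
        propensity = k * a
        # The waiting time to the next reaction event is exponentially distributed.
        t += rng.expovariate(propensity)
        if t > t_end:
            break
        a -= 1  # one molecule of A is consumed ...
        b += 1  # ... and one molecule of B is produced
        trajectory.append((t, a, b))
    return trajectory
```

Instead of continuous concentrations, the trajectory consists of individual reaction events at random times, so repeated runs with different seeds give different curves that scatter around the deterministic solution.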

Conclusion about Vcell

VCell[23] meets almost all the needs of CMP but, like all the previously mentioned tools, it has limitations that make it unusable. Despite the support for membrane structures, compartments must be defined in order to use a given membrane. Its visualization is also very simple and unfortunately not available for public use.

2.2.5 E-cell

E-Cell[32] is another tool for modeling and simulation of biochemical and genetic processes. It also facilitates the definition of protein functions, protein-protein interactions and other functions. All these rules generate differential equations, which make it possible to observe the dynamic development of the cell. The software is primarily designed to simulate the whole cell.

Conclusion about E-cell

E-Cell[32] is similar to Copasi[29], with very similar features and functions. Like Copasi, it supports neither volumeless membranes nor detailed visualization.

2.2.6 ProMot

ProMot[22] stands for Process Modeling Tool. It is designed for modeling complex technical processes and biological systems. It is bundled with an extension called Diana, which allows numerical simulation of models. Together with this extension, ProMot supports the development, simulation and analysis of models built on differential algebraic equations. In addition to the internal format for storing data and models, the SBML[7] format can be used for sharing within the community. ProMot also offers good model visualization, which facilitates the definition of the structure of the modeled component.

Conclusion about ProMot

Like the other tools, ProMot[22] is primarily intended for the development and simulation of biological models. Its extended support for the graphical representation of models is a big advantage but is still not sufficient for the requirements of CMP.

2.2.7 PaxTools

PaxTools is not a tool as such but a library that helps other software developers support the BioPAX[5] language in their programs. PaxTools is written in Java, which gives it wide interoperability.

Conclusion about PaxTools

PaxTools is a library intended for further software development and is therefore not directly suitable for modeling. It can, however, be used to support exports to the BioPAX format.

2.2.8 Matlab

Matlab[10] is a high-level language and interactive environment intended for carrying out computationally intensive tasks faster than with C, C++ or Fortran[27], and it therefore falls into the third category of tools, those providing simulation. For biological models it offers numerical simulation. Models are encoded as differential equations, a form that is not suitable for easy management of models. Simulated models can be displayed in a chart, which is also very simple. Because Matlab is not designed primarily for biologists but serves a wide variety of users, many extensions enhance its functionality. One of them is SimBiology, developed by the creators of Matlab. It adds modeling, simulation and analysis of dynamical systems and also includes a stochastic solver and import of SBML[7] models. Other notable features are parameter estimation and sensitivity analysis. The second, competing extension is SBToolBox[30], which unlike SimBiology is open source and easily extensible. The functions added by SBToolBox are roughly the same as those of SimBiology. Despite these extensions, Matlab is more suitable for numerical simulation than for the development and maintenance of models.

Conclusion about Matlab

Matlab[10] is one of the best tools in the field of numerical simulation, but unfortunately it does not meet other requirements, such as storing models and working with them at a higher level. Even with the paid SimBiology extension, the functionality provided is only average. Although this tool is not directly suitable for the development of models, it can be used for numerical simulation.

2.2.9 Octave

Like Matlab[10], Octave[6] is intended for numerical simulation and belongs to the third group of tools. Its disadvantage is that it solves mathematical problems much more slowly than its competitor, but this deficiency is easily overlooked because Octave is free of charge. In terms of numerical simulation, it provides the same functionality as Matlab. Models can likewise be encoded only as differential equations, which is not suitable for the management and development of biological models.

Conclusion about Octave

Like Matlab[10], Octave[6] does not provide the functionality required by the CMP, but it can be used as a tool for evaluating ODEs.

2.2.10 Scilab

Scilab[31] is a direct competitor to Matlab[10] and is also intended for numerical simulation, so it can be used to simulate biological models. Its functions are very similar to those of Matlab, and its only disadvantage is that its calculations are slower. This is easily overlooked because Scilab is a free and open-source tool.

Conclusion about Scilab

Like the previous two mathematical tools, Scilab[31] is unsuitable for direct model editing but can be considered an appropriate alternative when looking for a numerical simulation tool.

2.2.11 BioUML

BioUML[15] is a platform for creating virtual cells and virtual human physiology. It includes a wide range of functions, such as access to a database of experiments and tools for the formal description of the structure and functions of biological organisms, as well as methods for visualization, simulation, parameter fitting and parameter analysis. BioUML consists of two parts: a server, which allows data and analysis methods to be shared between clients, and a client, which is used to work with a particular model. There is also a second, web-based version of the client, which provides most of the functionality of the standard client. Put simply, BioUML is a combination of the BioModels.net[20] database and the Copasi[29] application.

Conclusion about BioUML

The BioUML[15] platform probably comes closest to meeting the requirements of the CMP system. Like all the previous tools, however, it does not provide lucid visualization and display for clients accessing it through a web interface.

2.3 Annotation ontology

Despite the wide range of languages and tools for modeling biological processes, it is not possible to integrate these models into larger units, to reuse them, or to compare them. The MIRIAM[24] standard tries to fill this gap. MIRIAM defines a basic set of rules for annotating quantitative biological models, specifying procedures for encoding and annotating models in machine-readable form. Adding this information should also give greater confidence that models accurately reflect how they are described, and should allow existing models to be reused. Several publicly available databases collect the information needed for ontology annotation.
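A machine-readable annotation of this kind can be pictured as a link from a model element to an entry in an external database. The sketch below is only an illustration under stated assumptions: the class `Annotation`, the element identifier `process_photosynthesis`, the collection name `obo.go` and the URN layout are chosen for the example and should be checked against the MIRIAM documentation before any real use.

```python
from dataclasses import dataclass
from urllib.parse import quote

@dataclass(frozen=True)
class Annotation:
    """One annotation: model element, relation, external resource."""
    element_id: str   # identifier of the annotated model element (hypothetical)
    qualifier: str    # relation between element and resource, e.g. "is"
    collection: str   # name of the external data collection (assumed)
    identifier: str   # entry identifier inside that collection

    def urn(self) -> str:
        """Build a MIRIAM-style URN; the identifier is percent-encoded."""
        encoded = quote(self.identifier, safe="")
        return f"urn:miriam:{self.collection}:{encoded}"

# GO:0015979 is used here as an example Gene Ontology identifier.
note = Annotation("process_photosynthesis", "is", "obo.go", "GO:0015979")
print(note.urn())
```

Keeping the collection and the identifier as separate fields means the same record can later be rendered in another reference syntax without touching the model itself.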

2.3.1 Gene Ontology

Gene Ontology, abbreviated GO[2] (the prefix that marks annotations linking to this database), is one of the major bioinformatics databases. It standardizes the representation of genes and their products across species and databases, and provides a controlled vocabulary of terms for describing and annotating genes.

2.3.2 KEGG

KEGG[13] is a collection of online databases covering genes and enzymatic pathways; it also allows biological models to be placed in a broader context.

2.3.3 SBO

The Systems Biology Ontology[19] is a database of controlled vocabulary terms used in systems biology, especially in computational modeling. It consists of seven branches: roles, quantitative parameters, mathematical expressions describing the systems, modeling frameworks, a branch describing entities, types of interactions, and a branch providing definitions for metadata. SBO terms can be used to give biological models standard semantics.

2.4 Photosynthesis modeling

Although there are countless languages for recording biological models, and an even larger number of tools for modifying or sharing them, there are very few models of photosynthesis or of its parts. Among the few that exist and are well prepared for simulation are the following three: the model of photosystem II by Holzwarth et al., the model of OJIP by Lazar et al., and the model of the photosynthetic apparatus by Laisk et al. Each of these models poses a different problem from the mathematical point of view. The first deals with the probability of transition from one state to another, the second simulates the biological system on a very fast time scale, and the third describes a system with relatively slow processes across the whole photosynthetic apparatus. This diversity may be one reason why photosynthesis is not widespread in the systems biology community.

Chapter 3

Aims

The proposed thesis objectives consist of a theoretical and a practical part. The theoretical part focuses on designing a comprehensive modeling platform, including a complete data model, algorithms for analyzing and consistency checking of models, and the definition of a naming convention for a unified annotation ontology. The practical part is devoted to implementing the methodology proposed in the theoretical part.

3.1 Theoretical aims

• Analysis of the photosynthesis domain, resulting in requirements for the data model, the naming convention for the annotation ontology, and appropriate modeling and analysis techniques

• Specification and design of a comprehensive modeling platform reflecting the requirements

– Data model describing photosynthesis models

– Data model describing annotation ontology

• Specification and design of functionality considered in requirements

– Visualization techniques, including a GUI for hierarchical navigation through models or the ontology and for displaying simulations, experimental data and their comparison

– Analysis techniques providing fast and interactive simulation methods, methods for comparing model simulations with experimental data, parameter scanning and parameter estimation; this requires adapting existing algorithms for online deployment and developing new ones for the comparison functionality

– Sharing capability comprising public availability of models, including technology for exporting models to the most used formats, which involves the non-trivial problem of encoding semantically richer models in less expressive languages while minimizing loss of information

– Maintenance of contained models, targeting security and version control for collaborative model manipulation
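The analysis techniques listed above (parameter scanning and comparison of a simulation with experimental data) can be sketched in a few lines. This is a toy illustration, not the planned algorithm: the model is a single exponential decay with a closed-form solution, the "experimental" series is synthetic, and the names `simulate`, `sum_squared_error` and `scan_parameter` are hypothetical.

```python
import math

def simulate(k, times, a0=1.0):
    """Closed-form solution of dA/dt = -k*A, standing in for a real solver."""
    return [a0 * math.exp(-k * t) for t in times]

def sum_squared_error(simulated, measured):
    """Distance between a simulated and a measured time series."""
    return sum((s - m) ** 2 for s, m in zip(simulated, measured))

def scan_parameter(candidates, times, measured):
    """Evaluate every candidate value and return (best_k, best_error)."""
    scored = [(sum_squared_error(simulate(k, times), measured), k)
              for k in candidates]
    best_error, best_k = min(scored)
    return best_k, best_error

times = [0.0, 0.5, 1.0, 1.5, 2.0]
# Synthetic "experimental" data generated with k = 0.8, for illustration only.
measured = simulate(0.8, times)
candidates = [round(0.1 * i, 1) for i in range(1, 21)]  # 0.1, 0.2, ..., 2.0
best_k, best_error = scan_parameter(candidates, times, measured)
print(best_k, best_error)
```

A grid scan like this is simple to distribute, since every candidate value can be simulated independently; parameter estimation would replace the fixed grid with an optimizer.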

Figure 3.1: Expected structure of specification

3.2 Practical aims

• Use of publicly available open-source platform with appropriate language (e.g. Apache + PHP)

• Use of publicly available database engine and computational engine (e.g. MySQL and Matlab)

• Possibility for cluster layout of hardware for computational engine

• Implementation will be available to the general public

• Hierarchical layout of models with possibility of visualization of whole models and their parts

• Maintenance of models with possibility of annotation

• Simulation of models with graphical output

• Exports of models in all well-known languages (e.g. SBML)

3.3 Methodology

The theoretical part of this work will be consulted with experts in the fields it covers (e.g. researchers from Photon Systems Instruments or CyanoTeam). All ideas and remarks will be presented at weekly meetings, and the feedback received will be incorporated into the next version of the work. Because a large number of standards already exist, these will be used instead of developing new ones, which will include discussion with the communities that develop them.

The practical part will be released to the public after reaching the appointed milestones, while partial tasks will be assigned as topics of bachelor's and master's theses. This work will include leading a small team of students.

3.4 Progression schedule

• Continuous work on the thesis topic including publications in relevant international conferences, workshops and journals

• Presentation of results at Systems biology seminar (Sybila laboratory)

• Participation at regular meetings of CyanoTeam

• Ph.D. final examination – May 2012

• Final version of thesis – May 2014

3.5 Expected Outputs

• Publications in relevant international journals or conferences (at least 2, e.g. ECCB, CMSB, Transactions on and bioinformatics, BioSystems, Database – Oxford journals)

• Implemented software published after each milestone

• Listing of the implemented software on official lists of tools (e.g. the SBML list of tools)

• Supervising at least 2 bachelor's or master's theses on the implementation of selected parts of the system

Chapter 4

Results

In line with the objectives, a draft of the CMP was designed and a part of the system containing basic functions was implemented. The theoretical part of the work consists of the CMP data structure needed for storing models and their annotations or ontology. The practical part describes the part of the system that has already been implemented.

4.1 Design and specification

Despite the large number of existing databases, none contained an ontology structure of photosynthesis at the necessary level of detail. To solve this problem, it was decided to create a dedicated ontology database, similar to other public databases, focused on the photosynthetic apparatus. Storing models and their components in the scope given by the CMP was solved by a multi-layer table structure. For annotation purposes, a table structure was created that allows an annotation to be connected to any meaningful object.

4.1.1 Ontology tree

The ontology module provides the functions necessary for the unified hierarchical structure of models introduced by the CMP. The database structure of the ontology tree matching the current version of the CMP is as follows. The ontology tree is represented by table ep3 specie, which also makes it possible to create several ontology trees if necessary. For every species, it is possible to define a graphical representation that is displayed in the client interface and contains links to other species (parent or child). This definition, however, is held outside the database.
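One common way to store such a tree in a relational table is a self-referencing parent column. The sketch below uses SQLite through Python purely for illustration; it assumes that the table name is written `ep3_specie` in the database, and the columns `id`, `parent_id` and `name` as well as the sample rows are hypothetical and not taken from the actual schema in Figure 4.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ep3_specie (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES ep3_specie(id),  -- NULL marks a tree root
        name      TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO ep3_specie (id, parent_id, name) VALUES (?, ?, ?)",
    [
        (1, None, "thylakoid membrane"),
        (2, 1, "photosystem II"),
        (3, 2, "reaction centre"),
        (4, 1, "photosystem I"),
    ],
)

def subtree(conn, root_id):
    """Return (depth, name) for root_id and all of its descendants."""
    rows = conn.execute("""
        WITH RECURSIVE tree(id, name, depth) AS (
            SELECT id, name, 0 FROM ep3_specie WHERE id = ?
            UNION ALL
            SELECT s.id, s.name, tree.depth + 1
            FROM ep3_specie AS s JOIN tree ON s.parent_id = tree.id
        )
        SELECT depth, name FROM tree ORDER BY depth, name
    """, (root_id,))
    return rows.fetchall()

# Print the tree below node 1, indented by depth.
for depth, name in subtree(conn, 1):
    print("  " * depth + name)
```

Because a root is simply a row without a parent, several independent ontology trees can live in the same table, which matches the possibility mentioned above.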

4.1.2 Model structure

The module for model definition can store models with a hierarchical structure. The model structure itself is stored in two tables: the first contains the definition of the model, and the second stores information about model compartments and components. The model structure can form a tree several levels deep. The table for species contains the definition of the global parameters of the model, which define outside influences on the model.

Figure 4.1: Data model of ontology tree

To create reactions for a model, functions must be available that represent templates of transformations among model species. These templates are stored in table ep3 function, which holds the transformation rules, and table ep3 function item, which holds the number and types of species involved in a transformation. Table ep3 reaction holds the reactions of the model, which are used to generate ODEs[9]. Table ep3 reaction item specifies which element of a reaction represents which model species. This structure allows free-form definitions of photosynthetic models. For every model species, it is possible to define a graphical representation that is displayed in the client interface and contains links to other model species or species. This definition, however, is held outside the database in XML format[8].
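How a list of reactions can be turned into ODEs is sketched below. The example assumes a single mass-action template, which is only one possible kind of transformation rule; the class `Reaction`, the functions `rate` and `derivatives`, and the species names are hypothetical and do not reproduce the contents of the tables described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reaction:
    """One reaction instantiated from an assumed mass-action template."""
    substrates: tuple      # names of consumed species
    products: tuple        # names of produced species
    rate_constant: float

def rate(reaction, state):
    """Mass-action rate: constant times the product of substrate amounts."""
    value = reaction.rate_constant
    for name in reaction.substrates:
        value *= state[name]
    return value

def derivatives(reactions, state):
    """Sum the contribution of every reaction to each species' derivative."""
    dydt = {name: 0.0 for name in state}
    for reaction in reactions:
        v = rate(reaction, state)
        for name in reaction.substrates:
            dydt[name] -= v
        for name in reaction.products:
            dydt[name] += v
    return dydt

reactions = [
    Reaction(("A", "B"), ("C",), 2.0),   # A + B -> C
    Reaction(("C",), ("A", "B"), 0.5),   # C -> A + B
]
state = {"A": 1.0, "B": 3.0, "C": 4.0}
print(derivatives(reactions, state))
```

The reactions stay available as data, so the same list could feed both the simulation and an export, whereas a hand-written equation system keeps only the final right-hand sides.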

4.1.3 Connecting ontology and model

Because the model tree and the ontology tree are different structures, it is necessary to define the relation between them. For this purpose, table ep3 model specie to specie was defined, which supports a multi-binding relation.

4.1.4 Annotation database

The module dedicated to annotation definition was implemented by Jana Pospíšilová, but as an administrator of the system I continue to maintain the module and develop new functions. Table ont terms contains the definitions of terms available for annotation, and table ont relations defines the relations among them. Table ont xrefs defines the relation between the photosynthesis ontology tree and publicly available ontology databases. Because several words are often used to label the same thing, table ont synonyms contains the definitions of such synonyms. Considerable effort goes into maintaining and developing these terms and keeping them up to date with the latest changes.
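The role of the synonym table can be illustrated by a lookup that maps any label to its canonical term. The following sketch again uses SQLite through Python and assumes the table names are written `ont_terms` and `ont_synonyms`; the columns, the sample term and the function `resolve` are hypothetical and serve only to show the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ont_terms (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE ont_synonyms (
        term_id INTEGER NOT NULL REFERENCES ont_terms(id),
        synonym TEXT NOT NULL
    );
    INSERT INTO ont_terms (id, name) VALUES (1, 'photosystem II');
    INSERT INTO ont_synonyms (term_id, synonym) VALUES (1, 'PSII'), (1, 'PS II');
""")

def resolve(conn, label):
    """Map a label to its canonical term name, or None if it is unknown."""
    row = conn.execute("""
        SELECT name FROM ont_terms WHERE name = :label
        UNION
        SELECT t.name FROM ont_terms AS t
        JOIN ont_synonyms AS s ON s.term_id = t.id
        WHERE s.synonym = :label
    """, {"label": label}).fetchone()
    return row[0] if row else None

print(resolve(conn, "PSII"))
```

With such a lookup, annotations entered under different labels end up attached to the same term.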

Figure 4.2: Data model of model structure

4.1.5 Ontology and model annotation

This module assigns meaning to all components available in the system that are used to build a model or an ontology tree; they can be annotated from an outer source or from the proprietary annotation database. Table ep3 annotation contains information describing component properties, while table ep3 annotation assign connects an annotation to the components it describes.
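One way to let a single annotation describe objects of different kinds is an assignment table that stores the object type next to the object identifier. The sketch below shows this pattern with SQLite through Python. It assumes the table names are written `ep3_annotation` and `ep3_annotation_assign`; the columns `object_type` and `object_id`, the sample rows and the function `annotations_for` are hypothetical and may differ from the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ep3_annotation (
        id    INTEGER PRIMARY KEY,
        label TEXT NOT NULL          -- e.g. a term name or external reference
    );
    CREATE TABLE ep3_annotation_assign (
        annotation_id INTEGER NOT NULL REFERENCES ep3_annotation(id),
        object_type   TEXT NOT NULL, -- which kind of object is annotated
        object_id     INTEGER NOT NULL
    );
    INSERT INTO ep3_annotation (id, label) VALUES (1, 'GO:0015979');
    INSERT INTO ep3_annotation_assign VALUES (1, 'model', 7), (1, 'specie', 2);
""")

def annotations_for(conn, object_type, object_id):
    """Return the labels of all annotations assigned to one object."""
    rows = conn.execute("""
        SELECT a.label
        FROM ep3_annotation AS a
        JOIN ep3_annotation_assign AS x ON x.annotation_id = a.id
        WHERE x.object_type = ? AND x.object_id = ?
        ORDER BY a.label
    """, (object_type, object_id))
    return [label for (label,) in rows]

print(annotations_for(conn, "model", 7))
```

The trade-off of this design is that the database cannot enforce a foreign key on `object_id`, so consistency has to be checked by the application.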

Figure 4.3: Data model of relation table

Figure 4.4: Data model of annotation database

4.2 Implemented system

The implemented system contains the basic features necessary for model browsing, simulation and export, but we also decided on several other features to simplify and clarify working with the CMP. The first is a well-arranged interface with a tree-like structure of models and their underlying components, in which every level of the navigation tree contains a graphical representation that places the displayed object in a wider context. The second, intended to give a better understanding of relations among models, highlights displayed models that share the same active component.

Figure 4.5: Data structure of annotation module

As the picture shows, the user can browse models while the structure of the model and its components remains visible at all times. For every displayed detail of a model, species, parameter, component, etc., the detail of the selected object is available together with its assigned annotation and its connections to other objects. The user can also access the parameters and simulation tabs from every level of detail; this was decided to make these functions easier to reach. For every model simulation there are several simulation and display profiles, which allow fast switching among configurations. The simulation tab also offers a model export button, which exports the model to the SBML format.
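What an export to SBML produces can be sketched with the Python standard library. The example below writes only a skeleton with one compartment and a list of species, assuming SBML Level 2 Version 4; the function `export_sbml` and all identifiers are hypothetical, the output is not validated against the SBML schema, and the implemented system's export is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 2 Version 4 (the level and version are assumptions).
SBML_NS = "http://www.sbml.org/sbml/level2/version4"

def export_sbml(model_id, compartment_id, species):
    """Serialize species (name -> initial concentration) as an SBML string."""
    ET.register_namespace("", SBML_NS)
    root = ET.Element(f"{{{SBML_NS}}}sbml", {"level": "2", "version": "4"})
    model = ET.SubElement(root, f"{{{SBML_NS}}}model", {"id": model_id})
    compartments = ET.SubElement(model, f"{{{SBML_NS}}}listOfCompartments")
    ET.SubElement(compartments, f"{{{SBML_NS}}}compartment",
                  {"id": compartment_id, "size": "1"})
    species_list = ET.SubElement(model, f"{{{SBML_NS}}}listOfSpecies")
    for name, concentration in species.items():
        ET.SubElement(species_list, f"{{{SBML_NS}}}species", {
            "id": name,
            "compartment": compartment_id,
            "initialConcentration": str(concentration),
        })
    return ET.tostring(root, encoding="unicode")

document = export_sbml("demo_model", "cell", {"A": 1.0, "B": 0.0})
print(document)
```

Hierarchy and annotations that have no direct counterpart in the target format would have to be flattened or attached as annotation elements, which is where information can be lost.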

Figure 4.6: Preview of user interface

Figure 4.7: Preview of simulation screen

Model administration is available from the administration interface, which also contains the definition of the ontology tree, the annotation tree and annotation assignment. In this interface it is possible to manage the behavior of the whole information system.

Figure 4.8: Preview of administration interface

4.3 Conclusion

The current version of the CMP, with its terminology and implementation, is the result of many meetings and is a big step towards the universality of the whole system. Although it was very difficult to agree on one final version of the CMP, many functions are already implemented in the current version. Many others are still missing and are being worked on.

Chapter 5

Publications

David Šafránek, Jan Červený, Matej Klement, Jana Pospíšilová, Luboš Brim, Dušan Lazár, Ladislav Nedbal. E-photosynthesis: Web-based platform for modeling of complex photosynthetic processes, Biosystems, Volume 103, Issue 2, February 2011, Pages 115-124, ISSN 0303-2647, 10.1016/j.biosystems.2010.10.013. (http://www.sciencedirect.com/science/article/pii/S030326471000198X), Keywords: Biomodels repository; Computational models; Photosynthesis; Systems biology; Web platform

• My share of this work was 40% (it includes the formal definition of the model specification language and its realization by a complete implementation, as well as part of the text on technological issues)

• Content of the article is attached as an appendix of the thesis proposal

Bibliography

[1] G. Antoniou and F. van Harmelen. Web Ontology Language: OWL. In Handbook on Ontologies in Information Systems, pages 67–92, 2003.

[2] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29, May 2000.

[3] V. Danos and C. Laneve. Formal molecular biology. Theoretical Computer Science, 325(1):69–110, 2004.

[4] L. Dematté, C. Priami, and A. Romanel. Modelling and simulation of biological processes in BlenX. SIGMETRICS Perform. Eval. Rev., 35:32–39, March 2008.

[5] E. Demir, M. P. Cary, S. Paley, K. Fukuda, C. Lemer, I. Vastrik, G. Wu, P. D’Eustachio, C. Schaefer, J. Luciano, F. Schacherer, I. Martinez-Flores, Z. Hu, V. Jimenez-Jacinto, G. Joshi-Tope, K. Kandasamy, A. C. Lopez-Fuentes, H. Mi, E. Pichler, I. Rodchenkov, A. Splendiani, S. Tkachev, J. Zucker, G. Gopinath, H. Rajasimha, R. Ramakrishnan, I. Shah, M. Syed, N. Anwar, O. Babur, M. Blinov, E. Brauner, D. Corwin, S. Donaldson, F. Gibbons, R. Goldberg, P. Hornbeck, A. Luna, P. Murray-Rust, E. Neumann, O. Reubenacker, M. Samwald, M. van Iersel, S. Wimalaratne, K. Allen, B. Braun, M. Whirl-Carrillo, K.-H. Cheung, K. Dahlquist, A. Finney, M. Gillespie, E. Glass, L. Gong, R. Haw, M. Honig, O. Hubaut, D. Kane, S. Krupa, M. Kutmon, J. Leonard, D. Marks, D. Merberg, V. Petri, A. Pico, D. Ravenscroft, L. Ren, N. Shah, M. Sunshine, R. Tang, R. Whaley, S. Letovksy, K. H. Buetow, A. Rzhetsky, V. Schachter, B. S. Sobral, U. Dogrusoz, S. McWeeney, M. Aladjem, E. Birney, J. Collado-Vides, S. Goto, M. Hucka, N. L. Novere, N. Maltsev, A. Pandey, P. Thomas, E. Wingender, P. D. Karp, C. Sander, and G. D. Bader. The BioPAX community standard for pathway data sharing. Nature Biotechnology, 28(9):935–942, Sept. 2010.

[6] J. W. Eaton, D. Bateman, and S. Hauberg. GNU Octave Manual Version 3. Network Theory Ltd, 1997.

[7] A. Finney and M. Hucka. Systems Biology Markup Language: Level 2 and Beyond, volume 31, part 6 of Biochemical Society Transactions. 2003.

[8] C. F. Goldfarb and P. Prescod. XML handbook. Prentice Hall, 2000.

[9] H. Gutfreund. Kinetics for the life sciences: receptors, transmitters and catalysts. Cambridge University Press, 1995.

[10] D. Hanselman and B. C. Littlefield. Mastering MATLAB 5: A comprehensive tutorial and reference. Prentice Hall, 1997.

[11] H. Hermjakob et al. The HUPO PSI’s molecular interaction format–a community standard for the representation of protein interaction data. 2004.

[12] I. Horrocks. OBO Flat File Format Syntax and Semantics and Mapping to OWL Web Ontology Language. Technical report, University of Manchester, Mar. 2007.

[13] M. Kanehisa and S. Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Oxford University Press, 2000.

[14] H. Kitano, editor. Foundations of Systems Biology. The MIT Press, 2001.

[15] F. A. Kolpakov. BioUML - framework for visual modelling and simulation biological systems. In Proceedings of the International Conference on Bioinformatics of Genome Regulation and Structure, 2002.

[16] A. Kriete and R. Eils, editors. Computational Systems Biology. Elsevier, 2006.

[17] P. Kunkel and V. Mehrmann. Differential-Algebraic Equations: Analysis and Numerical Solution (EMS Textbooks in Mathematics). European Mathematical Society, Feb. 2006.

[18] A. Laisk, L. Nedbal, and Govindjee. Photosynthesis in silico. Understanding complexity from molecules to ecosystems. Springer Science+Business Media, 2009.

[19] N. Le Novere, M. Courtot, and C. Laibe. Adding semantics in kinetics models of biochemical pathways. 2007.

[20] C. Li, M. Donizelli, N. Rodriguez, H. Dharuri, L. Endler, V. Chelliah, L. Li, E. He, A. Henry, M. I. Stefan, J. L. Snoep, M. Hucka, N. Le Novère, and C. Laibe. BioModels Database: An enhanced, curated and annotated resource for published quantitative kinetic models. BMC Systems Biology, 4:92, Jun 2010.

[21] C. M. Lloyd, M. D. Halstead, and P. F. Nielsen. CellML: its future, present and past. Progress in biophysics and molecular biology, 85(2-3):433–450, July 2004.

[22] S. Mirschel, K. Steinmetz, M. Rempel, M. Ginkel, and E. D. Gilles. Promot: modular modeling for systems biology. Bioinformatics, 25(5):687–689, 2009.

[23] I. I. Moraru, J. C. Schaff, B. M. Slepchenko, M. L. Blinov, F. Morgan, A. Lakshminarayana, F. Gao, Y. Li, and L. M. Loew. Virtual Cell modelling and simulation software environment. IET Systems Biology, 2(5):352–362, Sept. 2008.

[24] N. L. Novere, A. Finney, and M. Hucka. Minimum information requested in the annotation of biochemical models (MIRIAM). 2005.

[25] N. L. Novere, M. Hucka, et al. The systems biology graphical notation. 2009.

[26] A. Phillips and L. Cardelli. Efficient, correct simulation of biological processes in the stochastic pi-calculus. In CMSB, pages 184–199, 2007.

[27] W. H. Press, S. A. Teukolsky, and W. T. Vetterling. Numerical recipes: the art of scientific computing. Cambridge University Press, 2007.

[28] A. Regev, E. M. Panina, W. Silverman, L. Cardelli, and E. Shapiro. Bioambients: an abstraction for biological compartments. Theoretical Computer Science, 325(1):141–167, 2004.

[29] R. G. S. Hoops, S. Sahle. COPASI - a COmplex PAthway SImulator. Oxford University Press, 2006.

[30] H. Schmidt and M. Jirstrand. Systems Biology Toolbox for MATLAB: a computational platform for research in systems biology. 2005.

[31] Scilab Consortium. Scilab: The free software for numerical computation. Scilab Consortium, Digiteo, Paris, France, 2011.

[32] M. Tomita, K. Hashimoto, K. Takahashi, T. S. Shimizu, Y. Matsuzaki, F. Miyoshi, K. Saito, S. Tanida, K. Yugi, J. C. Venter, and C. A. Hutchison. E-cell: software environment for whole-cell simulation. Bioinformatics, 15(1):72–84, 1999.

[33] W3C. Mathematical Markup Language (MathML) Version 2.0. W3C Recommendation, Feb. 2001.

List of Figures

3.1 Expected structure of specification

4.1 Data model of ontology tree

4.2 Data model of model structure

4.3 Data model of relation table

4.4 Data model of annotation database

4.5 Data structure of annotation module

4.6 Preview of user interface

4.7 Preview of simulation screen

4.8 Preview of administration interface