<<

The Pennsylvania State University

The Graduate School

The Huck Institute of the Life Sciences

FORMAL METHODS FOR GENOMIC DATA INTEGRATION

A Thesis in

Integrative Biosciences

by

Nigam Shah

© 2005 Nigam Shah

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

August 2005

The thesis of Nigam Shah was reviewed and approved* by the following:

Nina V. Fedoroff
Willaman Professor of Life Sciences and Evan Pugh Professor
Acting Co-Director, Integrative Biosciences Graduate Program, Huck Institutes of the Life Sciences
Thesis Advisor
Chair of Committee

Mark D. Shriver
Associate Professor of Anthropology and Genetics

Wojciech Makalowski
Associate Professor of Biology

Francesca Chiaromonte
Associate Professor of Statistics and Health Evaluation Sciences

Gustavo A. Stolovitzky
Manager, Functional Genomics & Systems Biology, IBM T.J. Watson Research Center
Special Member

*Signatures are on file in the Graduate School

ABSTRACT

The rapid growth of life sciences research and the associated literature over the past decade, the rapid expansion of biological databases, and the invention of high-throughput techniques that permit collection of data on many genes and proteins simultaneously have created an acute need for new computational tools to support the biologist in collecting, evaluating and integrating large amounts of information of many disparate kinds. This thesis presents methods for the representation, manipulation and conceptual integration of diverse biological data with prior biological knowledge to facilitate both interpretation of data and evaluation of hypotheses.

We have developed a tool (called CLENCH) that assists in the interpretation of gene lists resulting from microarray data analysis by integrating and visualizing Gene Ontology (GO) annotations and transcription factor binding site information with gene expression data. During the development of CLENCH, it became evident that developing a unified framework for representing prior knowledge and information can increase our ability to integrate new data with existing knowledge.

In subsequent work, we developed the HyBrow (Hypothesis Browser) system as a prototype tool for designing hypotheses and evaluating them for consistency with existing knowledge. HyBrow consists of a conceptual framework with the ability to represent diverse biological information types, an ontology for describing biological processes at different levels of detail, a database to query information in the ontology, and programs to design, evaluate and revise hypotheses. We demonstrate the HyBrow prototype using the galactose gene network in Saccharomyces cerevisiae as a test system.

Along with the increase in available information, knowledgebases, which provide structured descriptions of biological processes, are proliferating rapidly. In order to support computer-aided information integration tools like HyBrow, a knowledgebase should be trustworthy and it should structure information in a sufficiently expressive manner to represent biological systems at multiple scales. We extend and adapt the conceptual framework underlying HyBrow and use it to verify the trustworthiness and usefulness of the Reactome knowledgebase.

TABLE OF CONTENTS

LIST OF FIGURES vi
LIST OF TABLES vii
ACKNOWLEDGEMENTS viii

Chapter 1: Introduction 1

Chapter 2: Managing and interpreting large-scale gene expression data 2
   Managing high volume microarray data 3
   Using the Gene Ontology for interpreting microarray expression datasets 8
   Signaling pathways as an organizing framework for expression data 17
   Summary 19

Chapter 3: Towards a unified formal representation for genomics data 21
   Challenges for developing a unified formal representation 23
   Description of relevant related efforts 28

Chapter 4: A novel conceptual framework 32
   Extensions to the conceptual framework 34
   Comparison with other conceptual frameworks 36

Chapter 5: Prototype implementation of HyBrow 42
   Hypothesis ontology 43
   Inference rules and constraints 48
   Database and information gathering 52
   User interfaces 54
   The hypothesis evaluation process 54
   Test runs with sample hypotheses 57

Chapter 6: Lessons learned from the prototype 60
   Revision of the hypothesis ontology 61
   Bottleneck for structuring data and role of knowledgebases 62

Chapter 7: Comparison with related efforts 65
   The Riboweb project 66
   Modeling biological processes as workflows 66
   Pathway Logic 67
   Summary 68
   Role of Knowledgebases 68

Chapter 8: Proofreading the Reactome knowledgebase 70
   Background 71
   Methods 72
   Results 76
   Summary 83

Chapter 9: Summary and future directions 85
   Future directions 86

References 88

Appendix A – of the hypothesis grammar 95

Appendix B – Using the GUI 96

LIST OF FIGURES

Figure 1 Flow-chart showing the microarray data preprocessing pipeline. 6
Figure 2 The types of plots that can be made by ProcessGprfile.pl. 7
Figure 3 Visualizing the expression, annotation and TF binding site data. 11
Figure 4 Directed acyclic graph showing the relationships among GO categories 12
Figure 5 A sample row from the CLENCH result table. 13
Figure 6 Components of a formal representation. 27
Figure 7 Examples of different types of ontology specifications. 46
Figure 8 An overview of the ontology. 47
Figure 9 Outline of the binds to prompter rule. 51
Figure 10 Screen shots of the visual and widget interfaces. 54
Figure 11 The hypothesis evaluation process. 56
Figure 12 Screen shot of the result page 57
Figure 13 Properties of agents in the revised ontology. 61

LIST OF TABLES

Table 1: A comparison of the properties of different conceptual frameworks 41
Table 2: Numbers of Well-Formed Pathways 83
Table 3: Property comparison for the latest releases of Reactome 83

ACKNOWLEDGEMENTS

First of all I would like to acknowledge my advisor, Nina Fedoroff, for her mentoring and support throughout my graduate studies. She has had the most profound influence on the way I think (and write!) about science and my approach to research in general. I feel privileged to have studied under a scientist of her stature. I would also like to acknowledge Stephen Racunas, my colleague and a very dear friend, for making my graduate studies at Penn State a memorable and enriching experience. I feel honored to know and work with someone like him.

I am also very grateful to Dilip Desai, a close family friend, who along with my parents (Haresh and Chaula Shah) has played a very major role in shaping my personality and outlook towards life. Finally and most importantly, I am grateful to my wife Prachi for always being with me and for her unconditional love during the ups and downs of graduate life.


Chapter 1: Introduction

With the advent of high-throughput technologies, molecular biology is undergoing a revolution in terms of the amount and types of data available to the scientist. On the one hand, there is an abundance of individual data types such as gene and protein sequences, gene expression data, protein structures, protein interactions and annotations. On the other hand, there is a shortage of tools and methods that can handle this deluge of information and allow a biologist to draw meaningful inferences. A significant amount of time and energy is spent in merely locating and retrieving information rather than thinking about what that information means. In this situation it becomes extremely difficult to integrate current knowledge about the relationships within biological systems and formulate hypotheses about a large number of genes and proteins[1]. It becomes difficult to determine whether the hypotheses are consistent internally or with data, to refine inconsistent hypotheses and to understand the implications of complicated hypotheses[2]. This situation needs to be rectified: tools need to be developed that automate repetitive tasks and that allow the biologist to query and interpret the information at hand[3].

My thesis work is focused on developing methods for integrating large data sets with prior biological knowledge to facilitate their interpretation. My initial efforts were focused on interpreting results from microarray expression data using the Gene Ontology and known biological pathways. During this work, which is described in the next chapter, it became evident that explicitly structuring prior knowledge and formally representing current information facilitates the integration of new data with prior knowledge by increasing our ability to fit the new data into the big picture. Subsequently, in collaboration with Stephen Racunas, an engineering doctoral student, we developed a prototype system for integrating biological data and existing knowledge in an environment that supports the formulation and evaluation of alternative hypotheses about biological systems. This work is described starting from chapter three.

Chapter 2: Managing and interpreting large-scale gene expression data

Microarray technology is a high-throughput method for measuring the expression levels of thousands of genes in parallel. It is also the most widely used of the several high-throughput technologies for collecting data on the levels of various biological entities, such as mRNAs and proteins, in cells. My efforts to manage genomic data were focused on preprocessing, analyzing and interpreting microarray gene expression data. I developed programs for rapid preprocessing of raw microarray data and for interpreting the gene groups that result from analyzing those data.

While developing these methods for interpreting microarray data, it became evident that explicit articulation of prior knowledge is needed in order to use it during microarray data interpretation. Currently, the main resource of structured knowledge on the function of gene products is the Gene Ontology. The goal of the Gene Ontology project is to produce a controlled vocabulary that can unambiguously describe knowledge about the role of genes and proteins in cells. The notion of “pathways” serves as an alternative, more rigorous, mechanism for structuring prior knowledge about the functions and interactions of gene products in cells. Such structuring of knowledge allows it to be used computationally while analyzing expression data and serves as a reference framework for integrating new data with prior knowledge. Having such a framework also allows formulation of more focused queries to the available data, which return more specific results and facilitate interpretation by increasing our ability to fit the results into the “big picture”. Specifically, focused queries provide formal mechanisms to evaluate statements such as “The ABC signaling pathway is-involved-in the causation of cancer X” using high-throughput data such as microarray expression data. It would be extremely valuable to be able to make such statements, or sets of statements comprising a hypothesis about biological processes, and systematically query a wide variety of datasets for evaluating them.

In this chapter I will describe my efforts to interpret large microarray datasets and how they led to the more general idea of formally representing both the data and the set of statements we wish to verify using the data.


Managing high volume microarray data

Microarrays, which measure genome-wide RNA expression levels, have become an established technology for measuring the expression of thousands of genes in parallel. Microarray experiments produce huge matrices of numbers that represent the transcript level, either relative or absolute, of the genes in the genome of an organism. Because microarray experiments are complex and involve a large number of procedural steps, they require stringent quality control and data preprocessing to ensure that the data they produce are reliable. These quality control and preprocessing steps do not constitute ‘research’; they are time-consuming and repetitive, and therefore need to be automated. I developed a group of six Perl scripts[4] that support rapid preprocessing of Genepix (a common microarray image analysis application) result files: they flag bad data points with respect to a number of parameters and normalize data in one of three different ways[5]. Separate scripts compile result files into comprehensive datasets that can be analyzed by packages such as Cluster[6] and SAM[7]. Additional scripts evaluate the quality of technical and biological replicates and perform row and/or column centering.

The FlagAndNormalize program accepts the gpr file output from Genepix and a list of positive spiking controls and negative controls. The user decides whether to use the mean or median pixel intensities in calculating the expression ratio, which is computed using the following formula:

X = [FG − BG]_treatment channel / [FG − BG]_control channel

where X = gene expression ratio, FG = foreground mean or median pixel intensity for a spot and BG = background mean or median intensity for a spot. The program then flags 1) spots called 'bad' by the Genepix program; 2) spots with a low signal-to-noise ratio (SNR) in either channel (SNR = FG − (BG + 2SD)); 3) spots whose intensity is low in both channels; and 4) spots for which the FG-mean and FG-median intensities differ by more than 20%[8]. The script calculates a spike-normalization factor based on spiking control spots that were called good and a whole-chip normalization factor based on all spots that were not flagged. The two normalization factors are then used to scale the expression ratio (X) for each gene and adjust for differences in red and green channel intensities. The normalized ratio columns are appended to the gpr file. Existing programs such as Gp3[9] do not allow for spiking normalization, which is essential for specialized microarrays[10].

The ColumnPicker program accepts a list of files made by the previous program and a list of column headings to compile a dataset in which each column (or group of columns) represents a microarray slide and the rows represent spots on the slide. This is the most common format used as input for expression analysis programs.

The HandleTechReplicates program accepts the dataset created by the ColumnPicker program, which has two columns for each microarray slide: a data column and a flag column. It writes an output file of averaged values from replicate spots, and provides information that is useful in identifying the rows to be kept for further analysis. It counts the flags for each replicate spot and appends a column showing the percentage of slides on which replicate spots were flagged, as well as one that gives the number of slides on which the ratios for the two replicate spots differed by more than 20%. It also removes the data rows for the positive and negative control spots. The program provides the following additional statistics for time course data: 1) the correlation between the profiles of the replicate spots for a gene and 2) the mean ratios for the rows of replicate spots for each gene. The output of this program can easily be manipulated in Excel to achieve the desired row filtering.

Once the selection of "good spots" or "good rows" is made, the following additional scripts can be used. The AverageBioReps program accepts the outputs of the HandleTechReps program and averages the biological replicate expression values for each gene or row. It also reports the correlation between the expression profiles for each gene (or row), an average correlation value for all the rows analyzed and a statistic showing the number of instances (or columns) in which the values from the two replicates differed by more than 20%. The CenterByRowOrColumn program takes the output of the HandleTechReps program, or a tab-delimited text file in which rows are genes and columns contain expression values from one experiment, and centers the data matrix by row, column or both. The user specifies the number of iterations to perform while centering. Figure 1 provides an overview of the analysis pipeline formed by these scripts.

We found that using six scripts as changeable components of an analysis pipeline, though a very flexible approach, was not very user friendly for experimental biologists. Therefore, I have rewritten the six scripts as one script called ProcessGprFiles.pl. All the subroutines that perform the flagging, normalization, filtering and averaging of replicates are encoded as a Perl module, Gprfile.pm, which is used by the ProcessGprFiles.pl script. This script accepts a list of gpr files and performs flagging, normalization, averaging and filtering of replicate spots to compile a dataset from the input files that can be used in clustering and analysis programs. It provides the option of imputing missing values, resulting from bad spots, with the row average and of log-transforming the data. The script can also generate plots (shown in Figure 2), such as a histogram of the ratio distribution and ratio-intensity plots for each input gpr file, to assess the quality of the input data[5], as well as cross-check each input gpr file with the gene array list file used to generate the result files. The script and the Gprfile.pm module are available for download at http://wartik19.biotec.psu.edu/Gprfile.pm
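To make the per-spot processing concrete, the following sketch (in Perl, like the scripts themselves) illustrates the expression-ratio formula and the four flagging criteria described above. It is a simplified illustration rather than the actual FlagAndNormalize code; the field names of the spot record (fg_trt, bg_trt, genepix_flag and so on) and the intensity cutoff of 200 are assumptions made only for this example.

    # Simplified sketch of the per-spot ratio and flagging logic (not the
    # actual FlagAndNormalize code); the hash keys of %spot are assumed names.
    use strict;
    use warnings;

    # X = (FG - BG) in the treatment channel / (FG - BG) in the control channel.
    # fg_trt and fg_ctl hold whichever of the mean or median FG intensity the
    # user chose to work with.
    sub expression_ratio {
        my ($spot) = @_;
        my $denominator = $spot->{fg_ctl} - $spot->{bg_ctl};
        return undef if $denominator <= 0;    # cannot form a meaningful ratio
        return ( $spot->{fg_trt} - $spot->{bg_trt} ) / $denominator;
    }

    # Return a reason string if the spot should be flagged, otherwise 'ok'.
    sub flag_spot {
        my ($spot) = @_;
        return 'genepix_bad' if $spot->{genepix_flag} < 0;      # called bad by Genepix
        for my $ch (qw(trt ctl)) {                              # low SNR in either channel
            my $snr = $spot->{"fg_$ch"} - ( $spot->{"bg_$ch"} + 2 * $spot->{"bg_sd_$ch"} );
            return 'low_snr' if $snr <= 0;
        }
        return 'low_intensity'                                  # dim in both channels
            if $spot->{fg_trt} < 200 && $spot->{fg_ctl} < 200;  # 200 is an arbitrary example cutoff
        return 'mean_vs_median'                                 # FG mean and median differ by > 20%
            if abs( $spot->{fg_mean_trt} - $spot->{fg_median_trt} ) > 0.2 * $spot->{fg_median_trt};
        return 'ok';
    }

    # Example spot record (values are illustrative only).
    my %spot = (
        genepix_flag => 0,
        fg_trt => 1200, bg_trt => 150, bg_sd_trt => 40, fg_mean_trt => 1200, fg_median_trt => 1100,
        fg_ctl => 900,  bg_ctl => 140, bg_sd_ctl => 35,
    );
    printf "flag=%s ratio=%.2f\n", flag_spot(\%spot), expression_ratio(\%spot);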


[Figure 1 flow chart: generate .gpr files from the microarray images using Genepix; use FlagAndNormalize to flag bad spots and calculate normalized gene expression ratios; run ColumnPicker to get the desired ratio (and/or intensity) columns from the normalized files; use HandleTechReps to assess the quality of technical replicate spots; use AverageBioReps to average the data files from biological replicates and assess the correlation between the replicates; use CenterByRowOrColumn to center the data matrix by row, column or both; pass the final dataset to existing packages to identify “significantly changing genes” and form a list of “interesting genes”.]

Figure 1 Flow-chart showing the microarray data preprocessing pipeline formed by the scripts.


Figure 2 The types of plots that can be made by ProcessGprfile.pl.

(A) Ratio-intensity plot showing raw expression data in red and normalized data in black. The horizontal green lines show 2-fold cutoffs. (B) Ratio-intensity plot where the control spots used to perform normalization are colored black and bad data points have been removed. (C) Ratio-intensity plot where genes whose expression ratios differ by more than 2 standard deviations from the mean are colored red. (D) A histogram showing the ratio-distribution for the raw (red) and normalized (black) expression ratios as well as expression ratios from control spots (blue).


Using the Gene Ontology for interpreting microarray expression datasets

Once data preprocessing is performed and the data are subjected to statistical testing for significance, the experimental biologist is often confronted with long lists of “interesting genes”[11]. Extracting biological meaning from the increase (or decrease) of the mRNA levels of the listed genes is an important but time-consuming task, the difficulty of which is exacerbated by the lack or incompleteness of structured gene annotations. In such a situation, the controlled vocabulary and the hierarchical classification of the vocabulary terms developed by the Gene Ontology (GO) consortium[12] is a very useful resource for the initial interpretation of gene lists. GO’s hierarchical classification of terms provides a basis for grouping genes in an unambiguous manner. By determining whether GO terms associated with a particular biological process, molecular function or cellular component are either over- or under-represented among the genes in each subgroup generated by a clustering algorithm, it may be possible to relate the alterations in the gene expression levels to existing knowledge about biological processes and gain insight into the biological significance of the alterations. I have developed a tool named CLENCH to assist in the task of retrieving GO annotations and determining the over-representation of GO terms, in order to assist in interpreting gene lists resulting from microarray data analysis.

Another common analysis performed on gene lists obtained from statistical tests on microarray data is to analyze the promoters of the genes assigned to one cluster for over-representation of known transcription factor (TF) binding sites. It is also informative to perform the same analysis after grouping the genes in clusters based on their annotations. The underlying assumption for this analysis is that if genes with similar expression profiles and similar functional annotation share a common TF binding site (or sites), it could indicate co-regulation of those genes. Along with analyzing GO annotations, CLENCH also analyzes the promoters of the genes having a common annotation for the over-representation of known TF binding sites, to further assist in interpreting the gene expression changes.

Although there are several programs that carry out GO category assignments and analysis for human, mouse and yeast genes, none of them is useful for Arabidopsis genes because they do not accept AGI identifiers. Moreover, existing programs do not analyze the TF binding sites in the promoters of genes and do not allow simultaneous visualization of expression, annotation and TF binding site data[13-15]. CLENCH – from CLuster ENriCHment – calculates functional enrichment, using the Gene Ontology, in gene groups resulting from the analysis of gene expression data. It also analyzes the promoters of genes assigned to each GO category for enrichment of known TF binding sites. Finally, CLENCH displays the expression, annotation and transcription factor binding site information for each subgroup of genes formed on the basis of GO categories. CLENCH uses TAIR[16] as the source of annotations and promoter sequences because it is the most reliable public source of curated information for Arabidopsis thaliana.
For calculating functional enrichment, the program accepts two lists of genes: 1) Total-Genes (the set of genes that each “cluster” is to be compared with, which may be all of the genes in the genome, all of the genes on the microarray or even the complete list of genes identified as differentially expressed) and 2) Changed-Genes (a subset of genes, for example a cluster of genes, that is to be analyzed for enrichment of functional categories). The program retrieves GO annotations for both gene lists to calculate the number of genes (n and m) belonging to a particular GO category in each list and then calculates the hypergeometric probability (p-value) of finding at least n genes belonging to that category among the N genes in the Changed-Genes list, given that m genes were annotated to that category among the M genes in the Total-Genes list. This p-value is calculated using the following formula:

P_n(M, m, N) = [C(m, n) × C(M − m, N − n)] / C(M, N)

where C(a, b) denotes the binomial coefficient ("a choose b"). The p-value tells us how likely it is to find at least n genes of a particular category in the Changed-Genes list by chance alone, given the number of genes in that category in the total set. A category is called enriched if the p-value is less than 0.05.
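To make the calculation concrete, the following sketch reimplements the standard hypergeometric tail probability in log space, so that genome-scale counts do not overflow. It is only an illustration of the calculation described above, not the actual CLENCH code; the subroutine names and the example numbers are invented for this example.

    # Sketch of the enrichment p-value calculation (illustration only).
    use strict;
    use warnings;

    # Natural log of the binomial coefficient C(n, k), computed in log space.
    sub ln_choose {
        my ($n, $k) = @_;
        return undef if $k < 0 || $k > $n;
        $k = $n - $k if $k > $n - $k;
        my $sum = 0;
        $sum += log( ($n - $k + $_) / $_ ) for 1 .. $k;
        return $sum;
    }

    # P(at least n of the N Changed-Genes belong to a category that contains
    # m of the M Total-Genes).
    sub hypergeometric_p {
        my ($n, $N, $m, $M) = @_;
        my $ln_total = ln_choose($M, $N);
        my $max      = $m < $N ? $m : $N;
        my $p        = 0;
        for my $i ( $n .. $max ) {
            my $a = ln_choose($m, $i);
            my $b = ln_choose($M - $m, $N - $i);
            next unless defined $a && defined $b;   # skip impossible terms
            $p += exp( $a + $b - $ln_total );
        }
        return $p;
    }

    # Example: 12 of 200 changed genes fall in a category containing 150 of
    # the 6000 genes in the Total-Genes list.
    printf "p = %.3g\n", hypergeometric_p(12, 200, 150, 6000);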

For analyzing the promoters for known transcription factor binding sites, the program requires an additional file containing the names and consensus sequences of the TF binding sites to search for. CLENCH retrieves the promoter sequences for the genes in the Total-Genes list and determines the number of promoters X_TF containing each TF binding site. Then, for each binding site, in each group of genes assigned to each GO category, it determines the number of promoters Y_TF containing that binding site and calculates the hypergeometric probability (p-value) of finding at least that many promoters containing that binding site, given that there are X_TF promoters containing that site in the Total-Genes list.

In order to visualize the expression profiles, CLENCH requires a tab-delimited file containing the microarray data. CLENCH results are presented as an HTML page containing the following: 1) overview images, shown in Figure 3, which provide a bird's-eye view of the TF binding site, annotation and expression data; and 2) a directed acyclic graph (DAG), shown in Figure 4, and a table for each primary GO division (molecular function, biological process and cellular component), shown in Figure 5. The DAG shows the parent-child relationships among the GO categories of that GO division, and the table contains rows corresponding to each node in the DAG with fields for the name and id of the category, the p-value for enrichment of the category, the genes annotated to the category, the expression profiles for those genes, an image showing the TF binding sites in the promoters of those genes and an image showing the annotations for those genes in the other GO divisions. The annotation image allows easy cross-referencing of annotations in the three divisions of the GO. Each GO id in the result is hyper-linked to the AmiGO browser at the Gene Ontology Consortium’s website to allow easy access to the definitions of each GO category. The gene identifiers are linked to their TAIR annotations.


Figure 3 Overview images visualizing the expression, annotation and TF binding site data. The TF binding site image represents a TF binding site in each row and a promoter in each column. A filled square represents a promoter containing one or more copies of the TF binding site corresponding to that row; a red colored row denotes an enriched TF binding site in the set of promoters analyzed. The annotation image represents a GO category in each row and a gene in each column. A filled square represents a gene annotated with the GO category corresponding to that row; a red colored row denotes an enriched GO category in the set of genes analyzed. The expression image visualizes the microarray data for the genes.

Figure 4 The directed acyclic graph (DAG) showing the parent-child relationships among the GO categories under the GO division of biological process. Nodes colored red indicate GO categories found enriched in the set of genes analyzed.

Figure 5 A sample row from the CLENCH result table corresponding to the ‘photosynthesis’ node in the GO division biological process shown in the DAG in Figure 4. The expression profiles plot shows the microarray data for the genes annotated with the label photosynthesis (the red profile is the average profile). The bold numbers above the plot show the number of genes in the Changed-Genes list annotated with that label and the numbers in parentheses show the number of genes with that label in the Total-Genes list.

The TF binding site image represents a TF binding site in each row and a promoter in each column. A filled square represents a promoter containing one or more copies of the TF binding site corresponding to that row; a red colored row denotes an enriched TF binding site in the set of promoters analyzed.

The annotation image shows those GO categories from the molecular function and cellular component divisions of the GO that are used to annotate the photosynthesis genes. The annotation image represents a GO category in each row and a gene in each column. A filled square represents a gene annotated with the GO category corresponding to that row; a red colored row denotes an enriched GO category in the set of genes analyzed. All three images are hyperlinked to their larger versions.

The p-values reported by CLENCH should be interpreted with caution because the Total-Genes list to which the Changed-Genes list is being compared affects the p-value. For microarrays that comprise a random subset of all genes, using the list of all genes on the microarray as the Total-Genes list is equivalent to using the complete list of genes in the genome. However, for microarrays containing a selected subset of genes associated with a biological process, the choice of the gene set to use as the total set is not obvious. Comparing with all genes in the genome is very lenient because it biases the results towards categories that were already enriched during the pre-selection process. Comparing with the set of arrayed genes is too stringent because it biases the results against categories that were enriched during the pre-selection process. Thus, depending on the choice of the total set, the p-values should be interpreted in light of other available information, keeping in mind what is known about the involvement of a particular biological process in the process under study, the number of genes in the category reported to be significant and the relative size of the category in the two lists. To aid in this process, CLENCH reports the number of genes assigned to each category in the Changed-Genes list, the ratio of the percentage of genes in each GO category in the two lists analyzed (named “relative enrichment” or RE), and p-values calculated using the Chi-square and binomial distributions.

Another caveat of deciding significance using the calculated p-value (and a cutoff of 0.05) is the multiple testing that occurs. Multiple testing happens because we do not pre-select which category to test for enrichment, but rather test each existing category. This allows multiple opportunities (equal to the number of categories tested) to obtain a statistically significant p-value, by chance alone, from a given gene list. However, correcting for this (for example using a Bonferroni correction, in which the critical p-value cutoff is divided by the number of tests made) is too restrictive and is not advised, because the correction assumes independence of categories and even truly enriched categories are not detected[11]. Instead, CLENCH can estimate the false discovery rate (FDR) for the enriched categories by performing simulations in which the program generates a user-specified number of random gene lists of the same size as the Changed-Genes list and calculates the average number of categories that are considered enriched in the random gene lists at a p-value cutoff of 0.05. The FDR provides an estimate of the percentage of false positives among the categories considered enriched in the Changed-Genes list. If the FDR is above a user-specified cutoff, CLENCH will lower the p-value cutoff for declaring a category enriched in order to reduce the FDR to acceptable levels. The reduction in the p-value cutoff can be performed in one of the following two ways: 1) “globally”, where the p-value cutoff is reduced by the same amount for all categories; or 2) in a “stepped” fashion, where the cutoff is reduced by a larger amount for smaller categories and by a smaller amount for larger categories, depending on the specific FDR for a category of that size.

Annotation of genes and proteins by TAIR is an ongoing effort; therefore, there may be categorization errors and some categories may not yet be completely covered. One approach to managing such inconsistencies is to use GO Slim, a list of high-level GO terms covering all three GO categories[17]. The GO Slim terms convey biological meaning at a coarser level of resolution, and each fine-level annotation can be mapped to a GO Slim term before performing the functional analysis[17]. To support such an approach, CLENCH can also accept a list of GO categories at a coarser resolution level and automatically map the annotations returned by TAIR to the coarse-level terms. In order to map the fine-level gene annotations to user-defined coarse levels, CLENCH uses a local installation of the GO database[18] and searches the path formed by the parent-child relationships between terms, from the fine-level annotation towards the root of each GO category. The first term from the coarse-term list found in the path is assigned as the annotation of the gene. In cases of multiple parents for a fine-level term, the user can choose to assign all the parent terms, or just the coarse term nearest to the fine-level annotation and farthest from the root, as the annotation of the gene. Such mapping to coarse terms is particularly useful when the annotation is very sparse and direct analysis results in a long list of categories found ‘enriched’, but each containing just one gene.

CLENCH is written in Perl 5.6, runs on Windows NT/2000 and XP platforms and is available along with detailed documentation at www.personal.psu.edu/nhs109/Clench. It uses a local MySQL installation of the Gene Ontology database for mapping TAIR annotations to user-defined coarser levels. These files are available from the GO consortium and MySQL is available from www.mysql.com (both free). By default, CLENCH does not use local annotation files; instead it fetches annotations from TAIR during execution, and hence updates made by TAIR are immediately passed on to users. However, if required, it can be configured to use local files for both annotations and promoter sequences. CLENCH is currently in use at the University of Oklahoma and the Carnegie Institution of Washington at Stanford, as well as in our lab.
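As an illustration of the simulation-based FDR estimate described above, the sketch below draws random gene lists of the same size as the Changed-Genes list and counts how many categories pass the p-value cutoff by chance. The count_enriched callback stands in for the full CLENCH enrichment test and, like the example numbers, is an assumption; this is not the actual CLENCH code.

    # Sketch of FDR estimation by simulation (illustration only).
    use strict;
    use warnings;
    use List::Util qw(shuffle);

    sub estimate_fdr {
        my ($total_genes, $changed_size, $observed_enriched, $n_sims, $count_enriched) = @_;
        my $false_sum = 0;
        for ( 1 .. $n_sims ) {
            # Random gene list of the same size as the Changed-Genes list.
            my @random = ( shuffle(@$total_genes) )[ 0 .. $changed_size - 1 ];
            $false_sum += $count_enriched->( \@random );   # categories enriched by chance
        }
        my $mean_false = $false_sum / $n_sims;
        return $observed_enriched > 0 ? $mean_false / $observed_enriched : 0;
    }

    # Example usage with a dummy enrichment counter (always reports 1 category).
    my @total = map { "gene$_" } 1 .. 5000;
    my $fdr = estimate_fdr( \@total, 200, 20, 100, sub { return 1 } );
    printf "estimated FDR = %.2f\n", $fdr;   # 1/20 = 0.05 with the dummy counter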

Linear optimization for interpreting expression, annotation and promoter datasets

During the development of CLENCH, it became clear that the reason the GO is so useful for interpreting expression data is that it serves as an organizing framework, which explicitly articulates prior biological knowledge (about molecular function and biological processes in this case) and allows us to use that knowledge computationally while analyzing and interpreting expression data. The natural question is: would a more explicit articulation of prior knowledge be more useful? While attempting to interpret gene lists using results from CLENCH, we try to find gene groups that have a certain list of properties, such as a similar expression profile, a similar promoter composition and similar annotation. As the list of properties we wish to check grows, the set of genes that satisfies all the criteria keeps becoming smaller and smaller. However, if we can articulate our biological knowledge about genes that ‘go together’ in the form of declarative constraints, such as “two genes are similar if their expression profiles are correlated above a cutoff of 0.7, they have at least 3 common transcription factor binding sites in their promoters and they have 3 common annotations”, then we can apply optimization methods (especially linear programming and related methods)[19] to identify groups of genes that have the maximum number of properties in common. This is a more formal approach than just looking for genes that have similar expression profiles and similar promoter composition and similar annotations. The group of genes that satisfies the intersection mentioned above will be very small, and it will be different from the group that contains the maximum number of genes with common properties, where the exact property that is common between two genes in the group can be different.

For example, on analyzing microarray data for the oxidative stress response in A. thaliana using this approach, we identified a group of transcription factor genes that includes two transcription factors that showed a reduction (the majority of transcription factors showed an increase) in mRNA levels in response to the stress and had more than two common TF binding sites in their promoters. These two genes are missed by CLENCH because they get assigned to a separate cluster based on expression data and then do not qualify as an enriched category in that cluster. We can see that by formally declaring our prior biological knowledge in the form of constraints in a linear optimization procedure, it may be possible to extract more meaningful results from the data. So it would seem that a more rigorous organizing framework is more useful.
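To make the idea of a declarative constraint concrete, the following sketch evaluates the pairwise similarity rule quoted above (expression correlation above 0.7, at least three shared TF binding sites and at least three shared annotations) for a pair of genes. It only shows how such a constraint can be stated in code; it is not the linear optimization procedure itself, and the field names of the gene records, the motif names and the GO identifiers are assumptions used for the example.

    # Sketch of the declarative pairwise similarity constraint (illustration only).
    use strict;
    use warnings;

    # Pearson correlation between two expression profiles of equal length.
    sub pearson {
        my ($x, $y) = @_;
        my $n = scalar @$x;
        my ($sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0);
        for my $i ( 0 .. $n - 1 ) {
            $sx  += $x->[$i];          $sy  += $y->[$i];
            $sxx += $x->[$i] ** 2;     $syy += $y->[$i] ** 2;
            $sxy += $x->[$i] * $y->[$i];
        }
        my $den = sqrt( ( $n * $sxx - $sx ** 2 ) * ( $n * $syy - $sy ** 2 ) );
        return $den ? ( $n * $sxy - $sx * $sy ) / $den : 0;
    }

    # Number of elements two lists have in common.
    sub shared {
        my ($a, $b) = @_;
        my %in_a = map { $_ => 1 } @$a;
        return scalar grep { $in_a{$_} } @$b;
    }

    # Two genes 'go together' if all three declarative constraints hold.
    sub similar {
        my ($g1, $g2) = @_;   # hash refs: profile => [..], tf_sites => [..], annotations => [..]
        return pearson( $g1->{profile}, $g2->{profile} ) > 0.7
            && shared( $g1->{tf_sites},    $g2->{tf_sites} )    >= 3
            && shared( $g1->{annotations}, $g2->{annotations} ) >= 3;
    }

    # Toy example: two genes with proportional profiles and overlapping sites.
    my %gene_a = ( profile => [ 1, 2, 3, 4 ], tf_sites => [qw(ABRE DRE MYB G-box)], annotations => [qw(GO:0009409 GO:0006950 GO:0009737)] );
    my %gene_b = ( profile => [ 2, 4, 6, 8 ], tf_sites => [qw(ABRE DRE MYB)],       annotations => [qw(GO:0009409 GO:0006950 GO:0009737)] );
    print similar( \%gene_a, \%gene_b ) ? "similar\n" : "not similar\n";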

Signaling pathways as an organizing framework for expression data

High-throughput microarray gene expression studies are widely used in cancer research. One of the main aims of such studies is to identify genes and gene expression patterns that show a consistent change between cancer and normal states. Methods from the fields of machine learning and decision theory have been successfully used to identify genes whose expression patterns can reliably distinguish cancer and normal cells[20-23]. However, though these results are useful for staging and diagnosis purposes, they shed little light on the underlying biology, because the genes whose expression pattern can discriminate between cancer and normal states are not necessarily involved in causing the cancer[24]. Without prior knowledge, it is extremely difficult to determine whether these genes are artifacts or are truly involved in the causation of the cancer. As a result of my previous medical training, I became interested in developing methods that bridge the gap between the data resulting from a cancer gene expression study and our ability to discern the meaning of those data in order to understand the underlying biology.

Just as the Gene Ontology can serve as an organizing framework to interpret the meaning of gene expression changes, the notion of a signaling pathway can also serve as an organizing framework to formulate specific queries on expression data to interpret the expression changes. The analysis becomes more specific on examining a relatively small set of genes based on prior biological knowledge about a given pathway. In my work during a summer internship with the Functional Genomics and Systems Biology group at IBM Research, we attempted to develop methods to determine whether a particular pathway is involved in a given type of cancer by investigating the gene expression profiles of only those genes that are associated with that particular pathway.

The approach consists of three major steps. The first step is a careful selection of genes that are known to be associated with the pathways that are crucial in the regulation of important cellular processes such as cell proliferation, apoptosis and DNA-damage repair. Here the genes associated with a pathway are defined as genes whose proteins are inputs, members or targets of the pathway, as evidenced by prior knowledge in the form of journal articles and public databases. (The term “pathway” therefore doesn’t imply a biological signaling pathway in the strict sense.) Next, using the expression data of only those genes that are associated with a specific pathway, we develop three different formulations of a signature in the microarray data in terms of the degree and pattern of differential expression between cancer and control samples. Finally, to assess the statistical relevance of our findings in the previous step, we create pseudo-pathways by randomly picking the same number of genes as in the real pathway from the thousands of genes available on the same microarray. The statistical relevance is determined by the likelihood of finding the same signatures in the pseudo-pathways.

We applied our approach to gene expression data for four different cancers (colon, pancreas, prostate and kidney) and for six different pathways (p53, Ras, Brca, DNA damage repair, NFκb and β-catenin). To determine whether these pathways were involved in the four cancers, we performed a literature review in the following manner.
We identified studies in which the key member of the pathway, such as NFκb in the NFκb pathway, is studied experimentally in a particular cancer. If the key member is affected in some manner (in most cases by a mutation), then it is reasonable to expect that the genes associated with the key member could show changes in expression, as the key members are usually transcription factors (e.g. p53, NFκb) themselves or are proteins that affect transcription indirectly (e.g. Ras). Therefore, if one of the key members of a pathway was found to be abnormal (for example, a point mutation) in a particular type of cancer, we considered that the pathway was involved in that particular type of cancer. For the 24 different cancer-pathway pairs (4 cancers and 6 pathways), the literature review identified an involvement in 12 instances, no involvement in 3 instances and one instance (NFκb in kidney cancer) with conflicting reports. On comparing these results with the statistically significant signatures found by our analysis, we found that we detected a signature in 8 out of the 12 expected instances, and we did not detect a signature in any of the 3 instances where we did not expect to find one. In the case with conflicting reports, we did detect a signature. This work is described in further detail in Shah et al. 2003[25].

The point I want to make here is that, although it is possible to argue about the definition of a “pathway” and the definition of a “signature”, if we all agreed on both definitions we could use such an approach as a first-pass tool to determine whether a particular pathway may be involved in the causation of those cancers where the biology is poorly understood. Once again we see that using prior knowledge explicitly (in the form of predefined pathways) can allow us to formulate more focused queries on our data. Moreover, if we believe that our formulation of what constitutes a signature is accurate, then we have a formal mechanism to evaluate the statement “The ABC signaling pathway is-involved-in the causation of cancer X” using expression data. It would be extremely valuable to make a set of such statements and systematically query large datasets for evaluating them.
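The statistical-relevance step can be viewed as a simple permutation test. The sketch below draws pseudo-pathways of the same size as the real pathway and asks how often a signature at least as strong appears by chance. The signature_score callback and the toy score used in the example are assumptions that stand in for the three signature formulations described above; this is not the method implemented in Shah et al. 2003.

    # Sketch of the pseudo-pathway significance test (illustration only).
    use strict;
    use warnings;
    use List::Util qw(shuffle sum);

    sub pseudo_pathway_p_value {
        my ($all_genes, $pathway_size, $observed_score, $n_random, $signature_score) = @_;
        my $hits = 0;
        for ( 1 .. $n_random ) {
            # Pseudo-pathway: a random gene set of the same size as the real pathway.
            my @pseudo = ( shuffle(@$all_genes) )[ 0 .. $pathway_size - 1 ];
            $hits++ if $signature_score->( \@pseudo ) >= $observed_score;
        }
        return ( $hits + 1 ) / ( $n_random + 1 );   # add-one correction avoids p = 0
    }

    # Example with a toy signature: the mean absolute log-ratio of the gene set.
    my %log_ratio;
    $log_ratio{"gene$_"} = rand() - 0.5 for 1 .. 5000;
    my $toy_score = sub {
        my ($genes) = @_;
        return sum( map { abs $log_ratio{$_} } @$genes ) / @$genes;
    };
    my @pathway = map { "gene$_" } 1 .. 30;
    my $p = pseudo_pathway_p_value( [ keys %log_ratio ], scalar @pathway,
                                    $toy_score->( \@pathway ), 1000, $toy_score );
    printf "empirical p = %.3f\n", $p;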

Summary

We have seen that having an organizing framework (in the form of the GO or predefined pathways) that structures prior knowledge for computational use serves as a reference for integrating new data with prior knowledge. Such a framework also allows formulation of more specific queries to the available data, which return more specific results and facilitate interpretation by increasing our ability to fit the results into the “big picture”. It would be extremely valuable to have a framework that allows us to make statements, or sets of statements comprising a hypothesis, about biological processes and systematically query a wide variety of high-throughput datasets for evaluating them. In the next chapter I will describe the need and the challenges for formally representing both the data and the set of statements we wish to verify using the data.


Chapter 3: Towards a unified formal representation for genomics data

From the preceding chapter it emerges that imposing an organizing framework – either the GO, a linear optimization model or the notion of a pathway – on complex datasets facilitates their interpretation, and that if we make our organizing framework more rigorous, our ability to query and interpret complex datasets increases. Currently, in order to interpret existing data and design further experiments, the experimentalist must: 1) gather information of many different types about the biological entities that participate in a biological process, 2) formulate hypotheses about the relationships among these entities, 3) examine the different data to evaluate the degree to which his/her hypothesis is supported and 4) refine the hypotheses to achieve the best possible match with data. In today’s data-rich environment, this is a very difficult, time-consuming and tedious task. For example, to evaluate a simple hypothesis such as “protein A is a transcriptional activator of genes X, Y and Z”, the experimentalist must examine the literature for evidence showing that protein A is a transcription factor or exhibits protein sequence homology with known transcription factors. He/she must look for evidence indicating DNA binding activity for protein A and, if found, examine the promoters of X, Y and Z for the presence of binding sequences for protein A. Finally, the refined hypotheses are subjected to experimental testing.

Hypotheses that survive these tests – validated hypotheses – are published in scientific publications and represent the growing knowledge about biological entities, processes and the relationships among them. Validated hypotheses are eventually synthesized into systems of relationships called “models.” Biologists’ models are generally presented as diagrams showing the type, direction and strength of relationships among biological entities such as genes and proteins. Usually the goal of constructing a model is to predict the outcome or result from a system at some point in the future. I believe that in biology such predictive models, though essential, are still a thing of the future, because much of biology works by applying prior knowledge (‘what is known’) to interpret datasets rather than by applying a set of axioms that will elicit knowledge[26]. Therefore, currently the most profitable manner in which we can use models is to construct a model or a set of models and then test them for consistency with the information at hand, revise the models to minimize the inconsistencies and then pick the most consistent model as a basis for further experiments.

If we adopt the notion of a hypothesis (or a model) as the organizing framework for the datasets we wish to query and interpret, we immediately encounter several problems. For a large number of genes and proteins, it is extremely difficult to integrate current knowledge about the relationships within biological systems to formulate such hypotheses or models. It is difficult to determine whether such hypotheses are internally consistent or are consistent with data, to refine inconsistent hypotheses and to understand the implications of complicated hypotheses[2]. It is widely recognized that one key challenge in managing this data overload is to represent the results of high-throughput experiments in a formal representation – a computer-interpretable standardized form that can be the basis for unambiguous descriptions of hypotheses and models[3].
This raises the following question: What are the desirable properties of a formal representation for hypotheses or models? Peleg et al.[27] have suggested the following set of desirable properties for a formal representation of models of biological processes:

I. A formal representation should be able to present structural, functional and dynamic views of a biological process. The structural and functional views show the entities that participate in a process and the relationships among them. The dynamic view shows the process over time, including branch points and conditional sub-processes.

II. A formal representation should include an associated ontology that unambiguously identifies the entities and relationships in a process.

III. It should be able to represent biological processes at various scales and should allow hierarchical representation of sub-processes to manage complexity.

IV. The representation should have an intuitive visual layout.

V. The representation should be able to incorporate new data as they become available and should be extensible to allow new categories of information as they come into existence.

VI. The representation should have a corresponding conceptual (mathematical) framework that allows verification of system properties using simulation or inference mechanisms.

Now, if we can devise a formal representation for hypotheses as well as the mechanisms to check the consistency of hypotheses with data and prior knowledge, we may be able to computerize the process. Peter Karp has suggested that as we get more formal in representing our data and formulating hypotheses that explain the data, we can check those hypotheses for internal consistency and logical errors more easily and can refine them more easily too[2]. Moreover, if we develop such a formal representation, we can envision a tool that will accept current datasets, information and existing knowledge and integrate them in an environment that supports the formulation and testing of hypotheses. We are developing such a novel tool, which will accept statements of alternative hypotheses and automate their evaluation for consistency with prior data and knowledge. However, before we can attempt to build such a tool, we need a formal representation for the data and hypotheses as well as the steps to evaluate consistency of hypotheses with data. There are several challenges in building such a unified formal representation, which I describe below.

Challenges for developing a unified formal representation

The first challenge is the problem of knowledge representation. How can we systematically express the various biological entities that participate in any given biological process and the many qualitatively different kinds of relationships between them? This requires the development of an ontology for unambiguously representing biological objects and interactions. An ‘ontology’ is most generally defined as a specification of the concepts and relationships that can exist in the domain under consideration[28]. Specifically, an ontology is a collection of concepts representing domain-specific entities along with definitions of those concepts, a set of relationships among concepts, properties of each concept, the range of allowed values for each property, and, in some cases, a set of explicit axioms defined for those concepts. We require different ontologies to represent biological processes (and data) at different levels of resolution, because biological processes and data can be considered at varying levels of detail, ranging from molecular mechanisms to general processes such as cell division and from raw data matrices to qualitative relationships[29].

Ontologies have gained a lot of popularity in molecular biology over the last several years. The earliest ontologies describe properties of ‘objects’ such as genes, gene products and small molecules. The newer ontologies describe the ‘processes’ that genes, proteins and small molecules participate in. Currently several ontologies exist that allow representation of biological objects and their properties. For example, the Gene Ontology (GO) is a structured, precisely defined, common controlled vocabulary for describing the roles of genes and gene products in any organism[12]. The Sequence Ontology (SO) provides a detailed vocabulary for naming various sequence attributes[30], and InterPro provides concepts and terms to name protein motifs and families. The Microarray Gene Expression Data society provides a controlled vocabulary for microarray data and genomics experiments[31], and the Biomaterial Ontology provides an ontology to describe reagents and materials in biological experiments[31].

There are several ontologies that allow representation of processes in a biological system by specifying relationships between biological objects, for tasks ranging from modeling biological systems to extracting information from the literature. The main ones are: the Systems Biology Markup Language (SBML), which can represent quantitative models of biochemical reactions and pathways[32], although for most systems such detailed information is not yet available; EcoCyc’s ontology, which is used to represent information about metabolic pathways for E. coli[33] and provides a hierarchy of biological entity types and processes; and an ontology developed by Rzhetsky et al., which allows representation of signal transduction pathways for programs that extract information about such pathways automatically from the published literature[34, 35]. The most recent ontology in this category, and perhaps the one most suitable for use in a formal representation, is the Bioprocess Ontology developed by Russ Altman's group[27].

The second problem is to represent the biological system conceptually. A conceptual framework for representing biological systems must accommodate the modularity and temporal behavior of biological systems, as well as handle their non-linearity and redundancy.
The conceptual frameworks used to represent biological models vary from ordinary differential equations to Boolean[36-38] and Bayesian[39-41] networks, as well as Petri nets. However, these methods face several drawbacks, which I will describe in chapter 4. The inability to represent disparate kinds of information, at different levels of detail, about biological systems in a common conceptual framework is a major limitation in the creation of formal representations of biological systems, and current efforts usually focus on just one or two categories of information[40, 42]. In the light of such a limitation of existing methods, an alternative approach is to represent the biological processes in a system as a sequence of ‘events’ that link particular ‘states’ of the system[43, 44]. It has been shown before that complex processes, particularly those that exhibit non-linear behavior, are readily described by event-driven dynamics[45], because event dynamics allows a description of the process in terms of the observed effects of the non-linear behavior rather than requiring that the non-linear behavior itself be represented as a mathematical function. Moreover, event-based approaches can represent everything from simple processes, such as protein phosphorylation, to complex ones, such as the cell cycle, allowing a wide range of resolution. An event-based framework offers several other advantages as well: 1) it can explicitly represent states, which allows for representing information such as commitment to a developmental pathway[27]; 2) it allows hierarchical representation of properties and hence avoids a rapid increase in the number of states that need to be represented[46]; 3) it can represent temporal constraints on when events occur; and 4) it can readily accept new categories of information and represent information at different resolution levels. A related challenge is to integrate the conceptual framework and the ontology to create a single formal representation. Event-based representations per se do not have an associated ontology[43, 47, 48]. For example, the relationship ‘protein A activates protein B’ is different from ‘protein A activates gene X’, though both may be described by the same word. When representing a biological process in a conceptual framework, it is essential to distinguish between the two meanings. An ontology can distinguish between the two relationships by providing different terms to represent the two meanings. Having an associated ontology in which each term in the ontology has a corresponding construct in the conceptual framework (termed mapping an ontology) allows this distinction to be made
In this situation, central repositories and model organism databases such as YPD[49], SGD[50] or TAIR[16] that store biologist-curated data and information about an organism or a biological system are our best resources. Data, such as gene and protein sequences as well as GO annotations, can be easily retrieved from these sources and converted to the relevant ontology. Curated information – derived by analyzing relevant data, such as putative protein function assignments based on protein domains found in the predicted protein sequence – in such repositories is not structured to support a formal representation. Therefore, we need to translate such information into our ontology and then store the translated information locally along with a reference to the original data and the analyses method used. However, a clear reference to the original data, as well the analysis method, is not always available, breaking our link to the underlying data for some information. Newer repositories designed to store data and information using an explicit ontology are under development for some biological systems (e.g. GeneCards for human genes[51] and Reactome for core pathways and reactions in human biology[52]) and might serve as a source of highly structured information in the future with the caveat that they might exacerbate the problem of unlinking structured information from its underlying data. All the challenges described above are strongly inter-related as shown in Figure 6 and are active research areas with numerous active researchers. Our goal is to build on those research efforts and create a unified formal representation that will support computer-aided formulation and testing of hypotheses. 27


Figure 6 Components of a formal representation: A formal representation consists of a conceptual framework and a corresponding domain ontology, which provides the entities to create an instance of a model defined by the conceptual framework. The information in a given domain structured according to an ontology constitutes a knowledgebase and is essential for a formal representation to be useful.

Description of relevant related efforts

As awareness grows of the need to present information about biological processes in a formal representation, which can provide unambiguous descriptions of hypotheses and models[2, 3], there are increasing efforts to formulate such formal descriptions[27, 53-55]. The efforts from the biology community are aimed at developing shared domain knowledge models (ontologies) and symbolic notations for unambiguously representing and sharing information, along with constraints on what can and cannot be expressed using them. The main efforts in this direction are the Riboweb project and the “modeling biological processes as workflows” project in Russ Altman’s group[27, 53]. The efforts in the computer science and engineering community are aimed at developing computable descriptions of biological processes based on formal languages, logic and reasoning methods, along with associated tools for simulation. The main effort in this direction is the Pathway Logic project[55].

The Riboweb project

Riboweb is a prototype online collaborative data resource for the ribosome[53, 54]. The goal of the Riboweb project is to develop a system that can represent and interpret multiple, diverse data sources and support collaborative scientific interpretation of these sources[53, 54]. Riboweb consists of a large knowledgebase of relevant published data, computational modules that can process this data to test hypotheses about the ribosome’s structure, and a web-based user interface. Riboweb links published 3D models of the prokaryotic ribosome structure to the primary data sources on which they are based[53]. It also allows users to modify their interpretation of data and compute new models by using several stored data analysis methods. The information in Riboweb’s knowledgebase is organized using four ontologies, created using the Ontolingua tool, to explicitly represent information about the ribosome. These ontologies are: 1) the physical-thing ontology, which specifies the molecular components and relationships between them necessary for representing ribosome structure; 2) the data ontology, which specifies the types of data gathered about ribosome structure; 3) the reference ontology, which specifies the publication type for the source of structural data; and 4) the methods ontology, which specifies the types of analyses Riboweb can perform on the stored data and the required inputs to execute the analyses. The knowledgebase contains information and ribosome structural data from 160 published papers. The papers are read by curators and information is manually entered into the knowledgebase. The computational modules provide the steps to execute the analysis methods declared in the methods ontology. Users access Riboweb via a web-based interface to select a dataset and an analysis method to apply on the dataset. Riboweb computes the results – distances between various residues in the 3D structure – and allows their comparison to distances generated using different datasets. Riboweb serves a user community of about 20 labs studying the structure of ribosomes[54].

Modeling biological processes as workflows

The goal of modeling biological processes as workflows (BioWorkflows) by Peleg et al[27] is to create a framework for representing models of biological systems. BioWorkflows use an explicit ontology for describing biological processes and hybrid Petri Nets as the underlying conceptual framework. The ontology used in this project is created by merging aspects of the Transparent Access to Multiple Biological Information Sources (TAMBIS) ontology and an ontology used to describe business processes[27, 56]. The TAMBIS ontology provides the biology-specific entities whereas the business processes ontology provides concepts to represent parallel, competing, alternative or sequential processes. The resulting ontology can represent biological processes at varying levels of resolution. Existing knowledge is captured using Protégé and is stored as an object-oriented database made up of text files. A model of a biological process is created using a Protégé plug-in, which provides a visual interface to draw the biological process. Hybrid functional Petri Nets (which I will describe in chapter 4) are used to represent the biological process mathematically and to perform qualitative reasoning about the possible outcomes of a process. BioWorkflows were used to create a model of erythrocyte invasion by P. falciparum merozoites, and the qualitative reasoning ability was used to answer questions such as the following: If we inhibit the process “attachment to erythrocyte involving Glycophorin A”, can we still get to a state where “merozoit is permanently attached”? However, the qualitative analysis is not computerized and has to be performed manually.

Pathway Logic

The goals of the Pathway Logic work are to build biological process models which can assist in the generation of informed hypotheses about complex biological processes and can be interactively modified by biologists[57]. Pathway Logic attempts to algebraically formalize the biologists’ reasoning task, which uses informal notations and potentially ambiguous representations of concepts like pathways, cycles, and feedback loops. It applies techniques from formal methods and rewriting logic to develop a representation of biological processes that: (i) allows discrete reasoning, (ii) formally defines a model, such as a pathway, and allowable reasoning steps, and (iii) generates testable hypotheses[55, 57]. Pathway Logic models are developed using the Maude (http://maude.csl.sri.com) system, a tool based on rewriting logic[57]. Rewriting Logic is a logical formalism which presents states of a system as elements of an algebraic data type1 and the behavior of a system as local transitions between states described by statements called rewrite rules. In Pathway Logic, algebraic data types are used to represent both biological entities such as proteins and small molecules as well as their properties such as biochemical modifications and cellular compartmentalization[57]. Rewrite rules are used to represent local processes within or between cells. A biological process is represented as a collection of rewrite rules together with the algebraic entity declarations. Rewriting logic then allows reasoning about possible changes given the basic rules already defined. The Maude system provides search and model-checking capabilities for the pathway logic framework. The search capability can predict future states of a system, in response to a stimulus or perturbation, from a given initial state. Model-checking can identify properties of unfeasible pathways or identify a pathway with a given property. Maude can export models to other formalisms and formats suitable for input to other tools and visualization. For example, models can be exported to the Bionet viewer tool, which lays out the models as Petri Nets. Pathway Logic was first used to create models of signal transduction networks, such as the Epidermal Growth Factor Receptor (EGFR) network[55].

1 A data type is a name or label for a set of values and some operations which can be performed on that set of values. An algebraic data type is a special data type with one or more constructors, which, during programming, allows the declaration of different instances of the data types tailored to the entity being represented.
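To make the idea of a rewrite rule more concrete, the following is a minimal sketch in Perl (rather than in Maude, which Pathway Logic actually uses) of a system state represented as a multiset of entities and one illustrative ‘binding’ rule applied to it. The entity and rule names (EGF, EGFR, bind) are invented for illustration and do not reproduce any published Pathway Logic model.

use strict;
use warnings;

# A system state as a multiset of entities (entity name => count).
my %state = ( 'EGF' => 1, 'EGFR' => 1 );

# A rewrite rule replaces a local pattern (lhs) with a new pattern (rhs),
# loosely mimicking a Maude-style rule such as:  rl [bind] : EGF EGFR => EGF-EGFR .
my %bind_rule = (
    lhs => [ 'EGF', 'EGFR' ],
    rhs => ['EGF-EGFR'],
);

sub apply_rule {
    my ( $state, $rule ) = @_;
    # The rule can fire only if every entity on its left-hand side is present.
    for my $entity ( @{ $rule->{lhs} } ) {
        return 0 unless ( $state->{$entity} // 0 ) > 0;
    }
    $state->{$_}-- for @{ $rule->{lhs} };    # consume the reactants
    $state->{$_}++ for @{ $rule->{rhs} };    # produce the products
    return 1;
}

apply_rule( \%state, \%bind_rule );
print "$_: $state{$_}\n" for sort keys %state;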


Chapter 4: A novel conceptual framework

In order to develop a unified formal representation, we need a conceptual framework that can represent biological system models. The conceptual framework should represent the temporal dynamics of the process and should not require a complete model rewrite on minor changes. The conceptual framework should provide systematic methods to evaluate, update, extend and revise models represented in that framework. The conceptual framework should also be able to represent the manner in which biologists revise models and incorporate that knowledge to support model revision. In this chapter I will describe the conceptual framework for representing biological processes conceived by Stephen Racunas and compare it with other widely used frameworks. I was closely involved in defining the requirements of the framework, mapping it to real biological concepts and debugging the framework. In the following chapter I will describe the other components that combine with this conceptual framework to comprise a unified formal representation for biological processes and will describe a prototype system that we developed based on it. Biological systems can be represented by describing the system’s dynamics in terms of the events that occur in the system. In 2003 we described the development of a framework for conceptualizing biological processes in terms of events[44]. We define biological events as changes in a biological system for which we can obtain experimental evidence. In order to represent hypotheses about a biological process in a formal manner, this framework requires an ontology of objects (agents) and processes, their properties and a specification of the relationships in which these entities can participate. We refer to this ontology as the hypothesis ontology, which I will describe in the next chapter. The conceptual framework specifies a grammar that generates a formal language for representing hypotheses. The grammar is declared as follows:

Event → Subject.Verb.Object
Event → Subject.Verb.Object.Context
Event → Subject.Verb.Object.Context.AssocCond
Subject → (Actor | Context | Event)
Verb → (Physical | Biochemical | Logical)
Object → (Actor | Context | Event)
Actor → (Gene | Protein | Complex …)
Context → (Physical | Genetic | Temporal)
AssocCond → (Presence of | absence of).Agent

The terminal symbols – symbols which cannot be further decomposed in a grammar – are supplied by the hypothesis ontology. So we describe biological events by naming the agents (such as proteins and nucleic acids) from the hypothesis ontology and the processes (such as “binds”) that connect them. Thus, a biological event consists of an acting agent (a “subject,” such as a protein), a relationship (a “verb,” such as induce, repress …), a target agent (an “object,” a gene, protein…), the genetic and cellular contexts in which the event takes place, and a set of associated conditions (such as the presence or absence of other agents) which can accompany the event. A sentential construct2 conforming to the grammar is an hypothesis. An event where the terminals are from the hypothesis ontology is the most basic sentential construct and corresponds to the simplest hypothesis that can be expressed. The set of all possible sentential constructs derived using the grammar is the set of all possible hypotheses and is called the hypothesis space. This grammar, together with the hypothesis ontology, allows us to represent hypotheses in a formal language that specifies the time- and context-dependent relationships among the system’s objects and processes[44, 58]. The conceptual framework specifies methods to evaluate such formal language hypotheses for internal consistency and for agreement with existing knowledge[44]. Internal consistency is evaluated by checking conformance to the grammar. Consistency of an hypothesis with observed data and prior knowledge is evaluated by applying constraints and rules. Constraints specify classes of forbidden events. Rules are the operations performed upon available information in order to enforce the constraints. Rules generate judgments of support or conflict, depending upon whether or not an assertion in the formal language is supported by existing data and knowledge. The rules can operate on a wide variety of data types such as numeric data from microarrays, DNA and protein sequence data and qualitative statements extracted from the literature. Rules can be easily extended (or new rules can be added) to use new data types as they become available. For each hypothesis, the final tally of these support and conflict calls determines its degree of agreement with existing knowledge. In this manner the framework integrates diverse data at the logical level. The framework also specifies neighborhood functions to formalize “similarity” between hypotheses. Neighborhood functions use biologically acceptable notions to generate sets of variant events for events that conflict with existing data or prior knowledge. We examine these variants to find more fitting events and replace conflicted events with superior variants to produce hypotheses – which we call neighboring hypotheses – that better fit the stored information. These neighborhood functions can vary in complexity, ranging from simple ones such as “replacing the subject of a conflicted event with an entity that shares 80% sequence similarity and participates in a valid event of the same type in model organism X” to more complex ones such as “replacing the conflicted sequence of events in the hypothesis with an alternative sequence of events that can produce the same output/s in homologous model organism X”. Neighborhood functions perform computer-aided hypothesis revision and allow a search of the hypothesis space in a formal manner. The conceptual framework along with the hypothesis ontology forms an event-based formal representation, which allows data integration at the logical level. The resulting representation is easily extensible to incorporate new data types as they become available. Every piece of information rules out some portion of the hypothesis space, putting increasingly tighter bounds on the set of possible hypotheses. Because the entities as well as the relationships used to construct hypotheses are derived from an ontology, unintended assertions, which are common artifacts of statistical and equation-based approaches, are avoided while specifying an hypothesis. In this conceptual framework, the hypotheses, evaluation methods and data are directly compatible with each other, making it possible to bring together the implications of many kinds of data and information in a unified formal representation.

2 A grammatical unit that is syntactically independent and has a subject that is expressed or, as in imperative sentences, understood and a predicate that contains at least one finite verb.
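As an illustration, the following Perl sketch shows how a single event conforming to the grammar above might be held as a data structure. The field names and the Gal4p/gal1 example are chosen for illustration only and are not the data structures of the actual implementation.

use strict;
use warnings;

# One event: Subject.Verb.Object.Context.AssocCond, with terminals drawn
# from the hypothesis ontology (agents, relationships and contexts).
my $event = {
    subject    => { type => 'Protein', name => 'Gal4p' },
    verb       => { category => 'Physical', name => 'binds_to_promoter' },
    object     => { type => 'Gene', name => 'gal1' },
    context    => { physical => 'nucleus', genetic => 'wild type' },
    assoc_cond => { presence_of => 'galactose' },
};

# An hypothesis is one or more event sets linked by logical/temporal operators.
my $hypothesis = {
    event_sets => [ { context => 'nucleus', events => [$event] } ],
    operators  => [],
};

print "$event->{subject}{name} $event->{verb}{name} $event->{object}{name}\n";
print scalar @{ $hypothesis->{event_sets} }, " event set(s) in this hypothesis\n";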

Extensions to the conceptual framework

We have made several extensions to the conceptual framework after we deployed our prototype system. I am discussing them here because they are a continuation of the work just described. These extensions make the conceptual framework more rigorous, expressive and powerful. We have made the formal language for hypothesis representation more rigorous by developing an associated logic for evaluating hypotheses for consistency with existing information[59]. A logic associated with a formal language provides a set of rules for constructing formulas (testable expressions) using terms from that language. Because of this extension, we can now represent a biological hypothesis as a formal model in strict model-theoretic terms[60]. In the domain of model theory, a model is an abstract (but formal) representation of entities and relations as well as interdependencies among them within the system described by the model[60]. For every expression constructed using a formal language with an associated logic, a model provides a testable explanation implied by the logic. As before, an event where all terminals are from the hypothesis ontology is the most basic model and corresponds to the simplest hypothesis that can be expressed. More complex models (‘higher-order’ models) such as pathways, which can express causal, temporal, and hierarchical relationships between biological events, are constructed using lower-level models describing individual biological events. Such higher-level models, composed of lower-level models, can themselves be used to form still higher levels of abstraction, including those describing biological systems and relationships between systems[59]. Thus models, at different resolution levels, represent events, hypotheses, and collections of hypotheses. Therefore we can represent a biological system at a whole range of resolutions as well as incorporate differing resolution levels in one model. As a result of these extensions, the neighborhood functions can now be made more rigorous because we can borrow models from other organisms to use as templates during the process of generating neighboring hypotheses.


Comparison with other conceptual frameworks

In the context of formally representing models of biological process, a conceptual framework provides a set of abstract constructs that can be used to create a formal description of the process. Various conceptual frameworks have been used for this purpose varying from ordinary differential equations, which represent processes in terms of rates of change over time of the member entities, to Boolean[36, 38, 61] and Bayesian networks[39-41] as well as Petri Nets[27, 62], which represent processes as directed graphs. Here I will briefly describe each of these frameworks and compare them with the conceptual framework we developed in terms of the desirable properties of a conceptual representation put forth by Mandel et al[63].

Ordinary differential equations

Ordinary differential equations (ODEs) represent biological processes in terms of the rate of change over time of the entities participating in the process. ODEs use parameters to link variables in different equations comprising the model of a biological process[64]. The general form of ODEs used to represent biological processes can be described as dY/dt = p1 + p2 f(X), where the rate of change of Y is dependent on the level of X, the “strength” of the dependency is represented by the parameter p2, the “type” of dependency by the function f, and the parameter p1 can be used to represent baseline levels or certain constants. The value of X at a given time might be measured or can be described by another differential equation. The complete set of differential equations describing a biological system is referred to as a system of differential equations. Numerical solution methods enable differential equations to simulate a system by solving the equations numerically through time and generate testable predictions about the system. In addition, standard tools and software exist to construct and simulate differential equation models. However, there are disadvantages to ODEs as a conceptual framework for biological processes: Detailed information about the concentrations of involved entities and the values of the dependency parameters is not available for most biological processes. Even if such information is available, the task of constructing a differential equation model of biological processes requires significant training in mathematics in order to formulate the differential equations correctly. Therefore, their use is usually limited to modeling relatively small networks[65, 66]. Due to the manner in which ODEs are presented, it is seldom possible to examine a system of differential equations such as those of Tyson et al and comprehend the biological process they represent[63, 64]. Moreover, when we look at a set of ODEs it is not possible to infer the creation or presence of an entity just by examining the equations. Solving a large system of ODEs is also computationally intensive. Differential equation models are not easily extensible, and addition of a new type of information or even a new entity modifies many dependency parameters, usually requiring a complete rewrite of all the equations comprising the model. It is also not possible to associate an ontology with ODEs, so that a specific form of the function f gets associated with a particular relationship such as “induce”. It is not easy to create an intuitive visual representation of a system of differential equations[63]. Moreover, it is not easy to represent qualitative transitions in ODEs, such as from an inactive to an active protein. (This would have to be represented as a variable whose value is the ratio of the concentration of the active to the inactive protein, together with a threshold parameter for that variable.) ODEs also cannot represent physical movement of entities across cellular locations or represent states such as stages in a developmental pathway.
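As a sketch of how such a system is simulated numerically, the following Perl fragment integrates a single equation of the general form dY/dt = p1 + p2 f(X) by the forward Euler method; the parameter values and the choice of f are arbitrary and purely illustrative.

use strict;
use warnings;

# dY/dt = p1 + p2 * f(X), with f chosen here as a simple saturating function.
my ( $p1, $p2 ) = ( 0.1, 2.0 );            # illustrative baseline and strength parameters
my $f = sub { my $x = shift; return $x / ( 1 + $x ) };

my ( $y, $x, $dt ) = ( 0.0, 1.0, 0.01 );   # initial Y, fixed X, time step
for ( 1 .. 1000 ) {                        # forward Euler integration over 10 time units
    $y += $dt * ( $p1 + $p2 * $f->($x) );
}
printf "Y after 10 time units: %.2f\n", $y;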

Boolean networks

A Boolean network represents a biological process in the form of a directed graph, where each entity and interaction is represented by a node in the directed graph and there is an arrow from one node to another if and only if there is a causal link between the two nodes. Each node in the graph can be in one of two states, on or off. For a gene, on corresponds to the gene being expressed; for other entities, it corresponds to the substance being present[63, 67]. For processes, on means the process is active. The sum total of the states of all nodes at each instant defines the state of the entire system. Time is represented as proceeding in discrete steps. At each step, the new state of a node is a Boolean function of the prior states of the nodes with arrows pointing towards it[63, 67]. This update procedure depends only on the upstream nodes without any effect of the prior state of the downstream node. The validity of the model is tested by comparing simulation results with actual time series observations[61, 67]. Boolean networks fail to capture the diversity of available information because they have only causal links between nodes. Therefore, it is not possible to associate different relationship types in an ontology with separate types of links. Boolean networks require specification of Boolean functions corresponding to each node in the system and the process of designing these functions is very subjective (depending on the input nodes, the type of edge, etc). Adding/removing nodes also requires one to add/remove these functions, leading to difficulty in maintaining and extending Boolean models of a system. Boolean networks cannot represent physical events such as transport of entities across cellular compartments or creation/destruction of a protein complex.
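The following Perl sketch illustrates such a synchronous update scheme on a toy network loosely inspired by the GAL genes; the nodes and Boolean functions are invented for illustration and are not a validated model.

use strict;
use warnings;

# Each node is on (1) or off (0); its new state is a Boolean function of the
# prior states of its upstream nodes, applied synchronously at each time step.
my %state = ( galactose => 1, Gal80_active => 1, Gal4_active => 0, GAL1 => 0 );

my %update = (
    galactose    => sub { my $s = shift; $s->{galactose} },               # external input
    Gal80_active => sub { my $s = shift; $s->{galactose} ? 0 : 1 },
    Gal4_active  => sub { my $s = shift; $s->{Gal80_active} ? 0 : 1 },
    GAL1         => sub { my $s = shift; $s->{Gal4_active} },
);

for my $t ( 1 .. 3 ) {                       # time proceeds in discrete steps
    my %next = map { $_ => $update{$_}->( \%state ) } keys %update;
    %state = %next;
    print "t=$t: ", join( ' ', map {"$_=$state{$_}"} sort keys %state ), "\n";
}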

Bayesian networks

Bayesian networks (BN) represent causal relationships in biological processes as conditional probability distributions between stochastic variables. BNs treat biological entities such as genes and proteins as stochastic variables and the relationship between them as a conditional probability distribution (CPD). For example, in a BN a link such as A “causes” B is specified by giving the CPD of B given A, which represents what values B will take, and with what probability, given a specific value for A. When determining the degree of fit of a BN with data, this CPD is estimated by maximizing the joint probability of the model, given the data. However, for reasoning about a biological process using BN, we would have to use the data as an approximation of the CPD, for which the current data is insufficient, or we have to specify the CPD of every node in the BN a priori, which is usually not possible. Bayesian networks are visualized as directed acyclic graphs where nodes represent variables and arcs represent postulated causal dependencies between these variables. The application of BN to biological networks has been investigated by Friedman et al, Pe’er et al and Hartemink et al[39-41]. Hartemink used BNs to model gene regulation of galactose metabolism in Saccharomyces cerevisiae and then evaluated the agreement of alternative models with microarray data. Simple BNs cannot model cyclic dependencies and representing feedback is possible only with dynamic Bayesian networks (DBN), which represent feedback as dependency relationships between nodes across successive time-steps[39, 63, 67]. Bayesian networks suffer from many of the same problems as Boolean networks, such as requiring the specification of the CPD of each node, analogous to the Boolean function for each node, and the inability to represent physical events such as transport across membranes. Bayesian methods do not readily incorporate qualitative information such as presence or absence of phosphorylation on a protein. They also cannot represent biological knowledge at different levels of resolution because it becomes extremely difficult to specify the CPD between entities that exist at different scales. Moreover, the modification which enables DBNs to represent cyclic dependencies makes the notation extremely unreadable for visualization as directed acyclic graphs[63].
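The following Perl sketch shows what specifying a single link “A causes B” as a conditional probability distribution amounts to; the probability values are made up for illustration.

use strict;
use warnings;

# P(B | A): for each state of A, a distribution over the states of B.
my %cpd_B_given_A = (
    on  => { on => 0.9, off => 0.1 },    # P(B | A = on)
    off => { on => 0.2, off => 0.8 },    # P(B | A = off)
);
my %prior_A = ( on => 0.3, off => 0.7 );

# Marginal P(B = on) = sum over the states of A of P(A) * P(B = on | A).
my $p_b_on = 0;
$p_b_on += $prior_A{$_} * $cpd_B_given_A{$_}{on} for keys %prior_A;
printf "P(B = on) = %.2f\n", $p_b_on;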

Petri Nets

Petri Nets were originally developed to describe and analyze concurrent systems, which have multiple events happening at the same time. There are many variants of the formalism (I will describe one of them below) and a variety of languages and tools for specification and analysis of systems using Petri Nets. Petri Nets have a graphical representation that matches conventional representations of biological processes. They have been used to model both metabolic pathways and simple genetic networks[47, 48, 62, 68]. A hybrid functional Petri Net (HFPN)[48], a popular variant, represents a biological process as a directed graph containing two kinds of nodes: places P = {pi} and transitions T = {ti}. An HFPN also contains three kinds of arcs A = {ai}: normal, inhibitory and test arcs. Arcs connect places with transitions to provide flow routes through the net, and each arc can have an associated weight, which quantifies the strength of the link represented by that arc. Places contain a type of “markers”, called tokens, which represent the current state, called a marking, of an HFPN at each instant. Normal arcs allow the flow of tokens between places and transitions. Normal arcs are of two types: input arcs, which connect places to transitions, and output arcs, which connect transitions to places. Test arcs and inhibitory arcs modulate transitions without using tokens from the upstream places. A Petri Net is simulated in the following manner: Each place pi contains a number of tokens M(pi) and collects tokens until they are passed on via transitions to downstream places. Tokens are consumed or produced when a certain condition is satisfied and a transition is activated. Petri Nets are most suitable for modeling systems that can be described by finite sets of atomic processes and atomic states[68]. Petri Nets have an intuitive graphical representation, can represent systems at coarse- or fine-grained levels, and enable qualitative analysis. However, Petri Nets do not have features for expressing spatial properties of entities, each configuration has to be expressed as a different system state, and Petri Net models with spatial features do not scale up well[68]. Petri Nets also lack an ontology for establishing “types” of transitions and do not provide methods for revising or generating neighbors of biological process models expressed as Petri Nets in a computer-assisted manner.
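The following Perl sketch shows one firing step of a simple place/transition net of this kind; the places, weights and the ‘transport’ transition are invented for illustration and omit the inhibitory and test arcs of a full HFPN.

use strict;
use warnings;

# A marking assigns tokens to places; a transition is enabled when every input
# place holds at least as many tokens as the weight of its input arc.
my %marking = ( galactose_out => 2, Gal2p => 1, galactose_in => 0 );

my %transport = (
    inputs  => { galactose_out => 1, Gal2p => 1 },
    outputs => { galactose_in  => 1, Gal2p => 1 },    # the permease is returned
);

sub fire {
    my ( $m, $t ) = @_;
    for my $p ( keys %{ $t->{inputs} } ) {            # check that the transition is enabled
        return 0 if ( $m->{$p} // 0 ) < $t->{inputs}{$p};
    }
    $m->{$_} -= $t->{inputs}{$_}  for keys %{ $t->{inputs} };    # consume tokens
    $m->{$_} += $t->{outputs}{$_} for keys %{ $t->{outputs} };   # produce tokens
    return 1;
}

fire( \%marking, \%transport );
print map {"$_: $marking{$_}\n"} sort keys %marking;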

Summary

Table 1 summarizes the comparison of differential equation, Boolean, Bayesian and Petri Net frameworks with our conceptual framework in terms of the desirable properties of a conceptual framework put forth by Mandel et al[63]. Our conceptual framework and Petri Nets both are derived from engineering efforts to represent complex processes in an event-based manner. Moreover it is possible to map an ontology to Petri Nets so that different relationship types in an ontology can be associated with different types of transitions. Hybrid functional Petri Nets are the closest to our conceptual framework and have been used by Peleg et al[27] to create a formal representation for biological processes by mapping an ontology used to describe business workflows to Petri Nets. The resulting formal representation is very close to ours and I will discuss it further in chapter 7.


Property                                               | ODE          | BN         | DBN           | HFPN                      | Racunas et al framework
Represent states                                       | No           | Yes        | Yes           | Yes                       | Yes
Represent transformations of objects                   | Indirectly   | No         | No            | Yes                       | Yes
Represent state transitions                            | No           | Yes        | Yes           | Yes                       | Yes
Represent transport and physical movement of entities  | No           | No         | No            | Yes                       | Yes
Represent creation and destruction of entities         | Indirectly   | No         | No            | Yes                       | Yes
Represent stochastic behavior                          | Indirectly   | Indirectly | Yes           | Indirectly                | Indirectly
Represent temporal dynamics                            | Yes          | Indirectly | Indirectly    | Indirectly                | Yes
Simulation basis                                       | Numerical    | Rule based | Probabilistic | Quasi-numerical           | Constraint based
Computational complexity                               | High         | Medium     | High          | High                      | Low
Existing tools                                         | Mathematica, | None known | BayesiaLab    | Visual Object Net, Woflan | HyBrow
                                                       | Xppaut       |            |               |                           |

Table 1 A comparison of the properties of different conceptual frameworks in terms of the desired properties of a conceptual framework proposed by Mandel et al[63]. ODE = Ordinary differential equations, HFPN = Hybrid functional Petri Nets, BN = Boolean networks and DBN = Dynamic Bayesian networks. Columns for ODE, BN, DBN and HFPN are adapted from Table 1 in Mandel et al[63].


Chapter 5: Prototype implementation of HyBrow

I have described how a unified formal representation can enhance our ability to interpret large-scale datasets. I have described the challenges in developing such a representation and have outlined a novel conceptual framework, developed jointly with Stephen Racunas, as one of the components of such a formal representation. Several issues still need to be addressed, such as developing the ontology, gathering information and structuring it into the ontology, before we have a unified formal representation. In this chapter I will describe the development of the remaining components of the formal representation and a working prototype, named HyBrow (Hypothesis space Browser), based on it. At the heart of HyBrow is the idea that disparate kinds of information can be represented in a unified formal representation. HyBrow is a prototype system for constructing hypotheses and evaluating them for consistency with existing knowledge. HyBrow consists of a conceptual framework with the ability to accommodate diverse biological information, an event-based ontology for representing biological processes at different levels of detail, a database to query information structured in the ontology, and programs to perform hypothesis design and evaluation. We used yeast galactose metabolism (GAL system) to develop our ideas and construct a prototype because this experimental system is well-characterized and has abundant publicly available data of different types[69]. Sequence and annotation data are freely available in the Saccharomyces Genome Database (SGD)[70], gene expression data are available in the presence and absence of galactose[69] and there is also a repository of curated literature information about S. cerevisiae genes and proteins, YPD™ (created by Proteome). To test the HyBrow prototype, we evaluated and ranked alternative hypotheses about the GAL system. During evaluation, HyBrow assayed all stored data for conflicting or supporting evidence for each statement in each hypothesis. HyBrow modified hypotheses which contained errors to generate variants with fewer flaws. Finally, HyBrow combined the resulting determinations of conflict and support to generate evaluations and rankings for all of the original and variant hypotheses. We presented the prototype at the Intelligent Systems in Molecular Biology conference in 2004 in Glasgow. I designed the hypothesis ontology for the GAL system, designed a database to store yeast GAL data structured in the ontology and developed the hypothesis evaluation and revision routines. Steve developed the hypothesis composition and visualization software and Istvan Albert implemented the hypothesis scoring rules.

Hypothesis ontology

For the purpose of constructing and validating hypotheses in HyBrow, we need an Hypothesis Ontology that can represent both agents and interactions in the biological system and is compatible with the conceptual framework outlined before. Although there are several specialized ontologies, there is a need for an ontology that allows users to represent biological processes at multiple levels of resolution, ranging from single events to modular sub-processes. This ontology should allow users to construct alternative hypotheses about biological processes and should be compatible with existing ontologies so that hierarchical relationships can be developed between terms in existing ontologies and this Hypothesis Ontology[12, 27]. For example: In the hypothesis ‘protein A binds to the promoter of gene b’, the process ‘binds to the promoter’ implies that the promoter of gene b has a binding site X and that protein A binds to the site X. In this example, if binds to promoter is a term in the Hypothesis Ontology, while has binding site and binds to site are terms that already exist in the Bioprocess Ontology, then the two ontologies are considered compatible. Although specialized ontologies exist[34, 71], there is no ontology that allows users to represent biological processes in a manner compatible with our conceptual framework. I designed an hypothesis ontology, compatible with HyBrow’s conceptual framework[44], for representing GAL system information. For guidance, we relied upon the principles used to design the Rzhetsky and the Bioprocess ontologies[27, 34] as well as recommendations from the Stanford medical informatics group[72] and the BioHealth Informatics Group at the University of Manchester[26].

The main components of an ontology are concepts, relations, instances, and axioms[26]. Concepts represent groups of objects that share common properties within a domain we want to represent. For example, “protein” is a concept in molecular biology. Relations are logical or natural associations between concepts. There are two kinds of relations: hierarchical, which organize concepts into parent-child tree structures, and associative, which link concepts independent of their hierarchical arrangement. Instances are the actual objects represented by a concept. For example, Gal4p is an instance of the concept “protein”. In most cases, ontologies do not contain any instances. The instances are stored in databases structured according to the ontology. Axioms constrain the properties of concepts and instances. There are various formal methods for specifying an ontology, based on different knowledge representation methodologies such as CKML (Conceptual Knowledge Markup Language), Frames, UML (Unified Modeling Language) and OIL (Ontology Interchange Language)[73]. Depending on how an ontology is specified, it can vary widely in its complexity, ranging from: (i) a controlled vocabulary, such as MeSH terms, which is simply a list of terms defined in natural language; (ii) a taxonomy, such as the Gene Ontology, which is a set of concepts that are arranged into a generalization-specialization hierarchy; (iii) frame-based ontologies, such as the biological process ontology, which represent concepts as “frames” and their properties as “slots”3; (iv) Description Logic (DL) based ontologies, such as the Gene Ontology Next Generation (GONG), which capture declarative knowledge about a domain in terms of concepts, roles (relations) and individuals (instances) using logical predicates. DL allows reasoning about the captured knowledge to automatically create a network of concepts and their inter-relations as well as verify the consistency of the descriptions[73]. Figure 7 shows different types of ontology specifications. The ontology (shown in Figure 8) is designed using Protégé[74], which uses a frame-based representation where each concept is represented by a frame with slots which describe the properties of the concept. The ontology accommodates currently available literature data, extracted primarily from YPD[75], at a coarse level of resolution.

3 A frame is a data structure invented by Marvin Minsky in the 1970s for representing knowledge. A frame is used to represent groups of objects that share common attributes. A slot is a component of the frame data structure and stores a shared attribute of the group of objects represented by the frame.

An event consists of an acting agent (the “subject,” such as gene, RNA, protein), a target agent (the “object,” such as a gene, protein, complex), a relationship (the “verb,” such as induce, repress, bind), a context in which the event takes place, and an optional set of associated conditions (such as the presence or absence of other agents) which accompany the event. Contexts specify where events occur in the cell and under what genetic conditions they occur. The contexts are derived from established ontologies. For example, terms for specifying physical locations in the cell come from the cellular component division of the Gene Ontology. We currently represent genes, proteins, mRNA, small molecules, and complexes of proteins as agents in our prototype. We represent three main categories of relationships: logical (e.g. induce), biochemical (e.g. phosphorylate) and physical (e.g. bind). The construction of event sets from events as well as hypotheses from event sets is governed by a context-free grammar4, which allows the evaluation of the validity of a given sequence of events. Events that occur in the same context are combined to form event sets and an hypothesis consists of event sets linked by logical and temporal operators. An hypothesis must contain at least one event set, which must contain at least one event. Formal specification of this grammar is presented in Appendix A.

4 A context-free grammar (CFG) is a formal grammar in which every production rule is of the form V → w where V is a non-terminal symbol and w is a string consisting of terminals and/or non-terminals. The term "context-free" comes from the fact that the non-terminal V can always be replaced by w, regardless of the context in which it occurs. A formal grammar is a set of rules that mathematically delineates a set of finite-length strings over a finite alphabet. Formal grammars are so named by analogy to grammar in human languages.

Figure 7 Examples of different types of ontology specifications. See text for detailed explanation.

Figure reproduced from “Survey of existing Bio-Ontologies” by Natalia Sklyar[73].


Figure 8 An overview of the ontology used to represent data in an event-centered way. ‘Operators’ are the relationships that can exist between agents.


The key design principle is that the ontology describes a regulatory system in an event-based way consistent with our conceptual framework. Our current hypothesis ontology allows representation of events such as: ‘Gal4p binds to the promoter of the gal1 gene in the presence of galactose in wild type S. cerevisiae’. Depending on the granularity of the ontology, this approach can represent anything from simple protein phosphorylation to the entire cell cycle[52]. The complete ontology along with documentation is available for download at www.hybrow.org. The ontology is “well-formed” because it satisfies the criteria of a well-formed ontology such as clarity, coherence, extendibility and minimal encoding bias, as proposed by Gruber et al and Stevens et al[26, 28]. For example, “extendibility” requires that one should be able to define new terms for special uses based on the existing vocabulary without revision of the existing definitions. The current ontology satisfies this criterion and we have defined new types of agents after the deployment of the prototype system.

Inference rules and constraints

As described in chapter 4, constraints specify classes of forbidden events. Rules are the operations performed upon available information in order to enforce the constraints. Rules generate judgments of support or conflict, depending upon whether or not an assertion is supported by existing knowledge. I have defined constraints and the rules that determine whether or not a constraint is satisfied for each relationship expressible with terms from our ontology. There are several categories of constraints: Ontology constraints determine what agents can participate in which types of biological relationships. For example, a gene cannot transport a gene, but a protein can transport a small molecule or another protein. Annotation constraints determine what annotations are valid for a particular relationship. For example, for the relationship ‘protein A binds to the promoter of gene B’, it is acceptable for protein A to be annotated as localized in the nucleus or cytoplasm but not on the cell membrane. Data constraints determine the required values or properties of the relevant data for the relationship being evaluated. For example, in the case of the binds to promoter relationship, it is required that the promoter of gene B contain one or more binding sites for protein A. Literature constraints require that the relationships should not be contradicted by statements extracted from published papers. Existence constraints require an agent’s presence before it can enter a relationship. For example, a protein cannot perform its function in a genetic context where its gene has been deleted. Temporal constraints govern the transmission of modifications made to an agent by previous events. For example, the event ‘X phosphorylates Y’ implies that in all subsequent events Y is phosphorylated (unless a dephosphorylation event occurs). As noted earlier, rules are the operations performed upon available information in order to enforce the constraints. Each rule has divisions that correspond to the different categories of constraints that exist in HyBrow. The first division deals with ontology constraints, the second with constraints on annotation data in Gene Ontology (GO)[12] format, the third deals with literature-extracted information structured in the ontology, and the fourth with constraints on the specific data type(s) for a relationship, such as promoter sequence in the case of the Binds to promoter relationship. For each constraint that is violated in any division of the rule, the relationship is assigned a ‘conflict.’ For each constraint that is satisfied, the relationship is assigned a ‘support’. If a constraint is neither violated nor supported, a ‘cannot comment’ is assigned. These support, conflict or cannot comment assignments are used by the programs performing hypothesis evaluation. The steps for executing divisions 1, 2 and 3 can be generalized because they have a common structure for different relationships and the operations to be performed on the data are very similar. Division 4 is very specific because of the different ways in which different data types for each relationship must be used. There are additional general divisions that enforce existence and temporal constraints.
For example, the rule for protein A binds to promoter of gene B (shown in Figure 9) has the following divisions: 1) check if A is a protein or a protein-complex and if B is a gene; 2) check whether protein A is annotated a) to have the molecular function of a transcriptional activator or repressor, b) to be involved in the biological process of transcriptional regulation and c) to have a nuclear localization; 3) determine whether the literature reports the postulated event; 4) search the promoter of gene B for a binding site for protein A; 5) ensure that the associated conditions of the event, if specified, are satisfied.

Rules are coded in Perl as hierarchical function libraries to keep the rule set extensible and flexible. The hypothesis evaluation programs load the rule libraries during execution and it is possible to change the rule set without modifying the web interface or the evaluation programs. Most of the constraints enforced by the generalized divisions are stored in database tables which are queried at run time, allowing flexibility for changing the stringency of the constraints. The Perl code for the rule library is available for download at www.hybrow.org.
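The following is a minimal sketch, in the spirit of the rule libraries just described, of how one rule with separate constraint divisions might return its verdicts; the function name, the knowledgebase lookups and the simplifications (e.g. treating a missing annotation as a conflict) are illustrative, and this is not the actual HyBrow rule code.

use strict;
use warnings;

# One rule, one verdict ('support', 'conflict' or 'cannot comment') per division.
sub binds_to_promoter_rule {
    my ( $P, $g, $kb ) = @_;    # acting agent, target gene, knowledgebase handle

    my %verdict;

    # Division 1: ontology constraints on the types of the participants.
    $verdict{ontology} =
      ( ( $kb->{type}{$P} // '' ) eq 'protein' && ( $kb->{type}{$g} // '' ) eq 'gene' )
      ? 'support'
      : 'conflict';

    # Division 2: annotation constraints (here only the cellular component).
    $verdict{annotation} =
      ( $kb->{location}{$P} // '' ) eq 'nucleus' ? 'support' : 'conflict';

    # Division 3: literature constraints.
    $verdict{literature} =
      $kb->{literature}{"$P binds_to_promoter $g"} ? 'support' : 'cannot comment';

    # Division 4: data constraints (a binding site in the promoter sequence).
    $verdict{data} =
      $kb->{binding_sites}{$g}{$P} ? 'support' : 'conflict';

    return \%verdict;
}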


Binds_to_promoter [P, g, +/- of z]

if g is not a gene, give conflict. if P is not a protein or protein complex, give a conflict.

If annotation for P exists: if Molecular function is transcriptional ind/rep give support if Biological process is transcriptional regulation give support if cellular location is not nucleus give conflict if cellular location is nucleus give support

if binds_to_promoter table has entry for the P-g pair, give support. if table has an entry for the negated statement, give conflict

if promoter of g does not have any binding site for P, give conflict. if promoter of gene g has at least one binding site for protein P give support.

if z is specified: if associated condition has entry = z, for the above P-g pair, give support else give 'can not comment'.

# binds_to_promoter [P, g]=> is_site_for (XXX, P) AND Has_site (g,XXX).

Figure 9 Outline of the binds to promoter rule: The ontology constraints are shown in grey, the annotation constraints in green, the literature constraints in blue and the data constraints in brown. The constraints in black check the associated conditions of an event, if specified. The line with ‘#’ shows other relationships in the ontology implied by the binds to promoter relationship. The red terms in the annotation constraints represent Gene Ontology terms and can be modified to vary the stringency of the constraint. P = acting agent, g = target gene, +/- = presence or absence, z = some associated condition.

Database and information gathering

Biological information residing in the published literature and electronic databases is expanding at an accelerating rate. Retrieving information and translating it into our ontology presents several problems because the information is in different repositories and in different storage formats. Further, only a fraction of the published literature is available electronically. The problem of automating extraction of information from the literature is being addressed by a number of research groups, but is far from solved[34, 76, 77]. The most promising approaches appear to be MedScan[78], Textpresso[79] and GeneWays[35], which can parse literature abstracts using an explicit ontology to identify ‘biological events’. But information extraction is still largely manual, practiced by annotators who read papers for relevant concepts and information. Currently, because of the high level of subjectivity involved in interpreting data, there is no immediate solution to this problem and a biologist's discretion is indispensable for interpreting the information, especially from published papers. For example, consider the following information from the literature: Gal4p phosphorylation at ser699 is required for galactose-inducible transcription. This information has to be ‘interpreted’ as: phosphorylated Gal4p induces the galactose-inducible genes in the presence of galactose, and the biologist has to fill in which genes are galactose-inducible using his expertise. We used several different approaches for gathering data and structuring information in our ontology. For data with standardized representation formats, we designed user agents to access the existing public repositories and retrieve desired information. For example, I designed a user-agent to retrieve promoter and gene sequences from the S. cerevisiae Promoter Database[80]. In most cases, we were able to access well-curated information from the Saccharomyces Genome Database[70] directly. We used YPD[75] to access curated literature information about S. cerevisiae genes and proteins. I designed a form-based layout for gathering biological information from YPD reports and filled in predefined table fields compatible with the ontology from specific fields of the YPD report.

For example, the YPD report (presented as a web page) for the Gal4 protein contains an entry “Gene regulation: Repressed by Mig1p”. This sentence implies the following in our ontology: Mig1p is an acting agent, the gal4 gene is the target agent, and repress is a relationship. The final construct in the ontology is: “Mig1p repress gal4 in the nucleus in the wild type S. cerevisiae”. Note that the “nucleus” and “wild type S. cerevisiae” contexts are not mentioned in the YPD report entry and have to be inferred from the preceding sentences of the report and the fact that the current entry refers to a gene, which is in the nucleus. This process can be automated to some extent if direct access to database tables is obtained or can be extended to frame-based ‘loading forms’ used by the EcoCyc and Reactome databases[33, 81]. For quantitative data, such as that from microarray expression profiling experiments, I converted the data to our table format using custom Perl scripts. If microarray data are structured in the MAGE format[82], this task is more straightforward. I also designed a MySQL database and mapped our ontology onto the database for easy extension as our ontology evolved. We chose to create a table in the database for each class in the ontology, at the finest level of resolution. The table has fields for the properties of the relevant class. This creates more tables than would be present in a well-normalized relational database. However, a prototyping effort requires the backend to be easily modifiable in response to changes in the ontology. The backend also contains tables to store constraints used during evaluation. The code for the database schema is available at www.hybrow.org.
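As an illustration of this kind of translation, the following Perl sketch turns the YPD-style statement above into an event record structured according to the ontology; the regular expression and field names are invented for illustration, and the inferred contexts are supplied by hand, just as a curator supplies them in the process described above.

use strict;
use warnings;

# Curated YPD-style entry read from the report of the gal4 gene's product.
my $ypd_entry   = 'Gene regulation: Repressed by Mig1p';
my $report_gene = 'gal4';

my %event;
if ( $ypd_entry =~ /Repressed by (\w+)/ ) {
    %event = (
        subject => $1,              # acting agent extracted from the entry
        verb    => 'repress',
        object  => $report_gene,    # the gene whose report is being read
        context => { physical => 'nucleus', genetic => 'wild type' },    # inferred by the curator
    );
}
print "$event{subject} $event{verb} $event{object}\n";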


User interfaces

An important feature of HyBrow is that it is easy for the user to construct a computer-readable formal language hypothesis. We have created two interfaces for this purpose: a visual interface (Figure 10 left panel) and a widget interface (Figure 10 right panel). Steve wrote the code for the visual interfaces whereas I served as a tester and bug reporter. Our visual interface allows users to construct hypotheses using a visual notation constructed in accordance with proposed conventions[83, 84]. This interface allows users to draw diagrams which are then automatically translated into formal language hypotheses. The widget interface allows the user to write hypotheses in English-like “sentences” constructed using subject/verb/object selection menus. A user can construct portions of an hypothesis using different interfaces and then combine them. Hypotheses are saved to local files and then submitted for evaluation via the web. Details on how to use the tools are provided in Appendix B and source code is available at www.hybrow.org.

Figure 10 Screen shots of the visual and widget interface used to construct hypotheses.

The hypothesis evaluation process

The hypothesis evaluation process is illustrated diagrammatically in Figure 11. When HyBrow receives an hypothesis, it checks the connections between events and event sets for conformity with the hypothesis grammar. If the hypothesis passes these tests for syntax, each event is then checked for validity using the appropriate rule for the relationship proposed in the event. For each event, a support, conflict or cannot comment result corresponding to each of the four divisions of the inference rules is returned.

Finally, the support and conflict calls are tallied based upon the logical structure of the hypothesis. Each ‘and’ between event sets leads to the inclusion of results from both sets in the final tally. For each ‘or’ connection, the ‘better’ set is chosen using a hierarchical set of rules. [Sample rules: 1) an event set with conflicts is better than an event set with more conflicts and worse than one with fewer conflicts; 2) an event set for which all events have at least some support is better than an event set for which at least one event is not supported; 3) if one event set's support is a strict superset of another event set's support, the superset is superior; …] We apply these rules sequentially until one of the rules returns a clear decision. For each event, the hypothesis evaluation process finds all conflicts with existing knowledge and indexes them, along with their sources. These are reported to the user to allow the user to identify specific problems with the hypothesis and the conflicting data source. For each event that has a conflict, a set of variant events is generated using biologically motivated heuristics, such as replacing the acting agent with agents that share a sequence similarity or a similar cellular localization with the original agent. Neighboring hypotheses that share the logical structure of the original are generated by replacing conflicting events with the best variant events. These neighboring hypotheses are then evaluated, and if a better (more supported, less conflicted) hypothesis is found, it is presented to the user. After evaluation, the user is shown 1) the support and conflict totals, 2) the least conflicted, most supported event sets that fit the logical structure of the hypothesis, 3) a support-conflict scatter plot of neighboring hypotheses automatically generated from the user-submitted hypothesis, and 4) a list of all events that had conflicts, the data that triggered the conflicts, an explanation of why the rules interpret that data as a conflict, and a reference to the original article or data source. The results pages (Figure 12) allow a user to gauge the ‘fitness’ of his/her hypothesis in the light of all stored data. Iterative refinement of the hypothesis allows the user to reconcile all existing data into a single coherent representation whose level of detail depends on the resolution of the ontology used for constructing hypotheses.
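The following Perl sketch illustrates this tallying scheme; the scores, the single ‘better set’ rule and the data layout are simplified for illustration and do not reproduce the full hierarchical rule set.

use strict;
use warnings;

# Pick the better of two event-set scores: here only sample rule 1 (fewer conflicts wins).
sub better_set {
    my ( $a, $b ) = @_;
    return $a->{conflict} <= $b->{conflict} ? $a : $b;
}

# Tally event-set scores following the logical structure of the hypothesis:
# 'and' adds both sets to the totals, 'or' keeps only the better-scoring set.
sub tally {
    my ( $sets, $ops ) = @_;    # e.g. [set0, set1, set2] joined by ['and', 'or']
    my %total = %{ $sets->[0] };
    for my $i ( 0 .. $#{$ops} ) {
        if ( $ops->[$i] eq 'and' ) {
            $total{$_} += $sets->[ $i + 1 ]{$_} for qw(support conflict);
        }
        else {
            %total = %{ better_set( \%total, $sets->[ $i + 1 ] ) };
        }
    }
    return \%total;
}

my $score = tally(
    [ { support => 3, conflict => 1 }, { support => 2, conflict => 0 } ],
    ['and'],
);
print "support=$score->{support} conflict=$score->{conflict}\n";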


Figure 11 The hypothesis evaluation process: The visual or the widget interface is used to design hypotheses, which are sent to the server via a browser. The hypothesis parser is the entry point for the system and it uses the event handler, which manages the event library, ranking, justification and event neighborhood generation. The database stores the different data structured into ‘events’.

Figure 12 Screen shot of the result page. A: representation of an hypothesis in terms of events (ev = event). B: holding the mouse on a neighboring hypothesis (b1) shows which event was replaced to create it. C: plot of the support versus conflicts for the submitted and neighboring hypotheses (n1, b1); clicking on n1 submits that hypothesis as a ‘seed’.

Test runs with sample hypotheses

In order to test the prototype, we composed hypotheses about the GAL system and ranked them. The GAL system contains a genetic regulatory switch, as a result of which enzymes required specifically for transport and catabolism of galactose are expressed only when galactose is present and repressing agents such as glucose are absent. Galactose utilization occurs by a biochemical pathway that converts galactose into glucose-6-phosphate and the regulatory network controls whether the pathway is on or off[85]. The process involves three types of proteins: 1) A permease (Gal2p) that transports galactose into the cell (encoded by the gal2 gene). 2) The proteins that utilize intracellular galactose; galactokinase (encoded by gal1), uridylyltransferase (encoded by gal7), epimerase (encoded by gal10), and phosphoglucomutase (encoded by gal5). 3) The regulatory proteins Gal3p, Gal4p and Gal80p, which exert tight transcriptional control over the gene encoding the transporter, the enzymes, and to a certain extent, their own genes. Gal4p is a DNA-binding factor that can strongly activate transcription, but in the absence of galactose, Gal80p binds Gal4p and inhibits its activity. When galactose is present in the cell, it causes Gal3p to bind Gal80p. This causes Gal80p to release its repression of Gal4p, so that the gal2, gal1, gal7, gal10 and gal5 genes are expressed at a high level[69]. HyBrow successfully identified the hypothesis that best explained the current understanding about GAL system regulation[69, 85]. For 6 of the 7 events that had conflicts, HyBrow was also able to successfully suggest corrections that increased agreement with stored information. All hypotheses used in the tests and explanations of their evaluations are available for download at www.hybrow.org. Here I will describe the evaluation of a simple illustrative hypothesis: “Gal2p transports galactose into the cell at the cell membrane. In the cytoplasm, galactose activates Gal3p. Gal3p binds to the promoter of the gal1 gene and induces its transcription in the presence of galactose”. This hypothesis was decomposed into events as shown in Figure 12 A. On evaluation, HyBrow reported support from literature and GO annotation for event number 0 (ev0), support from literature for ev1, support from ontology constraints and annotation for ev2 and support from the ontology, literature and data divisions for ev3. It reported a conflict for ev3 (which is marked in red) from the annotation rule division because Gal3p is annotated to be primarily in the cytoplasm[70]. HyBrow then searched for variant events. For ev3 it found an event (Gal4p binds to promoter of gal1) with higher support and for ev4 it found the more meaningful event (Gal4p induces gal1 in nucleus in wt in presence of galactose) with the same support but no conflict. These events were inserted in place of the original events to create a neighboring hypothesis that is better than the original hypothesis (Figure 12 B, C). When a submitted event contains a perturbation, such as the deletion of a gene, HyBrow identifies the agents disabled because of the perturbation and infers a conflict with events that depend on those agents. For example, if the submitted event is: “Gal3p induce gal1 in nucleus in gal3-K/O”, HyBrow reports a conflict. (The event “Gal3p not induces gal1 in nucleus in gal3-K/O” gets support.)
Some of the inferences suggested by HyBrow are obvious for the small GAL system, but HyBrow’s ability to automate the process offers a substantial advantage for systems containing large numbers of genes and proteins.
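The sketch below (Python, not the actual HyBrow implementation) illustrates the idea of decomposing such a hypothesis into structured events and scoring each event against stored assertions grouped into divisions; the event fields, the toy knowledge entries and the conflict rule are simplified assumptions made for illustration only.

# Minimal sketch of event-based hypothesis evaluation (hypothetical data model,
# not the actual HyBrow schema).

from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    agent: str            # acting agent, e.g. "Gal2p"
    process: str          # e.g. "transports", "activates", "induces"
    target: str           # e.g. "galactose", "gal1"
    location: str         # cellular compartment
    context: str = "wt"   # genetic/experimental context

# A toy "knowledge store": each division maps known assertions to a citation.
KNOWLEDGE = {
    "literature": {
        Event("Gal2p", "transports", "galactose", "cell membrane"): "[69]",
        Event("galactose", "activates", "Gal3p", "cytoplasm"): "[85]",
        Event("Gal4p", "induces", "gal1", "nucleus"): "[69]",
    },
    "annotation": {
        # GO-style localization: Gal3p is annotated as cytoplasmic, so an event
        # placing it in the nucleus conflicts with this division.
        Event("Gal3p", "located_in", "cytoplasm", "cytoplasm"): "[70]",
    },
}

def evaluate(event: Event):
    """Return (support, conflicts) lists of (division, citation) pairs."""
    support, conflicts = [], []
    for division, assertions in KNOWLEDGE.items():
        for known, citation in assertions.items():
            if known == event:
                support.append((division, citation))
            # toy conflict rule: the same agent annotated to a different compartment
            elif (division == "annotation" and known.agent == event.agent
                  and known.target != event.location):
                conflicts.append((division, citation))
    return support, conflicts

hypothesis = [
    Event("Gal2p", "transports", "galactose", "cell membrane"),
    Event("galactose", "activates", "Gal3p", "cytoplasm"),
    Event("Gal3p", "induces", "gal1", "nucleus"),   # conflicts with the annotation division
]

for ev in hypothesis:
    s, c = evaluate(ev)
    print(ev.agent, ev.process, ev.target, "| support:", s, "| conflicts:", c)

In a real evaluation, each division would be backed by a database of curated assertions and rules rather than a hard-coded dictionary, and the conflict rules would come from the ontology.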


Chapter 6: Lessons learned from the prototype

The HyBrow prototype allows the construction and evaluation of hypotheses expressed in familiar diagram-based or intuitive text-based formats, to aid in integrating data into working models. HyBrow’s methodology is evaluation-based. Thus, unlike systems that construct statistical or equation-based models, HyBrow is able to provide explicit reasons (and references) for its output. HyBrow does not force the user to accept all of the data, nor does it judge the validity of stored information. Rather, it gives the user links to the exact source of each conflict, leaving it up to the user to judge the relative merits of information sources. The user can choose to ignore conflicts from data sources deemed unreliable. In the test runs, HyBrow identified the least conflicted hypothesis accurately and suggested valid ‘corrections’ for events with conflicts. I believe that we can build upon this success to extend and strengthen HyBrow in several ways.

Currently, improvements to hypotheses are suggested using neighboring events generated with simple heuristics (a rough sketch of one such heuristic follows below), while our conceptual framework supports neighborhood functions that create similar event sets from a given event set[44] or create similar event sets using event sets from a related organism as a “template”. Extending HyBrow to use neighborhoods of event sets, as well as of events, requires new evaluation routines that track all of the modifications that an event set generates and ensure that the neighboring event sets satisfy them.

The current rule library contains rules for ‘extrapolations’ in the presence of perturbations such as gene knock-outs and constitutive over-expression. We also need to incorporate rules for such extrapolations in the presence of chemical agents that block certain processes. To accomplish that, we can define experimental contexts for events, representing the presence or absence of such agents. We can then formulate rules indicating how a specific experimental context modifies the properties of acting agents in an event occurring in that context.

Our current implementation can only propagate temporal constraints about the presence or absence of biochemical modifications. We need to propagate constraints about activation/inhibition and induction/repression in an attempt to model how an event affects downstream agents. To that end, we can extend the current ontology to include ‘modification state’ and ‘activation state’ descriptors for agents; previous events can then modify these state descriptors. In the current prototype, existence constraints are checked only once, at the start of the hypothesis evaluation, and hence we cannot represent interactions such as gene knock-downs. To accomplish the dynamic checking of existence constraints, we can include an ‘existence state’ descriptor for agents; previous events and experimental contexts can then modify this state descriptor to enable or disable an agent.

Finally, HyBrow can identify events that are frequently specified, but for which evaluation was not possible. Identifying such events will allow HyBrow to aid experiment design. For instance, if many users include an event in their hypotheses and there is no experimental evidence for it, HyBrow can indicate a need to obtain such data.
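As a rough illustration of the kind of simple heuristic mentioned above (again, not HyBrow’s actual code), neighboring events can be generated by substituting one slot of an event with a related value; the similarity groups below are invented for the example.

# Sketch: generate "neighboring" events by swapping one field of an event for a
# related value (illustrative similarity groups, not HyBrow's real heuristics).

# Hypothetical groups of interchangeable values for each event slot.
SIMILAR = {
    "agent":   [{"Gal3p", "Gal4p", "Gal80p"}],
    "process": [{"induces", "not_induces"}, {"binds", "releases"}],
    "target":  [{"gal1", "gal7", "gal10"}],
}

def neighbors(event: dict):
    """Yield events differing from `event` in exactly one slot."""
    for slot, groups in SIMILAR.items():
        for group in groups:
            if event[slot] in group:
                for alternative in group - {event[slot]}:
                    variant = dict(event)
                    variant[slot] = alternative
                    yield variant

seed = {"agent": "Gal3p", "process": "induces", "target": "gal1"}
for variant in neighbors(seed):
    print(variant)

Each variant could then be evaluated in the same way as the original event, and the best-scoring variant substituted to form a neighboring hypothesis.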

Revision of the hypothesis ontology

After building the prototype HyBrow system, I made several revisions to the hypothesis ontology to make it more expressive by including experimental contexts as well as additional descriptors for agents. In the new ontology, each agent has additional descriptors that store its activation state, modification state and existence state. These state descriptors allow us to track changes made to agents over time and account for their effect later on. We have also introduced additional agents such as miRNA.

Figure 13 Properties of agents in the revised ontology. The figure shows the descriptors, such as activation_state, that are stored for each agent and the types of agents that exist in the current ontology.
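A minimal sketch of the agent descriptors shown in Figure 13, together with an experimental context that disables an agent; the field names follow the figure, but the class layout and the update rule are assumptions made for illustration.

# Sketch of agent state descriptors and an experimental context (field names
# follow Figure 13; the update logic is an illustrative assumption).

from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    agent_type: str = "protein"          # e.g. protein, gene, miRNA
    activation_state: str = "inactive"   # inactive / active
    modification_state: set = field(default_factory=set)  # e.g. {"phosphorylated"}
    existence_state: str = "present"     # present / absent

@dataclass
class ExperimentalContext:
    name: str
    absent_agents: set = field(default_factory=set)  # e.g. knocked-out genes

    def apply(self, agent: Agent) -> Agent:
        """Return a copy of the agent with its existence state adjusted."""
        updated = Agent(agent.name, agent.agent_type, agent.activation_state,
                        set(agent.modification_state), agent.existence_state)
        if agent.name in self.absent_agents:
            updated.existence_state = "absent"
        return updated

gal3 = Agent("Gal3p")
ko = ExperimentalContext("gal3-K/O", absent_agents={"Gal3p"})
print(ko.apply(gal3))   # existence_state becomes "absent"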

The new ontology incorporates the notion of an experimental context that allows us to specify the presence or absence of agents. The experimental context then modifies the allowed molecular functions, as well as the modification, activation and existence state descriptors of agents in an event occurring in that context. Moreover, the experimental context itself can be modified by previous events, allowing us to model complex situations such as the ‘knock-down’ of a gene using a miRNA. The new ontology also contains a developmental stage context that allows a user to specify the developmental stage (using a controlled vocabulary) at which a particular relationship holds true. This allows us to represent situations where the function of a protein differs at different developmental stages. All of these modifications make the ontology more expressive and allow the representation of more complex hypotheses in a formal manner. Currently, the ontology has 57 different relationships and 11 agent types and is available for download at http://wartik19.biotec.psu.edu/Docs/Ontology/dev.

Revision of the ontology requires a corresponding change in the backend of the relational database associated with it. While prototyping HyBrow, I mapped the ontology to a relational backend manually, using the heuristic of creating a table in the database for each class in the ontology, at the finest level of resolution. Each table contained fields for the properties of the relevant class. In subsequent work, I have adapted the Protégé-to-MySQL converter developed by the Reactome group to automatically create a corresponding relational backend from our ontology. This allows us to use Protégé as a knowledge-gathering tool to create a knowledgebase and then convert it to a relational database automatically.

Bottleneck for structuring data and role of knowledgebases

During our prototyping effort, the most tedious and time-consuming phase was the gathering and structuring of literature data. Even though we had access to YPD, I had to read each gene/protein report and enter the information into loading forms manually. Clearly, this is not a scalable approach. Natural language processing methods have been applied to address this problem. The most successful approaches, such as MedScan, Textpresso and Geneways, use an explicit ontology and semantic rules to mark up text from paper abstracts or full text, indexing all sentences containing a particular concept from the ontology[35, 78, 79]. It is then possible to use predefined ‘templates’ for relationships between entities and search the marked-up text to automatically identify papers containing a particular relationship. For example, a sentence such as “In the npr1 mutant, ozone fails to induce the WRKY33 gene” can be marked up in the following manner:

In the npr1 mutant ozone fails to induce the WRKY33 gene.

Semantic rules, such as [a-z]{3}\d+\s+(mutant|knock out)?, declare that a genetic context, in which the entity under consideration is disabled, is implied when the text contains a string of three lower-case letters followed by one or more numbers, a space and, optionally, the words ‘mutant’ or ‘knock out’; and that ‘fails to’ is a form of negation. From this marked-up text, using the semantic expressions and their interpretation, we can automatically extract a biological event structured in our ontology: ozone does-not-induce WRKY33 in the npr1-K/O context. (A rough sketch of such rule-based extraction is given at the end of this section.)

A variety of integration approaches have been developed for other data types, such as gene and protein sequences, GO annotations, conserved protein domains and microarray expression data. One simple approach is to use customizable user agents, created using tools such as myWEST[86] or programmed in Perl, which repeatedly execute a fixed set of queries on a large database such as the Saccharomyces Genome Database. More ambitious proposals include adopting emerging technologies such as web services and the Semantic Web. BioMoby is a proposal to create a data integration system using web services[87], which are collections of protocols and standards for exchanging data over the internet between software applications written in different programming languages and running on different platforms. The LSID is an attempt to create a Semantic Web-compliant naming scheme, based on the type, source and unique identifier of an entity, for identifying data about that entity stored in different public databases, in order to overcome the limitations of naming schemes in use today[88]. (The Semantic Web is a project that intends to create a universal medium for information exchange by giving meaning (semantics), in a manner understandable by machines, to the content of documents on the Web. Currently under the direction of its creator, Tim Berners-Lee, the Semantic Web extends the ability of the World Wide Web through the use of standards, markup languages and related processing tools.) Both of these projects are in their preliminary stages and it is not clear which approach is better.

Even though there are considerable efforts aimed at both structuring literature data and providing easy access to other data types, currently there is no optimal solution to these problems. It is becoming clear that capturing and representing complex knowledge is tedious, time consuming, difficult and expensive[89]. Therefore, at the moment, knowledgebases such as Reactome that structure knowledge about biological pathways in an event-centered manner are the best available resources for tools like HyBrow.
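As a rough sketch of the rule-based extraction idea described above (the patterns are simplified assumptions and are not the rules used by MedScan, Textpresso or Geneways), a few regular expressions can flag the genetic context and the negated verb in the example sentence and emit a structured event:

# Sketch: rule-based extraction of a structured event from a sentence
# (simplified, illustrative patterns only).

import re

sentence = "In the npr1 mutant, ozone fails to induce the WRKY33 gene"

# Genetic context: three lower-case letters, digits, then 'mutant' or 'knock out'.
context_rule = re.compile(r"\b([a-z]{3}\d+)\s+(?:mutant|knock\s*out)\b")
# Negated induction: 'fails to induce' treated as a form of negation.
negation_rule = re.compile(r"\bfails to induce\b")
# Very naive gene-name rule: an upper-case token followed by digits.
gene_rule = re.compile(r"\b([A-Z]+\d+)\b")

context = context_rule.search(sentence)
negated = negation_rule.search(sentence)
gene = gene_rule.search(sentence)

if context and negated and gene:
    event = {
        "agent": "ozone",                # hard-coded here; a real system would identify the agent too
        "relation": "does-not-induce",   # 'fails to induce' interpreted as negation
        "target": gene.group(1),
        "context": context.group(1) + "-K/O",
    }
    print(event)
    # {'agent': 'ozone', 'relation': 'does-not-induce', 'target': 'WRKY33',
    #  'context': 'npr1-K/O'}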


Chapter 7: Comparison with related efforts

Awareness is growing that there is a need to represent information about biological processes in a formal representation that can be the basis for unambiguous descriptions of hypotheses and models[2, 3]. As a result, there are increasing efforts to formulate formal descriptions of biological processes, ranging from domain knowledge models and ontologies, which explicitly specify the biological concepts and the relationships among them, to special representation languages which can “simulate” a process. In this chapter I will revisit the efforts described in chapter 3 and highlight their differences from our work.

The efforts from the biology community are aimed at developing shared domain knowledge models (ontologies) and symbolic notations for unambiguously representing and sharing information, along with constraints on what can and cannot be expressed using them. The main efforts in this direction are the Riboweb project and the “modeling biological processes as workflows” project by Russ Altman’s group[27, 53]. The efforts from the computer science and engineering community are aimed at developing computable descriptions of biological processes based on formal languages, logic and reasoning methods, along with associated tools for simulation. Such descriptions are derived from methods used to represent concurrent computer processes and allow the representation of biological systems at high levels of abstraction[90, 91]. The ultimate goal of both of these efforts is to precisely and formally describe both a complex biological system and the relevant information about it.

Our work attempts to bridge the two approaches by demonstrating the equivalence between ontologies and formal languages, and by developing a unified framework for integrating diverse data through a strong correspondence between an ontology and a conceptual representation[44, 59, 92]. Moreover, instead of taking an existing conceptual representation and retrofitting it with a domain ontology, we have developed both the ontology and the conceptual framework in sync from the outset.


The Riboweb project

Riboweb is the research project that is closest in spirit to HyBrow. As mentioned before, Riboweb is a prototype online data resource for the ribosome. It comprises a large knowledgebase of relevant published data, computational modules that can process these data to test hypotheses about the ribosome’s structure, and a web-based user interface. Riboweb serves a user community of about 20 labs studying the structure of ribosomes[54], whereas HyBrow is a proof-of-concept system. HyBrow extends the ideas underlying Riboweb in several ways: 1) HyBrow covers more data types than Riboweb, which only stores data about the 3D structure of the ribosome. 2) HyBrow contains an explicit conceptual framework to model biological processes, whereas Riboweb contains a list of methods that can be applied to the stored data to create a 3D structure. 3) HyBrow allows users to submit hypotheses about biological processes and evaluate them in the light of data, whereas Riboweb allows a user to submit their structural data, or use stored data, and apply the listed analysis methods to visualize and compare the resulting 3D structures; the hypothesis about the ribosome’s structure is implied by the data. 4) Only HyBrow can suggest revisions to submitted hypotheses based on “hypothesis neighborhoods”.

Modeling biological processes as workflows

The “modeling of biological processes as workflows” project by Peleg et al. shares similarities with HyBrow in having an explicit ontology for describing biological processes and a mapping to a conceptual framework that can support qualitative reasoning. Peleg et al. use hybrid Petri Nets as the underlying conceptual framework[27], which can support qualitative reasoning, represent “rules” that constrain what processes (represented as transitions in the Petri Net) can or cannot occur, and represent biological processes at varying levels of resolution. The ontology used in this project was created by merging aspects of the TAMBIS ontology[56] and an ontology used to describe business processes[27]. TAMBIS provides the biology-specific entities, whereas the business process ontology provides concepts to represent parallel, competing, alternative or sequential processes. Existing knowledge is captured using Protégé and stored as text files. A model of a biological process is created using a Protégé plug-in to draw the process. However, the qualitative reasoning about the biological processes has to be performed manually[27]. The qualitative analysis can answer queries such as: will a certain reaction happen in the future? Is a certain state of the cell reachable if a given event is blocked? However, there is no automated way of querying the stored data for support or contradiction of individual events in the described process, and Petri Nets also do not support “neighborhoods”. As a result, it is not possible to identify contradicted events and suggest revisions to a model. Moreover, all the data are stored as flat files accessed by Protégé, which imposes limits on the kinds and amounts of data that can be handled.

Pathway logic

Pathway Logic is an algebraic framework enabling the symbolic analysis of biological signaling pathways[55, 57]. The goals of the Pathway Logic work are to build biological process models which can assist in the generation of informed hypotheses about complex biological processes and which can be interactively modified by biologists[57]. Pathway Logic attempts to algebraically formalize the biologists’ reasoning, which consists of informal notations and potentially ambiguous representations of important concepts like pathways, cycles, and feedback loops. As described before, Pathway Logic models are developed using the Maude (http://maude.csl.sri.com) system, a formal language and tool set based on rewriting logic[57]. The Maude system provides the model-checking capabilities for the Pathway Logic framework. A model of a biological process is represented as a collection of rewrite rules together with the algebraic entity declarations. Rewriting logic then allows reasoning about possible changes given the basic rules already defined. Models can be exported to the Bionet viewer tool, which lays out the models as Petri Nets. Pathway Logic was first used to create models of signal transduction networks, including the Epidermal Growth Factor Receptor (EGFR) network and closely related networks[55]. In subsequent work, Talcott et al.[57] created a more detailed representation of signaling proteins in which protein functional domains, as well as rewriting rules for interactions between domains, are explicitly defined.

The Pathway Logic framework does not have an explicit ontology associated with it; therefore, converting existing information into that framework is a bigger bottleneck than it is for us. Moreover, each time a new model is desired, a domain expert proficient in using Maude has to encode the model in Maude syntax after reading the relevant literature. During the process of model-checking, the framework cannot provide explicit references for violations of its rewriting rules by a model. The Pathway Logic framework also does not support the notion of “neighborhoods” and therefore cannot create alternative models from a user-submitted model. It also does not allow for comparing model similarity within or across organisms.

Summary

There are increasing efforts in both the molecular biology and computer science communities to formulate formal descriptions of biological processes, ranging from domain knowledge models and ontologies to special representation languages. The efforts from biologist groups are geared towards explicitly representing existing knowledge about biological processes, whereas the efforts from computer science groups are geared towards developing frameworks that allow formal representation of, and reasoning about, biological processes. Our work synchronizes these two related efforts to develop a unified representation framework, by establishing a strong correspondence between a biological knowledge model and a conceptual framework and by developing them in a tightly integrated manner.

Role of Knowledgebases

Our effort to develop the HyBrow prototype is often misinterpreted as being similar to developing a knowledgebase. A knowledgebase is a body of formally represented knowledge, based on an ontology, that stores information at a high level of abstraction. Specifically, a pathway knowledgebase stores formal descriptions of agreed-upon models of biological processes, much like an online textbook of models, in a manner that allows their computational manipulation. EcoCyc is a knowledgebase of metabolic pathways in E. coli. It is based on a detailed ontology for describing biological entities, their functions and the processes they participate in. EcoCyc allows complex querying of this knowledge, but it lacks an associated conceptual framework to evaluate and rank alternative representations of a metabolic pathway[93]. Reactome is a knowledgebase that structures knowledge about the biological pathways of Homo sapiens in an event-centered manner[52] and serves as a free public resource of structured information.

HyBrow differs fundamentally from knowledgebases such as EcoCyc and Reactome. HyBrow uses knowledgebases as a source of curated, structured descriptions of biological process models. The purpose of HyBrow is to make the unit components of such descriptions available as ‘building blocks’ to users for building their own alternative models and evaluating them in the light of available information. In fact, given the bottleneck for structuring existing literature information, knowledgebases like Reactome that structure current knowledge about biological pathways in an event-centered manner are the best available resources for tools like HyBrow.


Chapter 8: Proofreading the Reactome knowledgebase

Credits: This chapter describes unpublished work on which Stephen Racunas and I worked very closely. Stephen Racunas accomplished all the theoretical work of adapting the conceptual framework for assessing trustworthiness. He conceived a formal language for Reactome, specified the formulation of models from that formal language, and developed the logical framework for checking models (described under the methods section). I mapped the models to Reactome by establishing the correspondence between models and the pathways stored in Reactome. We jointly defined the list of event relationships for pathway models, specified the list of testable properties for pathway models, and formulated the list of desired properties for trustworthiness for Reactome. I programmed the tests to verify the desired properties for trustworthiness and compiled the results. We jointly wrote the manuscript, which is currently under review.

Numerous biological databases have been developed for storing and querying the rapidly accumulating data[81]. At the same time, biological process databases such as Reactome have begun to represent information at a high enough level of abstraction to be designated knowledgebases. The structured information stored in pathway knowledgebases has the potential to be more immediately useful for tools such as HyBrow than the raw data stored in more conventional databases. We believe that evaluating and proofreading knowledgebases is an important step toward the overall goal of using knowledgebases like Reactome as resources for information integration and computer-aided reasoning about biological processes. The usefulness of a knowledgebase such as Reactome to a tool like HyBrow depends on characteristics we designate as trustworthiness and expressiveness. The trustworthiness of a knowledgebase is a measure of its quality and completeness, while the expressiveness of a knowledgebase is reflected in such properties as the complexity and sophistication of the queries it will support and its ability to represent biological systems at multiple scales.

To be trustworthy, a knowledgebase should be complete, free of internal conflicts, explain as many steps as possible in each pathway, and provide the most complete pathway descriptions possible. Omissions, inconsistencies, errors in the order of steps in a pathway, missing steps, extra steps and self-reflexive loops all limit the utility of the knowledgebase. Thus a trustworthy knowledgebase should be complete, consistent, direct, gap-free, well-formed and acircular. In this chapter, I will describe how we adapted our conceptual framework for assessing a knowledgebase's trustworthiness, define each of these properties and then present the results of applying the tests for trustworthiness to releases 10 and 11 of the Reactome knowledgebase.

Background

The Reactome project, a collaborative effort involving the Cold Spring Harbor Laboratory, the European Bioinformatics Institute, and the Gene Ontology Consortium, is developing a knowledgebase comprising the core pathways and reactions in human biology. The information in the Reactome knowledgebase is authored by expert biological researchers and maintained by the Reactome editorial staff.

The basic unit of Reactome is the reaction, defined as any biological event that converts inputs to outputs[81]. The inputs and outputs of a reaction are entities such as small molecules, proteins, lipids or nucleic acids, or complexes of these. Reactions include not only the chemical conversion of one set of entities to another, but also the formation and dissociation of complexes and the transport of entities from one compartment to another. Reactome's definition of reaction encompasses classical biochemical reactions (for example, the phosphorylation of glucose to glucose-6-phosphate), as well as events such as binding, dissociation, complex formation, translocation and conformational changes. In addition to inputs and outputs, a reaction may include information on the organism, the sub-cellular location, and the experimental evidence for the reaction in the form of one or more literature citations. Other attributes of reactions include a catalytic activity and regulatory information. The defining attributes of a reaction are its input, output, and catalyst activity. Therefore, reactions which have identical substrates and products, but differ in catalyst, are stored as separate, distinct reactions. Similarly, chemically identical entities in different cellular compartments are represented as distinct entities; thus, for example, extracellular and cytosolic glucose are stored as separate entries. Entities in different biochemical modification states are also represented as separate entities. The p53 protein, for example, is represented by three distinct entities: native p53, p53 phosphorylated at Ser15 and p53 phosphorylated at Ser20. Such multiple states are derived from a single base entity called the Reference Entity, which contains information common to all the states. Reactome events can also contain generic physical entities such as ‘any tRNA’[81]. A concrete reaction is defined as a reaction in which all inputs, outputs, and catalysts are concrete physical entities (entities that represent a single instance of a gene, protein or chemical), and in which the conversion of inputs to outputs occurs in a single step.

Reactions are grouped into causal sequences to form pathways[81] that take into account the reactions’ temporal relationships and interdependencies. Pathways in Reactome are groupings of functionally related reactions, and can contain sequential reactions, parallel reactions or reactions ordered in a cycle[81]. Pathways can also nest; that is, pathways can have other pathways as their components, and these can be sequential or parallel[81]. A concrete pathway is a multi-step concrete event whose components are concrete reactions, concrete pathways, or both. If a concrete pathway contains sequential reactions, they are displayed in the order they occur. Reactome also stores a specific ‘preceding event’ relationship that defines the exact order of two reactions.
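The following sketch is a simplified, illustrative stand-in for the reaction and pathway records described above; the field names approximate, but are not, Reactome's actual data model. It is included only to make the tests described later easier to follow.

# Simplified stand-in for Reactome's reaction and pathway records
# (approximate field names; not the actual Reactome schema).

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Reaction:
    reaction_id: int
    inputs: Set[str]              # physical entities consumed
    outputs: Set[str]             # physical entities produced
    catalysts: Set[str] = field(default_factory=set)
    compartment: str = "cytosol"
    preceding: Set[int] = field(default_factory=set)   # 'preceding event' ids

@dataclass
class Pathway:
    pathway_id: int
    reactions: List[Reaction]     # listed in pathway order

# Chemically identical entities in different compartments are distinct entries,
# so they get distinct names here as well.
hexokinase_step = Reaction(
    reaction_id=1,
    inputs={"glucose [cytosol]", "ATP [cytosol]"},
    outputs={"glucose-6-phosphate [cytosol]", "ADP [cytosol]"},
    catalysts={"hexokinase [cytosol]"},
)
glycolysis_fragment = Pathway(pathway_id=100, reactions=[hexokinase_step])
print(glycolysis_fragment)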

Methods

Knowledgebases conceptualize the domain for which they store data in an ontology. The knowledgebase ontology can be used to generate a formal language describing the domain. In chapter four we saw that a logic associated with a formal language provides a set of rules for constructing and verifying formulas, which are testable expressions formulated using terms from the language. We also saw that, in our conceptual framework, a model is an abstract formal representation of entities, relations, and transformations among them within the system[60], and that for every expression constructed using a formal language with an associated logic, a model provides a testable explanation implied by the logic. Thus, if we can make the formal language and the associated logic of a knowledgebase explicit, we can treat the contents of the knowledgebase as a collection of models compatible with our conceptual framework. Specifically, a knowledgebase of biological processes can be treated as a collection of models representing reactions, pathways and collections of pathways.

We first show that the Reactome framework implicitly generates a formal language and make this language explicit. We then provide a definition of a model expressed in the Reactome language and specify a logic using which we can perform model-checking. We use this logic to check pathways by testing for certain desirable properties, the aggregate of which we designate as the trustworthiness of the knowledgebase.

We have specified the tests for trustworthiness of pathway knowledgebases in our conceptual framework because doing so has a number of advantages. First, it allows us to cast knowledgebase verification in the broader context of computer-aided hypothesis evaluation, thus building a strong connection to the computer-aided hypothesis composition and evaluation done by HyBrow. It also allows us to verify our potential data sources in the same logical environment in which they will eventually be used. Moreover, building a logic for verifying knowledgebases allows us to extend our formal representation in ways that are useful for phrasing system-level hypotheses.

A formal language for Reactome

In Reactome, an event is any biological process in which input entities are converted to output entities in one or more steps. To avoid confusion, we restrict the use of the term event to events that are concrete reactions and use the term pathway for a concrete pathway. In order to specify a formal language for a logic, we need a set of functions, a set of relations and a set of constants. Formally, a language L is a triple L = 〈F, R, C〉 where:

• F is a set of function symbols f, each with positive integer arity n_f,
• R is a set of relation symbols R, each with non-negative integer arity n_R, and
• C is a set of constant symbols c.

The concepts defined in the Reactome framework can be combined to satisfy the requirements of a formal language. We make the following translations:

• Reactome’s physical entities provide the constants of the language.
• Reactions that transform one version of a physical entity into another version, form or dissociate complexes, or transport entities into different contexts form the functions of the Reactome language.
• Reactions not represented with functions become relations.
• Temporal relationships are expressed by the ‘precedes’ relation.

Models from a formal language

The language we have derived for Reactome and the event relationships we have defined allow us to define formal models[60]. Let L be a language. A model (or L-structure) is a tuple M = 〈M, F^M, R^M, C^M〉, where M is a set called the universe, containing all objects denoted by constants or functions of constants from L; F^M is a set of functions, R^M is a set of relations, and C^M is a set of constants in M. We restrict our attention to finite models[59], which can represent anything from an event, to a pathway, to a set of pathways, such as “Insulin receptor mediated signaling,” by specifying a hierarchy of constants, relations, and functions. For Reactome pathway-models, the constants are individual Reactome events. We defined the ‘enables’ and ‘supplies’ relationships, whereas the relation ‘precedes’ is defined by Reactome. The universe for such models is the set of all Reactome events, together with “initial conditions” which provide all of the entities assumed to be axiomatically present.
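Continuing the illustrative encoding used above, a pathway-model can be sketched as a finite structure whose constants are events, whose relations include ‘precedes’, ‘supplies’ and ‘enables’, and whose universe also carries the initial conditions; the exact encoding is an assumption made for illustration.

# Sketch of a finite pathway-model: constants are events, relations include
# 'precedes', 'supplies' and 'enables' (illustrative encoding only).

from dataclasses import dataclass, field
from typing import Set, Tuple

@dataclass
class PathwayModel:
    events: Set[int]                                  # constants: event ids
    initial_conditions: Set[str]                      # entities assumed present
    precedes: Set[Tuple[int, int]] = field(default_factory=set)
    supplies: Set[Tuple[int, int]] = field(default_factory=set)
    enables: Set[Tuple[int, int]] = field(default_factory=set)

# A two-event toy model: event 1 precedes and supplies event 2.
model = PathwayModel(
    events={1, 2},
    initial_conditions={"glucose [cytosol]", "ATP [cytosol]"},
    precedes={(1, 2)},
    supplies={(1, 2)},
)
print(model)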


A logic for checking Pathway-models

Model-checking requires two inputs: a model and a formula. Given a language L, a model M and an L-formula φ, solving the model-checking problem involves deciding whether φ is true in M. Given a model M and a formula φ(x) with free variables x, solving the query evaluation problem involves computing the relation defined by φ on M, which is the set φ^M := {a ∈ M : M ⊨ φ(a)}. Finding all pathway-models that exhibit a specific property, such as being complete, is a query-evaluation problem. Query evaluation is closely related to the model-checking problem and can be converted to a set of model-checking problems.

To construct a logical framework for checking pathway-models, we formalize an analogy between truth and the presence of an entity. We then apply a logic capable of deciding truth values in order to verify which compounds could be present at which times. In traditional logic, we evaluate ‘truth’ using deduction rules such as modus ponens, given as

    φ,  φ ⇒ ψ
    ----------
        ψ

which says that given formulas φ and φ ⇒ ψ, we can deduce ψ. We have defined a set of rules analogous to the deduction rules of traditional logic. This new set of rules, which we call verification rules, takes the place of the standard deduction rules in the Reactome logic. We defined such rules for supply, enabling and other relations, based on their definitions (described in the next section), in order to evaluate whether or not a given entity can be present given the reactions that have happened so far. For example, we have defined the following set of verification rules for verifying supply in Reactome.

We introduce the syntax I →[R, C] O to describe an event, which represents a reaction R acting on a set of inputs I, leading to the generation of a set of outputs O with the aid of catalysts C.

    1.   I →[R, C] O,   present(I),   present(C)
         ----------------------------------------
                       present(O)

    2.   I →[R1, C1] O,   O →[R2, C2] O2,   precedes(R1, R2)
         -----------------------------------------------------
                     I →[Rnew, C1 ∪ C2] O2

Thus, we verify that the outputs of a reaction are present only if all of the reaction's required inputs and catalysts are present, established by preceding reactions or assumed (axiomatically) as initial conditions.

We define a verification as a finite sequence φ0, φ1, …, φn such that each φi is obtained from an axiom, or is obtained from φ0, …, φi-1 through the use of a verification rule. Establishing verification rules as described above allows us to check pathway-models to determine whether or not they satisfy the desired properties for trustworthiness. For example, it is desirable that all of the agents required in a certain pathway be supplied either as initial conditions to that pathway or by events upstream.
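A minimal sketch of the supply verification idea: outputs are marked present only when a reaction's inputs and catalysts are already present, either from the initial conditions or from previously verified reactions. This is an illustration of the rule, not the scripts actually used for the tests.

# Sketch of the supply verification rule: walk reactions in precedence order and
# mark outputs present only when inputs and catalysts are already present.

from typing import Iterable, Set, Tuple

def verify_supply(reactions: Iterable[Tuple[int, Set[str], Set[str], Set[str]]],
                  initial_conditions: Set[str]):
    """Each reaction is (id, inputs, catalysts, outputs), listed in precedence order.
    Returns (verified ids, violating ids)."""
    present = set(initial_conditions)
    verified, violations = [], []
    for rid, inputs, catalysts, outputs in reactions:
        if inputs <= present and catalysts <= present:
            present |= outputs          # rule 1: present(I), present(C) give present(O)
            verified.append(rid)
        else:
            violations.append(rid)      # supply (or enabling) is not established
    return verified, violations

reactions = [
    (1, {"A"}, set(), {"B"}),
    (2, {"B"}, {"E"}, {"C"}),   # catalyst E is never supplied or assumed
    (3, {"C"}, set(), {"D"}),   # fails because reaction 2 could not be verified
]
print(verify_supply(reactions, initial_conditions={"A"}))
# -> ([1], [2, 3])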

Results

In Reactome, an event is any biological process in which input entities are converted to output entities in one or more steps. Reactome events thus include both reactions and pathways. To alleviate possible ambiguities, we restrict the use of the term event to events that are concrete reactions and use the term pathway for a concrete pathway.

Event relationships in Reactome

The base event-level relationship in Reactome is temporal precedence, which is stored directly for each reaction. Below we define additional event-level relations that allow us to define testable properties at the pathway level.

• We consider a Reactome event e_n to be directly supplied by event e_(n-1) if the inputs of e_n are included among the outputs of e_(n-1), and consider it directly enabled by event e_(n-1) if all of the catalysts of e_n are found among the outputs of e_(n-1).

• We consider a Reactome event e_n to be indirectly supplied by events e_1 … e_(n-1) if the inputs of e_n are contained in the union of the outputs of the other events, and consider it partially supplied by event e_1 if at least one of the inputs of e_n is present among the outputs of e_1. According to the Reactome framework, partially supplying the following event is a requirement for an event to precede another event.

• We consider a Reactome event e_n to be indirectly enabled by events e_1 … e_(n-1) if the catalysts of e_n are contained in the union of the outputs of the other events, and partially enabled by event e_1 if at least one of the catalysts of e_n is present among the outputs of e_1.

• We say that a set of Reactome events E1 prepares a set of Reactome events E2 if, given E1, every event in E2 is both enabled and supplied, either from E1 or from entities axiomatically assumed to be present.
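A small sketch of these relations as predicates over simplified reaction records (dicts with 'inputs', 'catalysts' and 'outputs' sets); illustrative only.

# Sketch: event-level relations over simplified reactions represented as dicts.

def directly_supplied(e_prev: dict, e_next: dict) -> bool:
    return e_next["inputs"] <= e_prev["outputs"]

def directly_enabled(e_prev: dict, e_next: dict) -> bool:
    return e_next["catalysts"] <= e_prev["outputs"]

def partially_supplied(e_prev: dict, e_next: dict) -> bool:
    return bool(e_next["inputs"] & e_prev["outputs"])

def prepares(earlier: list, later: list, axioms: set) -> bool:
    """True if every later event is both supplied and enabled, given the earlier
    events' outputs plus axiomatically present entities."""
    available = set(axioms)
    for e in earlier:
        available |= e["outputs"]
    return all(e["inputs"] <= available and e["catalysts"] <= available
               for e in later)

e1 = {"inputs": {"A"}, "catalysts": set(), "outputs": {"B", "C"}}
e2 = {"inputs": {"B"}, "catalysts": {"C"}, "outputs": {"D"}}
print(directly_supplied(e1, e2), directly_enabled(e1, e2),
      partially_supplied(e1, e2), prepares([e1], [e2], axioms={"A"}))
# -> True True True True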

Testable properties for pathways

1. We say that a sequence of events in a pathway is well-formed if the (direct) precedence relation is a superset of the (direct) supply relation over the set of events in the sequence.

2. We say that a sequence of events in a pathway is circular if there exist cycles in the precedence order that are not also defined to be cycles by the pathway.

3. We say a pathway is complete if every event in the pathway is either supplied by a preceding event or prepared by the axioms.

4. We say that an acyclic pathway is inconsistent if an event comes before another event in the pathway order, but the opposite is true in the precedence order.


5. We say that a Reactome pathway is verbose if there exist events a and c such that (a ≺ c) in the precedence order, and there exist “extra” events e_i, i ≥ 1, such that (a ≺ e_1), (e_1 ≺ e_2), …, (e_i ≺ c) in the Reactome pathway ordering.

6. We say that a pathway is terse if there exist events a and c such that (a ≺ c) in the pathway ordering, and there exist “extra” events e_i, i ≥ 1, such that (a ≺ e_1), (e_1 ≺ e_2), …, (e_i ≺ c) in the precedence order.

7. We say that a pathway is gapped if there exist events in the pathway for which supply is violated, and there exist preceding events in the database which are capable of supplying those pathway events.
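The sketch below shows how two of these properties, completeness and gappedness, could be checked over simplified event records; the actual tests ran against Reactome's database and incorporated the assumptions described in the following sections.

# Sketch: completeness and gap tests over simplified events (illustrative only).
# Each event is (id, inputs, outputs); `pathway` lists the events in pathway order,
# `database` holds all events known to the knowledgebase.

def incomplete_events(pathway, axioms):
    """Events whose inputs are not covered by the axioms or earlier pathway events."""
    available, bad = set(axioms), []
    for eid, inputs, outputs in pathway:
        if not inputs <= available:
            bad.append(eid)
        available |= outputs
    return bad

def gapped(pathway, database, axioms):
    """A pathway is gapped if some event's supply is violated, but an event outside
    the pathway could provide the missing inputs."""
    pathway_ids = {eid for eid, _, _ in pathway}
    available, gaps = set(axioms), []
    for eid, inputs, outputs in pathway:
        missing = inputs - available
        if missing:
            donors = [d for d, _, douts in database
                      if d not in pathway_ids and missing <= douts]
            if donors:
                gaps.append((eid, donors))
        available |= outputs
    return gaps

pathway = [(1, {"A"}, {"B"}), (2, {"X"}, {"C"})]      # event 2 needs X
database = pathway + [(99, {"B"}, {"X"})]              # event 99 could supply X
print(incomplete_events(pathway, axioms={"A"}))        # -> [2]
print(gapped(pathway, database, axioms={"A"}))         # -> [(2, [99])]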

Desired Properties for Trustworthiness

In order for Reactome to become a reliable resource of core pathways and reactions in human biology, it is desirable that it satisfy the completeness, consistency and gap-free properties:

• Each pathway in the Reactome database should be complete.
• None of the pathways in Reactome should be inconsistent.
• None of the pathways in Reactome should be gapped pathways.

Several additional properties that would increase the utility of Reactome are the directness, well-formedness and acircularity properties:

• As many pathways as possible should be adjusted so that they are neither verbose nor terse.
• Every sequence of events in Reactome should be a well-formed sequence.
• There should not be any cyclical sequences in the database that are not also annotated to be cyclical pathways.


Proofreading Reactome

We developed scripts (available for download at www.hybrow.org/Reactome) to test the properties for trustworthiness on the concrete human pathways stored in Reactome. We focused exclusively on the concrete pathways because the properties we have defined and the results of our analyses are easily interpreted for concrete pathways. This section summarizes the results of these tests performed on recent releases of Reactome.

Data format integrity

Overall, Reactome adheres to its chosen data format. However, it appears to digress from it to some extent, as follows: generic entities are allowed to interact with concrete physical entities in generic pathways. This leads to peculiarities such as a generic event preceding a concrete event e, which precedes events that are instances of the generic event preceding e.

Supply and Enabling

Direct supply is often violated, probably due to missing events and the incomplete state of current experimental knowledge. The concept of enabling is not yet tracked by Reactome, because no events specifically modify or create catalysts for other events in a pathway. However there are events that create physical entities which act as catalysts in other pathways.

Completeness

Since we found that Reactome does not consider enabling, we included an assumption of universal enablement. (If we had not done so, the completeness criterion would be violated everywhere.) Reactome also does not track the creation and use of concrete simple entities, so we had to assume the availability of such entities to avoid systemic judgments of incompleteness. We also assumed the presence of the input reactants to the first step in each pathway. After incorporating these assumptions, we were still able to locate several pathways which are not complete. One such pathway is the inosine formation pathway.

The inosine formation pathway (id 74236) has two reactions (ids 74235 and 73822): adenosine + H2O → inosine + NH3, and inosine 5'-monophosphate + H2O → inosine + orthophosphate. The events that precede 73822 are 73797 and 76590. However, these events are not part of the inosine formation pathway, and the event that is listed in the pathway does not supply inputs to event 73822. Thus, the inosine formation pathway is not yet complete.

Overall, we found that 8 of the 55 pathways in Reactome were incomplete, and that there were 14 events in those 8 pathways that were the cause of the judgments of incompleteness. Since there are 742 events in these 55 pathways, these 14 events constitute 2% of all events.

Consistency

Our tests identified a number of inconsistencies in Reactome. One example is the ornithine metabolism pathway.

The ornithine metabolism pathway (id 70693) has 9 events. Of these, event 70577 (ATP + aspartate + citrulline ↔ argininosuccinate + AMP + pyrophosphate) is listed directly before 70560 (carbamoyl phosphate + ornithine → citrulline + orthophosphate) in the pathway ordering. However, the preceding event of 70577 is 70560 according to the preceding event relationship specified in Reactome. This assertion is therefore inconsistent with the pathway ordering.

Eight of the 55 pathways had consistency problems and two of these are incomplete pathways. There were 23 reactions that were the cause of these inconsistencies and these 23 reactions comprise about 3% of all reactions.

Gap-free Pathways

There were very few gapped pathways in Reactome. Of the 55 concrete human pathways, only 4 were found to have gaps. There were 5 events in release 10 (0.6% of all reactions) which could supply the necessary patches to seal the gaps.

Consider the “glycogen breakdown in liver cells” pathway (id 71598), which has 7 events. Event 71593 in the pathway requires {(1,6)-alpha-glucosyl}poly{(1,4)-alpha-glucosyl}glycogenin-2 as input, which is not provided by any of the previous events in the pathway. However, it can be supplied by event 71594, which is listed as its preceding event but is not considered part of the pathway.

Directness (terseness and verbosity)

Our test for terseness revealed that only 3 of the 55 pathways were too terse and in each case, there was only one occurrence of terseness.

Again, consider the “glycogen breakdown in liver cells” pathway (id 71598). There are two events between event 71591 and 71593 according to the preceding event relationship.

On testing for verbosity, 18 of the 55 pathways in Reactome release 10 were found to be verbose.

Consider the “xanthine formation” pathway (id 74257): events 74249 and 75251 directly precede event 74255 according to the preceding event relationship. However, the xanthine formation pathway lists 4 events (74248, 74249, 74250, 74251), grouped in the guanine formation pathway (id 74252), as occurring either in parallel or before event 74255.

Verbosity does not appear to be a serious problem, because excessive verbosity indicates that descriptions of pathways by domain experts can include extra steps for which evidence of the exact sequence of events is not explicitly specified or known. Although not optimal, this situation is not surprising for a biological process still under investigation.

Well-Formedness

On testing for well-formedness between consecutive events in pathways, we found that 21 of the 55 Reactome pathways were completely well-formed and 25 pathways presented 80% or more of their constituent events in well-formed sequences (80% well-formed). Table 1 shows the number of pathways at each level of well-formedness. Overall, 36 of the 55 pathways were at least 50% well-formed.

Consider the Pyrimidine biosynthesis pathway (id 73648). Event 73629 produces all the required inputs for event 75124 and hence supplies event 75124. However, it is not considered its preceding event, violating well-formedness.

Acircularity (Self-loops)

We found no events for which the precedence relation is reflexive.

Changes in Release 11

These tests represent a snapshot of Reactome release 10. However, Reactome itself is constantly changing and evolving. This raises the possibility of studying how pathways evolve over time as curation progresses. As a first step in this direction, we compared our test results for the two Reactome releases available to us. Upon repeating the tests on release 11, we found that 14 pathways were incomplete (36 events) and 9 pathways had inconsistencies (24 events). Given that there are 953 events in the 65 pathways in release 11, this represents 3.7% incompleteness and 2.5% inconsistencies. Inconsistencies have decreased compared to the previous release; however, the incompleteness percentage has increased slightly. There are still 5 pathways with gaps (6 events), 21 verbose pathways (57 events) and 3 terse pathways (3 events), corresponding to 0.6% gaps, 5.9% verbosity and 0.3% terseness. Of the 65 concrete human pathways in release 11, 30 are 80% well-formed and 43 are more than 50% well-formed. There are no pathways with self-loops. Table 2 shows the numbers of well-formed pathways at various cutoffs. Table 3 shows the comparison of release 10 with release 11 in terms of the properties tested.

Percentage well-formed    Release 10    Release 11
100                       21 [38%]      26 [40%]
90                        22 [40%]      27 [42%]
80                        25 [45%]      30 [46%]
70                        27 [49%]      33 [50%]
60                        33 [60%]      40 [62%]
50                        36 [65%]      43 [66%]
40                        37 [67%]      44 [68%]

Table 2 Numbers of well-formed pathways at different levels of well-formedness for the latest releases. The numbers in the brackets are the percentages of total pathways. Release 10 contains 55 pathways and release 11 contains 65 pathways.

Property                      Release 10    Release 11
Total pathways                55            65
Complete pathways             47            51
Consistent pathways           47            54
Gap-free pathways             51            60
Non-verbose pathways          37            44
Non-terse pathways            52            62
90% well-formed pathways      21            27
Self-loops                    0             0

Table 3 Property comparison for the latest releases of Reactome. The table shows the number of pathways that satisfy each property tested.

Summary

There are increasing efforts to create formal descriptions of biological processes to allow their representation at higher levels of abstraction. These include the use of workflows[27], semantic networks[94] and compound graphs[84]. Simultaneously, knowledgebases are being developed that provide such formal descriptions of pathways[33, 81, 95]. The pathway resource list (PRL) at www.cbio.mskcc.org/prl currently lists 167 such resources. As a consequence, there is a need for a methodology to assess both their quality and their compatibility with alternative formal descriptions of biological processes[2].

We formulated a set of tests for evaluating the trustworthiness of a knowledgebase and applied them to two consecutive releases of the Reactome knowledgebase. Such tests allow monitoring of the quality of a knowledgebase during its development and subsequent growth. Basing the tests on a formal theory makes it possible to predict and systematically check for the types of errors that can occur in a knowledgebase, as well as to examine its compatibility with a formal representation of biological processes.


Chapter 9: Summary and future directions

The rapid growth of life sciences research and the associated literature over the past decade, the rapid expansion of biological databases, and the invention of high-throughput techniques that permit the collection of data on many genes and proteins simultaneously have created an acute need for new computational tools to support the biologist. Interpretation of data sets, usually by integrating different data into one biological hypothesis, is essential to gain new insights into molecular processes, and there is increasing awareness that formal representations are required for this task.

The prototype implementation of HyBrow for the GAL system demonstrates that structuring data in a formal representation framework facilitates information integration and allows computer-aided hypothesis evaluation. HyBrow can accommodate both more data and more types of data as they become available. Moreover, its constraints can be elaborated as understanding of the biological system grows. I believe that the approach we have developed in constructing the HyBrow prototype can significantly inform experimentation by facilitating the integration of large amounts of information. However, capturing and representing existing data in a formal representation is tedious, time consuming, difficult and expensive, because of the complexity of the data and the continuous discovery of new relationships between aspects of the data. The manual approaches that we adopted for deploying our prototype will not scale adequately for organism-level deployment. Therefore, at the moment, knowledgebases such as Reactome, which structure existing knowledge about biological pathways, and model organism databases, which provide access to a large variety of genomic data, are the best available resources for tools like HyBrow.

Even for such knowledgebases and model organism databases, the task of defining the right representation for the data remains a constant challenge, and currently the most researched solution is the development of adaptors that will convert current unstructured information into the desired structured knowledgebase[96, 97]. However, if we are to use existing knowledgebases as data sources for tools such as HyBrow, we need to assess their trustworthiness and expressiveness. We have generalized the conceptual framework underlying HyBrow and have successfully used it to evaluate the trustworthiness of the Reactome knowledgebase.

Future directions

Short-term goals

Along with trustworthiness, a somewhat more subtle characteristic that determines the usefulness of a knowledgebase is its expressiveness, which is determined by the complexity and sophistication of the queries a knowledgebase can support. During the process of evaluating the trustworthiness of Reactome, it became evident that in most cases HyBrow's subject parallels Reactome's catalyst and its object parallels Reactome's input, although in a reaction like binding, the subject and object are both inputs. However, there are no explicit types of verbs for classifying events, such as a phosphorylation event or a binding event. Categorizing the current list of events by “verb types” is necessary for formulating and testing causal hypotheses about both events and pathways. Specifying similarity relations between groups of Reactome events can further increase expressiveness to support HyBrow's concept of “event neighborhoods” and allow principled specification of hypothesis variants as well as comparison of hypotheses across related organisms. In the future we will generalize our results to provide guidelines for evaluating the trustworthiness of a knowledgebase, as well as formalize methods for comparing its expressiveness with a given formal representation of biological processes. We will derive a minimal set of requirements for a knowledgebase to be useful for computer-aided inference. We will also formulate a minimal set of desirable properties of a formal representation in order for it to be useful for comparing biological processes across organisms and knowledgebases.

Long-term goals

Our eventual goal is to build the knowledgebase infrastructure required for supporting machine-aided hypothesis evaluation for Arabidopsis thaliana. We will use the BioLingua knowledge representation and manipulation environment[98] to construct and store this new knowledgebase, because BioLingua can readily import data from TAIR and it already incorporates the gene ontology information for Arabidopsis and the AraCyc pathway knowledgebase[99]. We will augment this knowledgebase with information extracted from the literature abstracts available through the PubSearch system. In order to move beyond our HyBrow prototype, we need to gather all the components in one development environment. Besides storing the knowledgebase, BioLingua can also directly support our rules library. Therefore, we will deploy HyBrow in the BioLingua environment. Thus, virtually everything except the graphical interface can be built in the BioLingua environment. We will also create a searchable community hypothesis archive, which will allow users to construct, evaluate, store and search hypotheses in a formal representation, avoiding the ambiguities of natural language and ad hoc diagrammatic representations.


References



Appendix A – Formal specification of the hypothesis grammar

The grammar is presented below in Backus-Naur Form (BNF) syntax.6

hypothesis  : eventstream ;

eventstream : event
            | event STREAM_OP event
            | eventstream LOGIC_OP eventstream
            | eventstream STREAM_OP event
            | LPAREN eventstream RPAREN ;

event       : EVENT_NAME
            | EVENT_NAME EQUALS event
            | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT
            | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR AGENT ASSOC_OP
            | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR PERT_CONT
            | AGENT AGENT_OP AGENT SYNTAX_SUGAR PHYS_CONT SYNTAX_SUGAR PERT_CONT SYNTAX_SUGAR AGENT ASSOC_OP ;

6 BNF is one of the most commonly used notations for specifying the syntax of programming languages, command sets and formal grammars. It is widely used for language descriptions but is rarely documented formally, so it is usually learned from examples of its use.
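To make the structure of the grammar concrete, the sketch below encodes a subset of it (the first three event productions) for the open-source Lark parsing library in Python. This is an illustration only, not the HyBrow implementation: the concrete token spellings ("then", "binds", "in", and so on) and the example hypothesis string are assumptions introduced for this sketch, since the terminal vocabularies are defined elsewhere in the thesis.

# Sketch only: a subset of the Appendix A hypothesis grammar written for the
# Lark parsing library.  Token spellings below are illustrative assumptions,
# not the tokens actually used by HyBrow.
from lark import Lark

GRAMMAR = r"""
hypothesis  : eventstream
eventstream : event
            | event STREAM_OP event
            | eventstream LOGIC_OP eventstream
            | eventstream STREAM_OP event
            | LPAREN eventstream RPAREN
event       : event_name
            | event_name EQUALS event
            | agent AGENT_OP agent SYNTAX_SUGAR phys_cont
event_name  : NAME
agent       : NAME
phys_cont   : NAME

STREAM_OP    : "then"                     // assumed spelling
LOGIC_OP     : "and" | "or"               // assumed spellings
AGENT_OP     : "binds" | "activates"      // assumed spellings
SYNTAX_SUGAR : "in"                       // assumed spelling
EQUALS       : "="
LPAREN       : "("
RPAREN       : ")"
NAME         : /[A-Za-z][A-Za-z0-9_]*/
%import common.WS
%ignore WS
"""

# The Earley parser with a dynamic lexer tolerates the grammar's ambiguity
# (e.g. the overlapping STREAM_OP productions) and the overlap between the
# keyword terminals and NAME.
parser = Lark(GRAMMAR, start="hypothesis", parser="earley", lexer="dynamic")

if __name__ == "__main__":
    # A toy hypothesis in the assumed concrete syntax.
    tree = parser.parse(
        "Gal3p binds Gal80p in cytoplasm then Gal4p activates GAL1 in nucleus")
    print(tree.pretty())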

Appendix B – Using the GUI

Each step below is accompanied by a screen shot.

From the start screen, press f to enter file mode, then right-click to bring up the popup menu and select "open". Then select "GALsample.xml", or select "new" to start with a blank hypothesis.

Press c to enter context placement mode: left-click to set the upper-left corner of the context box; right-click to set the lower-right corner of the context box and bring up the popup menu; then select a context type from the popup menu.


Press a to enter actor placement mode: right-click to set the location of the actor and bring up the popup menu. To place a small signaling molecule, select it from the popup menu. To place a gene or protein, first left-click on the gene or protein list, then select the gene or protein by clicking on the appropriate entry in the list.

Press r to enter relation specification mode (a dotted grey line will indicate the endpoints of the relation): left-click to set the starting point and right-click to set the ending point of the relation and bring up the popup menu, then select the desired relation name from the list in the popup menu.


Press f to enter file manipulation mode, then right-click to bring up the popup menu and select "transcribe." The hypothesis diagram will appear on a black background. Selecting a verb from the blue list will highlight the actors and contexts that complete the "sentence" containing this verb. Click on the subject (left column), object (immediate right column), or context (far right column) list to correct any errors. Repeat for all verbs. Click on one of the buttons in the lower right corner to transcribe the hypothesis to a file or to the screen.

This is what a hypothesis will look like when transcribed to the screen.

Transcribing to file will prompt you for a location and name for the file. Save the file and upload it at hybrow.org.

Upload the saved file at the indicated location on the HyBrow query page.

Vita

Nigam H Shah

EDUCATION

2005 Ph.D. Integrative Biosciences (IBIOS) The Pennsylvania State University, University Park, PA

1999 M.B.B.S. (Bachelor of Medicine and Bachelor of Surgery) Baroda Medical College, M.S. University Baroda, India

WORK EXPERIENCE

2003 / 02 Research Intern, Functional Genomics and Systems Biology Group, IBM T. J. Watson Research Center.

1999 - 2000 Medical Internship, three months each in the Departments of Internal Medicine, Surgery, Obstetrics and Gynecology, and Public Health.

TEACHING EXPERIENCE

2004 / 03 Lead instructor for Microarray Data Analysis methods in the Penn State Bioinformatics workshop.

Spring 2001 Laboratory in Mammalian Physiology (Biology 473). Taught basic human physiology to 34 pre-med students.

SELECTED RESEARCH PUBLICATIONS

Racunas SA*, Shah NH*, Albert I and Fedoroff NV. HyBrow: A system for computer-aided hypothesis testing. Bioinformatics, 2004; 20 (Supplement): i257-i264. (* equal authors)

Shah NH and Fedoroff NV. CLENCH: A Program for Calculating Cluster Enrichment using the Gene Ontology. Bioinformatics, 2004 May 1; 20(7): 1196-7.

Racunas SA, Shah NH and Fedoroff NV. A Contradiction-Based Framework for Testing Gene Regulation Hypotheses. Proceedings of the IEEE Computer Society CSB Conference, Aug 2003, pp. 634-638.

Shah NH, King DC, Shah PN and Fedoroff NV. A Tool-kit for cDNA microarray and promoter analysis. Bioinformatics, 2003 Sep 22; 19(14): 1846-8.