
List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Spjuth, O., Helmus, T., Willighagen, E.L., Kuhn, S., Eklund, M., Wagener, J., Murray-Rust, P., Steinbeck, C., and Wikberg, J.E.S. Bioclipse: An open source workbench for chemo- and bioinformatics. BMC Bioinformatics 2007, 8:59.

II Wagener, J., Spjuth, O., Willighagen, E.L., and Wikberg, J.E.S. XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous Web services. BMC Bioinformatics 2009, 10:279.

III Carlsson, L., Spjuth, O., Adams, S., Glen, R.C., and Boyer, S. Use of historic metabolic biotransformation data as a means of anticipating metabolic sites using MetaPrint2D and Bioclipse. Submitted.

IV Spjuth, O., Willighagen, E.L., Guha, R., Eklund, M., and Wikberg, J.E.S. Towards interoperable and reproducible QSAR analyses: Exchange of data sets. Submitted.

V Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E.L., Steinbeck, C., and Wikberg, J.E.S. Bioclipse 2: A scriptable integration platform for the life sciences. Submitted.

Additional Papers

• Wikberg, J.E.S., Spjuth, O., Eklund, M., and Lapins, M. Chemoinformatics taking Biology into Account: Proteochemometrics. Computational Approaches in Cheminformatics and Bioinformatics. Wiley, New York, 2009.
• Eklund, M., Spjuth, O., and Wikberg, J.E.S. An eScience-Bayes strategy for analyzing omics data. Submitted.
• Junaid, M., Lapins, M., Eklund, M., Spjuth, O., and Wikberg, J.E.S. Proteochemometric modeling of the susceptibility of mutated variants of HIV-1 virus to nucleoside/nucleotide analog reverse transcriptase inhibitors. Submitted.
• Lapins, M., Eklund, M., Spjuth, O., Prusis, P., and Wikberg, J.E.S. Proteochemometric modeling of HIV protease susceptibility. BMC Bioinformatics 2008, 10:181.
• Eklund, M., Spjuth, O., and Wikberg, J.E.S. The C1C2: A framework for simultaneous model selection and assessment. BMC Bioinformatics 2008, 9:360.
• Ameur, A., Yankovski, V., Enroth, S., Spjuth, O., and Komorowski, J. The LCB Data Warehouse. Bioinformatics 2006, 22(8):1024-6.
• Spjuth, O., Eklund, M., and Wikberg, J.E.S. An Information System for Proteochemometrics. CDK News, Vol. 2/2, 2005.

Contents

1 Introduction
  1.1 Bioinformatics
  1.2 Cheminformatics
  1.3 Drug discovery
    1.3.1 Computer-aided drug discovery
  1.4 Knowledge Management in the life sciences
  1.5 Integrated science
  1.6 eScience
    1.6.1 Service-orientation
    1.6.2 High Performance Computing
    1.6.3 Open science
2 Aims of studies
3 Methods
  3.1 Software integration
    3.1.1 Integration of local software
    3.1.2 Integration of services
    3.1.3 Rich Clients
  3.2 Data integration
    3.2.1 Standards
    3.2.2 Exchange formats
    3.2.3 Ontologies
    3.2.4 Semantic technologies
    3.2.5 Model-driven development
  3.3 Drug Discovery
    3.3.1 Structure-Activity Relationship
    3.3.2 Circular fingerprints
4 Results and discussion
  4.1 A workbench for the life sciences (Paper I and V)
  4.2 Discoverable and asynchronous services (Paper II)
  4.3 Near-real time metabolic site predictions (Paper III)
  4.4 A new standard for QSAR (Paper IV)
  4.5 A scripting language for the life sciences (Paper V)
5 Concluding remarks
6 Acknowledgements
Bibliography

1. Introduction

The sequencing of the human genome [1, 2] was the beginning of a new era in the life sciences, an era characterized by an enormous increase in the amounts of data produced. New high throughput techniques have emerged, capable of generating data at a rate that was unthinkable only a few years ago. This summer the Wellcome Trust Sanger Institute surpassed 10 Terabases of cumulative genome sequence data, and has during the last year doubled its rate to now generate 400 Gigabases of sequence data per week (the equivalent of about seven human genomes). Massively parallel sequencing technologies [3] (often called "next-generation" sequencing) are the most voluminous data producers, contributing about a Terabyte a month to public archives [4]. Apart from sequencing, it is today possible to obtain data regarding gene expression, mutations, small molecules, and proteins on a large scale using array techniques with tens of thousands of probes. Another example is the measuring of drug-target interactions in ultra high throughput screening laboratories, capable of producing over 100,000 data points per instrument and day.

The data explosion has forced companies and genome centers to equip themselves with Petabytes of storage and thousands of computer cores to process the flow of incoming data. The speed of data production is constantly increasing, and companies like Complete Genomics envision sequencing 10,000 genomes in 2010 [5], while the international collaboration of the 1000 Genomes Project [6] is not yet completed.

The new techniques open up many possibilities. One example is in diagnostics, where diseases can be identified for example by measuring the expression of a selected set of genes or proteins (biomarkers). Other examples include personalized treatments [7], identification of new drug targets [8], insights into the variation between individuals with regard to diseases [9], and rational ways of designing new drugs [10].

Life science has thus in the last decade become a data-intensive field, and large international organizations use the Internet as a collaborative platform and to make data publicly available. It is nowadays a common task for biologists to interact with several of the hundreds of publicly available repositories [11]. There is also an increasing diversity in the type of data stored, for example genomic sequences, protein structures, mutations, interactions, expression levels, and pathways. Researchers are required to handle and analyze these immense data amounts in order to convert them into accessible knowledge, which has made the life sciences highly dependent on fields such as knowledge management and information technology.



Figure 1.1: a) The central dogma of molecular biology. DNA is transcribed to RNA, which in turn is translated to Proteins that carry out functions. It is the goal of bioinformatics to handle and analyze information from these and related topics using information technology. b) Multiple sequence alignment is a technique to compare several sequences, such as DNA, RNA, or protein. The process aims at finding the result with the best overall similarity between the sequences. Image generated with ClustalX [12, 13]. c) Phylogenetic analysis can be used to study evolutionary relatedness by sequence similarity, here presented as a dendrogram. The two most similar sequences are joined in a cluster, and the process is repeated until all clusters are joined. Image from [14].

1.1 Bioinformatics

Bioinformatics is the application of information technology to the field of molecular biology, with the goal of handling, analyzing, and visualizing information and data from the central dogma; the transcription of DNA to mRNA and the translation of mRNA to proteins (see Figure 1.1(a)), and related fields. This encompasses such various tasks as the management and provisioning of biological data, algorithms and tools to aid in problem solving, and analyses to answer biological questions. Over the past few decades, major advances in the field of molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. It is the quest of bioinformatics to create the infrastructure, technologies, and resources that enable analysis and reliable storage of these large data quantities.

Examples of major research areas in bioinformatics include comparative sequence analysis, gene finding, protein structure modeling and prediction, gene expression analyses, protein-protein interactions, genome-wide association studies, and the modeling of evolution (Figure 1.1(c)). To attack these problems it is becoming increasingly common to make use of computationally intensive techniques, including advanced statistical analysis (such as machine learning, regression analysis, pattern recognition, and data mining) and high-end visualizations.

Figure 1.2: Examples of applications in cheminformatics; a) Clustering of chemical structures showing the distribution of activity for a set of compounds. Symbols are color-coded by structural class and symbol sizes are proportional to the negative common logarithm of the potency. Image from [24]. b) Visualization of a drug-protein interaction; shown here is the ATP binding site of the protein CDK2 with the required locations for hydrogen bond-donating atoms (marked with blue-dotted spheres) and a hydrogen bond-accepting atom (marked with red-dotted spheres). Image from [25].

Bioinformatics is a comprehensive research field in the life sciences, as information technology is more and more integrated into everyday research. One of the most commonly used methods is multiple sequence alignment (Figure 1.1(b)), which can be used to compare two or more sequences, search for similar ones, and compute the consensus or distance between them. Another example is de novo prediction of protein structures from sequence; while this is not yet feasible for other than small structures, homology modeling can be used to predict the structure of e.g. unknown proteins based on existing structures of another protein with a shared evolutionary history (protein sequences evolve more quickly than protein structures).

Data provisioning in bioinformatics is commonly done over the Internet. The major genome databases GenBank [15], EMBL [16], and DDBJ [17] form the backbone of this infrastructure by providing the nucleotide sequences of sequenced species. Examples of protein databases are the Universal Protein Resource (UniProt) [18], the National Center for Biotechnology Information (NCBI) Protein database [19], and the Protein Data Bank (PDB) [20]. Many other databases provide specialized information, such as KEGG [21] and Reactome [22], which provide information about biological pathways, and ArrayExpress [23], which is a public repository for gene expression data.
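To make the sequence comparison described above concrete, the following is a minimal sketch (not code from any of the cited tools) of the dynamic-programming scoring used in global pairwise alignment in the style of Needleman-Wunsch; progressive multiple alignment programs such as ClustalX build on repeated pairwise comparisons of this kind. The match, mismatch, and gap scores are illustrative assumptions only.

```java
// Minimal sketch: global pairwise alignment score (Needleman-Wunsch style).
// Scoring scheme is illustrative, not taken from any published tool.
public class GlobalAlignmentScore {
    static final int MATCH = 1, MISMATCH = -1, GAP = -2;

    static int score(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) dp[i][0] = i * GAP;  // a against leading gaps
        for (int j = 1; j <= b.length(); j++) dp[0][j] = j * GAP;  // b against leading gaps
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = dp[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up   = dp[i - 1][j] + GAP;   // gap introduced in sequence b
                int left = dp[i][j - 1] + GAP;   // gap introduced in sequence a
                dp[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // Best achievable global alignment score for two short example sequences.
        System.out.println(score("GATTACA", "GCATGCA"));
    }
}
```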

1.2 Cheminformatics

Cheminformatics (also known as chemoinformatics or chemical informatics) focuses on the storage and analysis of chemical structures. The field deals both with small molecules, such as metabolites and drugs, and with larger structures such as proteins. Since biological processes at a low level are chemical interactions, the field sometimes overlaps with bioinformatics, and the term biocheminformatics has been used to refer to work integrating the domains. Common applications in cheminformatics include storage of chemical data in repositories (databases), the mining and screening of virtual molecular libraries (Figure 1.2(a)), and the modeling of chemical interactions in a target (Figure 1.2(b)). Most drug discovery applications also fall into the field of cheminformatics. Examples of public repositories in cheminformatics are PubChem [26], ChemSpider [27], DrugBank [28], ChEMBL [29], and ChEBI [30].

1.3 Drug discovery

Drug discovery is the process by which drugs are discovered or designed. While most drugs historically have been identified as the active ingredient of traditional remedies or by serendipitous discoveries, new approaches to model and control disease at the molecular level deliver more rational approaches.

The drug discovery process generally comprises two major phases, the pre-clinical phase and the clinical phase (see Figure 1.3), with the former comprising testing in the laboratory and the latter testing on humans.

Figure 1.3: The drug discovery process, split into the pre-clinical and the clinical phases. Abbreviations in the pre-clinical phase: TI = Target Identification, LI = Lead Identification, LO = Lead Optimization, PCD = Pre-Clinical Development.

The pre-clinical phase starts with Target Identification, which is the process of identifying a suitable target, e.g. a protein or a gene, which is associated with the disease and has functionality that can be affected by a drug in a positive fashion. Lead Identification (also known as Lead Discovery) is the process of finding chemicals that show a favorable interaction with the target, and which preferably do not interfere with other targets. A common technique for this is High Throughput Screening (HTS), where large libraries of synthetic or natural chemical substances are screened against a target. The outcomes of the Lead Identification phase are called drug leads: compounds with such properties that they are deemed promising enough to take further in the drug discovery process.

Lead Optimization aims at optimizing the physicochemical properties of a drug lead to maximize its chances of success as a drug. This is generally done by varying the compound's chemical structure to increase its activity against the target, reduce its activity against undesired targets, or optimize its metabolic and pharmacokinetic properties. The Preclinical Development phase generally consists of toxicological studies, cell assays, and animal testing.

The second phase of the drug discovery process (see Figure 1.3) is Clinical Development, which in itself consists of three phases. In phase 1 the drug is tested on healthy volunteers, in phase 2 it is tested on people suffering from the disease (often on the order of hundreds) to produce a proof-of-concept, and in phase 3 on large patient populations (often thousands).

The process by which a single new drug is developed into a fully registered product is lengthy and extremely costly. It can take as much as 15 years to complete, and is currently estimated to cost around a billion dollars per drug [31]. A major factor contributing to these costs is that the vast majority of drug candidates fail at some stage during the development process, and the few successful ones must pay for all these failures.

In the last decades, the advent of molecular biology and genomic sciences has largely affected drug discovery. These technologies allow for a deeper understanding of the underlying molecular processes, and have increased the number of targets as well as improved the pre-clinical phases [32].

1.3.1 Computer-aided drug discovery

The recent advances in information technology have greatly affected drug discovery. Computer-aided drug design (CADD), where drugs are rationally designed using computational models, is to an increasing extent complementing traditional chemical research. Virtual Screening is one approach where computational filters are applied to screen virtual libraries of chemical structures, which can increase hit rates and substantially reduce the number of compounds that need to be assessed experimentally. Docking and structure-based methods are also used to design compounds de novo, and to improve lead properties when a target's 3D structure is known.

Successful drug discovery requires enormous investments [33], and making correct, informed decisions as early as possible can avoid costly mistakes later in the development process. To increase productivity and reduce costs there is a desire to streamline the decision process by executing as much of the work as possible in silico [34, 35]. Computational models are becoming increasingly important for decision support, aiding with information such as predicted toxicity and solubility, and suggesting whether or not a certain compound should be discarded due to a high probability of failure at later stages [36]. Such predictive models commonly require integration of diverse and heterogeneous data types in order to build up a knowledge base to generate models from [34]. The increased complexity in the drug research needed to realize this has lately led to institutional changes promoting interdisciplinary research [32].

An important aspect of the drug optimization process is obtaining metabolic stability, so that the candidate drug survives passage over the gut wall and through the liver and reaches the target tissue(s) unmetabolized in sufficient concentrations to achieve a pharmacological effect. Another aspect is to avoid and/or reduce formation of toxic metabolites from the drug candidate. Site-of-metabolism predictions aim at pinpointing the site in a chemical structure which is most likely to undergo metabolization, hence aiding with decision support in the drug optimization process.

Avoiding undesired drug-drug interactions is another important task where CADD plays an increasingly larger role. A drug can, for example, inhibit the metabolism of other drugs, which can have fatal consequences as the concentration of the other drug then might reach toxic levels. While drug metabolism occurs by the action of many metabolizing enzymes, the cytochrome P450 enzymes are particularly important in this context. These enzymes exist in many isoforms in the body, and they are subjected to genetic variations among the populations. Predictive genome-based models for enzyme inhibition by xenobiotics [37] are one method that can probably be entered into drug development pipelines in the not too distant future. Models for the passage of drugs over biological barriers (e.g. drug passage over the blood-brain barrier and gastro-intestinal uptake) are another example.

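As an illustration of the computational filters used in virtual screening, the following is a minimal, hypothetical sketch that retains only compounds passing simple physicochemical cut-offs (here Lipinski's rule of five). The compound records and threshold values are invented for the example; real decision-support pipelines combine far richer models such as docking scores, QSAR predictions, and toxicity alerts.

```java
// Minimal sketch of a property-based virtual screening filter.
// Descriptor values are assumed to be precomputed; all data are invented.
import java.util.List;

public class ScreeningFilter {
    record Compound(String id, double molWeight, double logP,
                    int hBondDonors, int hBondAcceptors) {}

    // Lipinski rule-of-five style thresholds, used here purely for illustration.
    static boolean passesRuleOfFive(Compound c) {
        return c.molWeight() <= 500
            && c.logP() <= 5
            && c.hBondDonors() <= 5
            && c.hBondAcceptors() <= 10;
    }

    public static void main(String[] args) {
        List<Compound> library = List.of(
            new Compound("CPD-1", 342.4, 2.1, 2, 5),
            new Compound("CPD-2", 712.9, 6.3, 4, 12));

        library.stream()
               .filter(ScreeningFilter::passesRuleOfFive)
               .forEach(c -> System.out.println(c.id() + " kept for experimental assay"));
    }
}
```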
1.4 Knowledge Management in the life sciences

It is important not to confuse data with knowledge: the former is merely a collection of facts that require interpretation in order to be converted into knowledge. The same set of data may afford alternative interpretations, and the concept of "provenance" denotes the history of how knowledge came to be. Information is somewhat loosely defined as 'useful data', with the potential to become knowledge. The three concepts together form a paradigm of gradual understanding: data - information - knowledge (Figure 1.4), and are key parts of the field of Knowledge Management [38, 39].

Knowledge Management is the process of systematically capturing, structuring, retaining, and reusing information to develop an understanding of how a particular system (e.g. a pathway, gene, or disease) works, and subsequently to convey this information meaningfully to other information systems (knowledge distribution) [40]. Traditionally, this interpretation of data was carried out manually by humans, but the data sets produced today require computers due to their sheer size. This transition has led to a fundamental dependency on technological resources, such as networks and databases.

1.5 Integrated science

The emergence of high throughput technologies in the life sciences opens up many new exciting applications and holds promise to answer complex questions. However, to obtain new insights and knowledge, information needs to be transformed into executive summaries which are brief enough for creative studies by a human researcher. Such summaries must for example include all the relevant elements in the study, the desired measurements, protocols, and preferably quality indicators [41]. Secondly, the information must be available in a format that can be fed into a computer program capable of using it. This has made it evident that standards must be established, defining what needs to be reported in a specific domain and how this data should be reported. Interoperable file formats for exchanging data are also needed so that information can be exchanged without data loss. Several standardization initiatives have emerged out of this need, proposing standards, ontologies (controlled vocabularies), and data exchange formats in various domains. Such initiatives include the MGED consortium [42], FuGE [43], MIBBI [44], and HUPO PSI [45]. The process of standardization has advanced further in bioinformatics than in cheminformatics, but the problems of incompatible file formats and standards have also been discussed in this field [46]. The importance of standardization is recognized by many actors of large scientific and economic impact [47, 48, 49, 50].

Figure 1.4: The paradigm of gradual understanding, a central concept in knowledge management. Data (unstructured facts) are converted to Information (human-accessible data), which in turn can be transformed into Knowledge.

It is obvious that only by combining data will we be able to maximize its usefulness, and Data Integration is hence a central part of knowledge management. However, integrating data from different sources is not a trivial task, especially in the life sciences. This is partly because biological data and knowledge are very complex by nature, but also due to the fact that high-throughput instrumentation is a recent development for which the scientific community was not prepared [40]. This has left the life sciences with vast amounts of existing resources which are currently far from compatible on the data level.

There are several prominent projects in the life sciences that attack the problem of data integration. The Sequence Retrieval System (SRS) [51] links sequence data among databases and provides a uniform search engine to query for sequences in a free-text fashion. The MRS retrieval system [52] is a server that allows for very rapid cross-database searches of biological sequences. BioMart [53] is a website and Web service which allows for advanced querying of biological data sources which have been integrated in a data warehouse fashion, and has become a central point for accessing data from large data repositories such as Ensembl [54], UniProt [18], and Reactome [22]. The Distributed Annotation System (DAS) [55] is a decentralized system and protocol that allows multiple third-party annotators to provide annotations for sequences, which can be integrated by software clients on a need basis, and is used extensively in several major annotation projects.

The number of databases in the life sciences is much larger than in other disciplines [11], and the heterogeneity of data models and formats makes integration complex and demanding [56, 57]. Examples of obstacles that must be tackled include different IDs for the same resources, incompatible database schemas, and free-text annotations, which severely hinder automatic data processing. Much information is currently only available in the scientific literature as part of published papers in scientific journals, and literature mining has emerged as an important research field [58] to bring this information into the open.

Figure 1.5: A conceptual view of eScience. Users interact over a network cloud with collaborators, software services, databases, and high performance computational resources. The transparent access to all these resources is a fundamental part of eScience. The cloud is usually the Internet, but could also be a private cloud.

A key factor in achieving integrated science is that software components should be easy to integrate. Switching between applications is a tedious task in many analyses, and sometimes requires more work than the actual analysis. In computer science, tight coupling implies that software components are connected in a way that makes it difficult to replace individual parts, reducing the possibility of reusing a component later in another context. Tightly coupled software is much easier and faster to develop, since modularity and generality need not be taken into account. Loose coupling implies that components are indeed separate, and can be reused in different contexts without much hassle. They interact when necessary, but remain uncoupled from each other. It is a much harder and more time-consuming process to develop loosely coupled applications, but the benefits are large in the long run.

Due to the fact that a scientific career largely depends on the number of published papers, much software is only developed up to the point of proof-of-concept, and then left in this state. Poor software design is unfortunately also common in the life sciences, mainly (and understandably) because scientists generally have greater expertise in the life sciences domain than in software design. Over the years this has left the field of bioinformatics with a plethora of tools that solve important scientific problems, but which are not easily integrated with other components.

1.6 eScience

eScience has emerged as a promising technology for tomorrow's research. The term is used to describe the constantly increasing part of science that is computationally intensive and carried out in highly distributed network environments, or science that uses large data sets that require grid computing (see Figure 1.5). eScience is feasible due to the advancement of standardization processes and information technology, and the increasing awareness of interoperability with respect to data provisioning and software development. Many people believe that eScience will change the way science is conducted, when information, tools, and computational resources are readily available to all researchers.

Life science is one of the fields where eScience will play an important role in the future. The field is distinguished by huge amounts of heterogeneous data accessible from an ever-growing number of data sources, and a wide range of software applications for analysis. However, much development is required in order for the life sciences to take full advantage of the many promises that eScience holds. New algorithms need to be designed that can make sense of the available data, and new statistics need to be developed as the field moves towards larger and larger data sets. Existing software tools must be made transparently accessible, and larger data sets also demand more computational power. A major problem with the current eScience technologies is that they are technically complex and require researchers with extensive computer experience. This is a major hindrance to making eScience available to the majority of researchers in the life sciences; a problem which will have to be remedied in order for eScience to reach its full potential.

1.6.1 Service-orientation

The increasing amount of available data, as well as the number and diversity of software applications, call for machine-accessible interfaces in order for analysis to be effective. The term service-oriented architecture (SOA) denotes the utilization and interconnection of many, often geographically distributed, services to carry out a given task. SOA makes heavy use of Web services, which are software systems that provide machine-machine interoperable services over a network (commonly the Internet). Traditional web servers rely on web forms for invocation and only present results in HTML, whereas Web services can be invoked from any application. HTTP-based Web service technologies, like the W3C-promoted Simple Object Access Protocol (SOAP) [59] (Figure 1.6) and REpresentational State Transfer (REST) [60], are today the most commonly used Web service technologies in bioinformatics.

Figure 1.6: W3C Web services. Service Providers set up their services and publish metadata in separate documents, called WSDL files. The client reads the WSDL file and is then able to invoke the service.

SOA has a number of advantages over traditional architectures. It is cheap and standardized; there are no high costs associated with setting up a service, and there are standards to rely on that make services interoperable to a large extent. SOA is also a cheap means of obtaining rights to use software without downloading a complete server setup, such as databases and application servers. The modularity of SOA enables services to be reused in different contexts without modification, and the sharing of hardware resources reduces the time these resources are idle. While SOA in larger institutions might mean sharing of internal capabilities, the major gain for academic research is the ability to reach out and deliver interoperable functionality as services to other researchers. Global service provisioning allows for integration of algorithms and methods that was previously extremely time-consuming, and in many cases practically impossible. It is common to make such services available over the Internet, practically forming a cloud of services and data. Generalizing services to other technologies than SOAP has made the term Cloud Service increasingly common, as well as the term Software as a Service (SaaS).

The Web service technology has been embraced in the life sciences [61, 62] and numerous service providers have established hundreds of services. The European Bioinformatics Institute (EBI) provides access to databases and common analysis tools in bioinformatics [63], including Soaplab, which wraps the EMBOSS suite [64] as Web services. Entrez [65, 66, 67] is a federated search engine that allows for cross-database searches in bioscience with a Web service interface. Other large initiatives that promote Web services include the European EMBRACE Network of Excellence [68] and the Cancer Biomedical Informatics Grid (caBIG) [69].

The advent of eScience has completely changed the face of software and data integration in the life sciences in only the last few years. Old implementations are converted into interoperable Web services, and most new developments are made available via machine-accessible interfaces. This is changing how computational analysis and hypothesis generation in the life sciences are conducted. The modularity of SOA implies that it is possible to select the needed components from the available services and chain them together [70, 57].

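As a small illustration of how a client consumes a REST-style Web service, the sketch below performs a plain HTTP GET and prints the response. The endpoint URL is hypothetical; real providers such as the EBI document their own URL schemes and response formats.

```java
// Minimal sketch of invoking a REST-style Web service from a client application.
// The URL below is a placeholder, not a real service endpoint.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.org/api/sequence/P12345?format=fasta");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // print the returned record
            }
        }
    }
}
```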
1.6.2 High Performance Computing

High Performance Computing (HPC) denotes the infrastructure of hardware and software that provides scientists with the computational power, memory, and storage to carry out analyses that are not feasible on personal computers. The term includes parallel computing, distributed computation, Grid computing, and other systems with large memory and storage. In the life sciences, HPC is used to analyze large quantities of data, perform high-end simulations, and enable the study of complex multi-domain problems. Examples of fields that make heavy use of HPC include molecular dynamics [71], large scale virtual screening [72], whole-genome analyses [73], and protein folding [74]. The handling of data from the new high throughput instruments also relies on HPC resources for processing and storage.

The term 'Grid' is commonly associated with high performance computing, but it has grown to encompass more topics, for example data storage and access control. This effectively allows users of a Grid to establish a community to share data, software, and computers over a network, transforming it into a joint knowledge resource [75]. The European project Enabling Grids for E-sciencE (EGEE) [76] aims at simplifying and improving access to Grid computing for scientists by establishing a multi-science Grid infrastructure, federating some 250 resource centres world-wide into providing over 40,000 CPUs and several Petabytes of storage. A major goal of the project is to make it simpler for demanding life science operations to make use of this infrastructure. The BioinfoGRID project [77] aims at developing and promoting this and other computational Grids for bioinformatics. MyGrid [78] is an eScience Grid project that aims to help biologists and bioinformaticians perform workflow-based in silico experiments. The Cancer Biomedical Informatics Grid (caBIG) [69] is a project which is developing a service-oriented infrastructure of bioinformatics resources for cancer research.

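To make the idea of distributing independent work concrete, the following minimal sketch farms out per-item computations to a local thread pool; the same map-style pattern, scaled up to clusters or Grids, underlies many of the analyses mentioned above. The scoreCompound function is a placeholder assumption, not part of any cited system.

```java
// Minimal sketch of running independent tasks in parallel with a thread pool.
// scoreCompound is a stand-in for any expensive per-item computation.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScreen {
    static double scoreCompound(String smiles) {
        return smiles.length();   // placeholder for a real scoring function
    }

    public static void main(String[] args) throws Exception {
        List<String> library = Arrays.asList("CCO", "c1ccccc1", "CC(=O)O");
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<Future<Double>> results = new ArrayList<>();
        for (String smiles : library) {
            results.add(pool.submit(() -> scoreCompound(smiles)));
        }
        for (int i = 0; i < library.size(); i++) {
            System.out.println(library.get(i) + " -> " + results.get(i).get());
        }
        pool.shutdown();
    }
}
```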
1.6.3 Open science

The enormous quantities of data produced by the new instruments in the life sciences open up for holistic approaches to systems biology [79]. Global collaborations, the depositing of data in public repositories, interdisciplinary research, and community-developed software are democratizing research, empowering any scientist, anywhere, to participate in research even when local resources are scarce. Open science [80, 81] is a relatively new concept with four fundamental goals [82]:

• Transparency in experimental methodology, observation, and collection of data.
• Public availability and reusability of scientific data.
• Public accessibility and transparency of scientific communication.
• Using the Web to facilitate scientific collaboration.

The objectives of Open science can be summarized in the concept "Open Data, Open Standards, and Open Source" (ODOSOS). The idea behind this is that if data is publicly available in an open standard, everyone can use it to reproduce and validate results, and if the source code of algorithms is available, people can trust software implementations and easily make use of them. There are several reports on how this enables new research and increases productivity, for example in drug discovery [83]. An example of an organization which promotes ODOSOS in cheminformatics is the Blue Obelisk [84].

Open science also includes many social components. Wikis and blogs are now common terms for users of the World Wide Web, and scientists are embracing this form of communication as it allows them to rapidly reach collaborators and get responses, comments, and suggestions, or to educate others [85]. Web-based tools for social science are gaining momentum, with myExperiment [86], a portal for sharing Workflows in the life sciences, as one example.

1.6.3.1 Open source

The term Open Source refers to software development and distribution where the source code is made publicly available, and where anyone can build derivative works upon it as long as the new code is released under the same license. The concept is formally defined for computer science by the Open Source Initiative in the Open Source Definition [87], and the most prominent example is the development of the Linux operating system.

In the life sciences, open source has found many followers due to the iterative nature of science, and the benefits of allowing developers to reuse existing code and avoid 'reinventing the wheel'. One example is the parsing of file formats: a tedious task in the life sciences that developers are not keen on reproducing, but would rather rely on an existing framework to provide. The ease of use and straightforward availability have also made open source popular; simply download and use the software, without complicated signing of contracts. That open source in most cases does not impose licensing fees is important for academic research, since budgets for software are commonly limited.

There are numerous open source projects in the life sciences. Examples in bioinformatics are the Bio* toolkits [88] (including BioPerl [89] and BioJava [90]), Cytoscape [91], Strap [92], and Jalview [93]. Examples in cheminformatics are the Chemistry Development Kit (CDK) [94] and JChemPaint [95]. The general statistics framework R and Bioconductor [96] deserve mentioning, as do many of the public repositories which are developed as open source software. Many papers have been published discussing the use and importance of open source in bioinformatics [97, 98], cheminformatics [84], and drug discovery [99, 100, 83].

Open source software in the life sciences has many advantages for software and data integration. Firstly, access to the source code makes it much easier to integrate a software component than if the source is not available. Secondly, open source software commonly uses and promotes open standards and interoperable technologies, and hence the goals of open source and software compatibility commonly align well. Further, the people developing open source are usually interested in the long term survival of the project and put more effort into producing sustainable applications that can be integrated with other components, adapted to new scientific areas, and improved by the wider scientific community. The Internet has also made open source development a form of scientific collaboration, where developers have the software implementation as a means of collaborating, often applying the resulting application to different problems and data.

18 2. Aims of studies

The work presented in this thesis is centered around the concepts of data and software integration, aiming to empower scientists with the emerging technologies of eScience in order to advance research. This includes research and development of:

• A sustainable infrastructure and extensible tools that enable the use of eScience in the life sciences, and which can be adapted to the constantly changing scientific landscape
• Methods and implementations to accommodate the large data amounts generated by new high throughput technologies
• Standards, exchange formats, and ontologies for the life sciences, primarily targeted towards interdisciplinary research and drug discovery
• Protocols and best practices for interoperable services in the life sciences


3. Methods

3.1 Software integration

Software integration is a topic which comprises both integration of software that runs on the local computer (hereafter referred to as local software), and software that runs on remote computers over a network (hereafter referred to as services). Software integration and data integration are tightly coupled fields, and the easiest and most straightforward way of integrating existing applications is to feed the output of one application as input to another. If both applications can read the same file format, the problem boils down to mainly technical implementation details, but if they do not share a common serialization then information needs to be translated from one format to another. This can be a difficult process, since it might require advanced domain knowledge.

Running software on the local computer has some advantages. Firstly, users do not need to worry about security, as sensitive data need not be transferred over the network. Many scientists are reluctant to submit novel chemical structures or sequences to remote services, in fear of compromising future publications or patents. Secondly, the standard desktop computers of today are quite powerful, capable of performing a substantial amount of work. Using a networked service does imply some extra processing and data transfer, and for smaller jobs this overhead may be larger than the speedup gained from a service running on a high performance computer. Thirdly, local applications have many advantages over software running in a Web browser when it comes to the user experience, as they allow for more responsive and richer user interfaces. Fourthly, transferring large amounts of data over networks, especially the Internet, can be error prone and time-consuming. If large data resources are available locally, a local analysis might be the preferred option.

SOA with interoperable Web/Cloud services has many advantageous properties, including modularity, reusability, ease of maintenance, standardized ways of interaction, and simple construction. Security can be increased by using authentication and encryption protocols together with certificates, or by restricting services to a secure, private network. These solutions can be quite complex to set up, but are crucial for scientists with sensitive data.

3.1.1 Integration of local software

Integration of local software aims at reducing the manual work required when repetitively switching between locally installed applications. An example use case is running one piece of local software, saving results to a file, opening the file in another local application, loading the previously saved data, and continuing the analysis. This can be done by pipelining data between software, or by turning applications into components that are accessible from a unifying interface. There are several approaches to adapting an existing application for inclusion in an integrated system: 1) keep it as a standalone application and call it from the command line in batch fashion, 2) wrap the component in another application that makes it integrable with the system, and 3) if the component is open source, change the implementation to be compatible with the integrating system. In all cases, the problem of delivering applications for different operating systems and the management of dependencies needs to be handled.

There exist numerous downloadable applications in the life sciences which provide functionality for highly specialized tasks, but very few provide a unified interface with a simple out-of-the-box solution. Existing open source projects for software integration are usually focused on integrating available software applications, commonly using command line invocation on UNIX and Linux based systems. Jemboss [101] wraps around the EMBOSS [64] collection of open source bioinformatics tools using loose coupling, and users can extend functionality by adding e.g. shell commands. ISYS [102] and Gaggle [103] integrate existing software tools and data sources by wrapping them in code to interchange data between them. This requires more work, but gives a tighter integration and more advanced communication among components.

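A minimal sketch of the first adaptation approach listed above, calling a standalone tool from the command line in batch fashion and capturing its output, is shown below. The tool name and arguments are hypothetical placeholders for whatever locally installed program is being wrapped.

```java
// Minimal sketch: wrapping a command-line tool by batch invocation.
// "sometool" and its flags are invented placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class BatchWrapper {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "sometool", "--in", "input.dat", "--out", "result.dat");
        pb.redirectErrorStream(true);          // merge stderr into stdout
        Process process = pb.start();

        try (BufferedReader out = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = out.readLine()) != null) {
                System.out.println(line);      // relay the tool's log output
            }
        }
        int exitCode = process.waitFor();      // non-zero usually signals failure
        System.out.println("Exit code: " + exitCode);
    }
}
```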
3.1.1.1 Extensibility

An important feature for platforms that integrate local software is their extensibility. Most projects that fall into the integration category provide some sort of mechanism to add plugins to the application. Strap [92] is an application for protein alignment which has a plugin architecture based on the Java HotSwap [104] mechanism to replace parts of the application at runtime without recompilation. The Workflow software Taverna [105] has a plugin architecture that builds on Raven, which is a custom classloader that uses the Java build system Maven [106] at runtime to manage plugin dependencies. Cytoscape [91] is a workbench for visualizing biomolecular interaction networks which also has a custom-built plugin architecture for contributing functionality. An increasingly popular standard for component-based development is OSGi, which is a dynamic component model for the Java programming language defined by the OSGi Alliance [107]. Using OSGi, components can be added, removed, started, and stopped, and the system can detect and react upon these and other events.
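As an illustration of the OSGi component model mentioned above, the sketch below shows a bundle activator, the lifecycle hook through which a component reacts to being started and stopped by the framework. The class and package-level details are illustrative and not taken from any specific project.

```java
// Minimal sketch of an OSGi bundle activator (lifecycle callbacks).
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

public class ExampleActivator implements BundleActivator {

    @Override
    public void start(BundleContext context) {
        // Called when the bundle is started: register services,
        // allocate resources, hook into extension mechanisms, etc.
        System.out.println("Bundle started: " + context.getBundle().getSymbolicName());
    }

    @Override
    public void stop(BundleContext context) {
        // Called when the bundle is stopped: release everything acquired in start().
        System.out.println("Bundle stopped");
    }
}
```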

3.1.1.2 Web applications

One solution for delivering integrated software to users is to provide a Web front-end to a system where the integration is handled. Users can then interact with the integration system via a browser connecting to the Web application (sometimes called a portal). When such applications are run on powerful servers, they perform well on computationally expensive tasks. Another major advantage of these systems is the ease of maintenance; there is only one location where the software needs to be updated. Further, developers do not need to worry about cross-platform issues, as the software runs on a single server and is not distributed to the client computer. Examples of web portals in the life sciences include Entrez [65], the web portal of NCBI which allows for cross-database searches in various fields; CARGO (Cancer And Related Genes Online), which facilitates integration and visualization of information regarding cancer from various Internet resources [108]; and AnaBench [109], which provides integrated access to tools for sequence analysis. Web-based applications have some drawbacks, such as restricted access to the client system, and they are relatively limited in terms of user interaction, making them unsuitable for integrated applications that require many and fast updates to the graphical user interface (GUI).

3.1.2 Integration of services

The term 'service' is loosely defined; in this thesis it is taken to encompass software with programmatic (machine-accessible) interfaces that can be run locally or consumed over a network (Web services). When several services are available, an important question is how these are integrated to solve a specific problem. The obvious way is to write a small computer program (script) that calls the service, and many service providers offer libraries and snippets to simplify this.

3.1.2.1 Workflows

Writing scripts requires experience in programming, a skill that the vast majority of scientists in the life sciences do not possess. An alternative approach is to use Workflow software to coordinate services, which in theory allows for the graphical setup of complete scientific protocols (see Figure 3.1 for an example of a Workflow). This idea has received a lot of attention in the life sciences, and many initiatives provide packages for Workflow orchestration, including Taverna [105], Cyrille2 [110], and Knime [111].

Figure 3.1: An example Taverna Workflow to retrieve a number of sequences from three species (mouse, human, and rat), align them, and return a plot of the alignment result. Workflow author: Stian Soiland-Reyes. Source: http://www.myexperiment.org/workflows/821

Apart from making integration of services available to non-computer specialists, Workflows and parts of Workflows can be reused as parts of different analyses, or directly applied to similar problems. An example is the successful integration of microarray and QTL data, which were linked via pathways using a Taverna Workflow, and identified a candidate gene for Trypanosomiasis resistance [112]. The Workflow was then reused without change for Trichuris muris infection in mice, which led to the identification of new biological pathways.

The Workflow technology is still in its infancy. Constructing Workflows for the life sciences is not trivial, mainly because a lot of extra nodes are required to convert data between services. Further, long term sustainability has historically not been prioritized by service providers, as scientists push on towards the next publication. Relying on external parties may hence be risky, because when services go down, or change without notice, this can break existing Workflows. However, the more Workflows that are produced, and the more standards and ontologies evolve and become adopted, the easier it will be to develop new Workflows.

3.1.2.2 BioMOBY

BioMOBY [113, 114] is a system for interoperability in the life sciences which consists of extensible ontologies for service and data types, and a central service registry (BioMOBY Central), where service providers can register their services (see Figure 3.2). The registry can then be queried for services by name, or for services that accept a certain data type as input or output (e.g. "give me all services that accept a DNA sequence as input").

3.1.2.3 Service discovery

A problem with the SOAP and REST technologies is that users must be aware of the existence and location of the services before they can be consumed. There is no built-in functionality to query for services, hence users are currently restricted to Web searches. The EMBRACE Network of Excellence provides a registry [115] where providers can register Web services, and users can search for existing services and monitor their status. BioCatalogue [116] is a new project aiming to provide a registration point, facilities for annotations, and a query interface for Web services. The previously mentioned BioMOBY project [113] also has an established registry where services can be discovered.

Figure 3.2: Architecture of the BioMOBY system. Services are registered in the BioMOBY Central and clients can subsequently discover these via the Central, and invoke them directly. The BioMOBY Central has a registry and ontology of services and data types, which allows for looking up services based on input and output.

Figure 3.3: Eclipse is entirely made up of plugins (circles) which can depend on other plugins (lines). The core plugins of Eclipse are referred to as the Rich Client Platform (RCP). Other tools can extend Eclipse (Tool 1-3) and these can depend on Eclipse plugins, or plugins contributed by other external components.

3.1.3 Rich Clients

Rich Clients are downloadable software applications that take advantage of the many offerings of eScience. In contrast to web-based systems, Rich Clients run on the local computer and hence take full advantage of today's powerful laptops and workstations. They are equipped with a responsive GUI and allow for tight integration with the operating system (e.g. drag and drop, system tray), and usage of the local file system and devices (e.g. printers, scanners), but still have the option to invoke remote services and resources (e.g. networked servers, clusters, and databases). Rich Clients solve many shortcomings of ordinary downloadable applications, such as delivering a cross-platform product, an integrated update system, and means to consume networked services. These are quite advanced features, and this has led to the emergence of frameworks called Rich Client Platforms (RCP), on which users can base developments in order to reuse common components. The two most well known platforms are the Eclipse RCP [117] and the NetBeans Platform [118].

3.1.3.1 Eclipse

Eclipse [119] is a universal tool platform that was originally built as an integrated development environment (IDE), but over the years it has evolved into a general framework for application development and integration. In Eclipse, all code is split up into plugins, even the core modules (see Figure 3.3). A plugin in the Eclipse world is a collection of functionality that can be seamlessly integrated with other plugins, and can consist of algorithms, visualizations, data, menu options, and much more. This flexible architecture allows components to be used as building blocks, and the minimal set of plugins needed to form a complete application is collectively known as the Eclipse RCP [117]. This technology enables software developers to focus on the actual application functionality without concern for the core plumbing, which is inherited from Eclipse.

The plugin architecture of Eclipse is based on OSGi [107], offering the possibility to extend virtually any point in the framework, whether graphical or lower level, such as the underlying domain object model. To define what can be extended, Eclipse utilizes the concept of Extension points, which exist for almost all basic functionality that developers would like to extend. If this is not enough, it is straightforward to create new extension points tailored to user needs, such as for functionality in a certain domain. Eclipse is also equipped with a powerful provisioning system, which means that it is easy to publish online updates to installed software and data. The advanced help and documentation functionality of Eclipse is also a major factor in why it has become the most widely used RCP framework available today.

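To illustrate how extension points are consumed, the following sketch queries the Eclipse extension registry for contributions to a hypothetical extension point and instantiates the class each contribution declares. The extension point id, attribute names, and Calculator interface are invented for the example; actual ids and schemas are defined by each plugin in its plugin.xml.

```java
// Minimal sketch: reading contributions to a hypothetical extension point
// ("org.example.calculators") from the Eclipse extension registry.
import org.eclipse.core.runtime.IConfigurationElement;
import org.eclipse.core.runtime.Platform;

public class ExtensionLoader {

    // Hypothetical contract that contributing plugins are assumed to implement.
    public interface Calculator {
        double calculate(String molecule);
    }

    public static void runContributions(String molecule) throws Exception {
        IConfigurationElement[] elements =
                Platform.getExtensionRegistry()
                        .getConfigurationElementsFor("org.example.calculators");
        for (IConfigurationElement element : elements) {
            // Instantiate the class named in the contribution's "class" attribute.
            Calculator calc = (Calculator) element.createExecutableExtension("class");
            System.out.println(element.getAttribute("name") + ": "
                    + calc.calculate(molecule));
        }
    }
}
```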
3.2 Data integration

3.2.1 Standards

Data standardization in the life sciences aims at harmonizing differences in how studies are reported, to enable the utilization of information from different sources in order to analyze, compare, combine, validate, reproduce, and extend data and metadata. The Minimum Information standards are examples of standards which specify the minimum amount of metadata and data required to meet a specific aim in a field. The MGED consortium was a pioneer in bioinformatics with the Minimum Information About a Microarray Experiment (MIAME) [120]. Another example is the Human Proteome Organization's Proteomics Standards Initiative (HUPO-PSI), which is establishing standards and controlled vocabularies in proteomics [121], for example the Minimum Information About a Proteomics Experiment (MIAPE) [122].

3.2.2 Exchange formats

Exchange formats define how data or information is structured in a file, to enable different software applications to accurately parse and serialize the information. The most common language for this is the Extensible Markup Language (XML) [123], which is widely used in bioinformatics as an easy to use and standardized way to store self-describing data [124]. XML formats for exchanging data in the life sciences are abundant, with examples including the Chemical Markup Language (CML) for chemistry [125], the Systems Biology Markup Language (SBML) [126] for systems biology models, the Distributed Annotation System (DAS) [55] for exchanging information about annotations, UniProt XML for exchanging protein data [18], and PubChem XML for exchanging chemical structures and assays [26]. A major obstacle in data integration is the widespread use of legacy text formats, which are not easily interpreted by software and hence prone to errors. The transition from legacy formats to well structured XML is an important step towards achieving data interoperability.

Minimum Information guidelines are normally expressed as a set of specifications and guidelines in a certain area, and are usually formalized in data models, with accompanying XML-based data exchange formats and ontologies. Examples of such data exchange formats are MAGE-ML for MIAME [127], and the HUPO-PSI Molecular Interaction format for the representation of protein interaction data [128]. It has now become a requirement that microarray data must be deposited in MIAME-compliant public repositories using the MAGE-ML exchange format [127] in order to publish microarray experiments in most journals [129].

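As a small illustration of parsing a self-describing XML exchange format, the sketch below reads a tiny invented compound record with the standard Java DOM API; real formats such as CML or SBML define their own element names and schemas.

```java
// Minimal sketch: parsing a small, self-describing XML record with the DOM API.
// The <compound> format is invented for illustration only.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlExchangeExample {
    public static void main(String[] args) throws Exception {
        String xml = "<compoundList>"
                   + "<compound id=\"1\" name=\"ethanol\" formula=\"C2H6O\"/>"
                   + "<compound id=\"2\" name=\"caffeine\" formula=\"C8H10N4O2\"/>"
                   + "</compoundList>";

        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        NodeList compounds = doc.getElementsByTagName("compound");
        for (int i = 0; i < compounds.getLength(); i++) {
            Element c = (Element) compounds.item(i);
            System.out.println(c.getAttribute("name") + " (" + c.getAttribute("formula") + ")");
        }
    }
}
```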
3.2.3 Ontologies

In the life sciences, ontologies (also known as controlled vocabularies) are formal representations that are used to define concepts and their relationships in a specific domain. For example, the Gene Ontology has produced a controlled vocabulary that unifies the representation of gene and gene product attributes across all species [130]. By selecting a concept from an ontology it is clear what the meaning is, eliminating uncertainty of spelling and allowing for joining data in a coherent fashion. The need for ontologies in the life sciences has been heavily advocated [131] and is an active research field.

Open Biomedical Ontologies (OBO) is an effort to create controlled vocabularies for shared use across different biological and medical domains [132]. The objective is to allow integration across ontologies by coordinating and reforming existing and new ontologies. The result is an expanding family of ontologies designed to be interoperable and logically well formed, and to incorporate accurate representations of biological reality.

3.2.4 Semantic technologies
A promising technology for dealing with the numerous distributed resources in the life sciences is the Semantic Web [133], which promises an infrastructure of machine-understandable content; a World Wide Web made of semantically linked data and not only HTML documents [40].
The Semantic Web toolbox comprises several components. The simplest language is the Resource Description Framework (RDF) [134], which can be used to represent information in the form of so-called triples: subject, predicate, and object (see Figure 3.4a). This simple language identifies everything with a Uniform Resource Identifier (URI), and is enough to express advanced relations. An even more powerful language is the Web Ontology Language (OWL) [135], which was designed as an extension of RDF and supports, for instance, the creation of new class descriptions (collections of possible resources) and logical combinations (e.g. unions) of other classes. SPARQL (SPARQL Protocol and RDF Query Language) [136] is a query language for information stored in the form of triples, and allows for their retrieval from repositories (triple stores). Reasoning (also known as inference) in the Semantic Web context refers to the process of inferring logical consequences from a set of asserted facts or axioms, or in simple terms deriving new data from data that are already known (see Figure 3.4b). Querying is a form of inference, since search results are inferred from a mass of data.

Figure 3.4: a) A triple consists of a subject (in this case Hemoglobin), a predicate (the relation 'is a'), and an object (Oxygen Transport Protein). b) Example of inferred knowledge. The relations A->B and B->C are known, which means that we can infer the relation A->C.

It is advocated that the Semantic Web for the Life Sciences (SWLS), when realized, will dramatically improve our ability to conduct bioinformatics analyses using the vast and growing stores of web-accessible resources [56]. Several interesting projects are currently demonstrating its use. One example is the Semantic Web for Health Care and Life Sciences (HCLS) Interest Group, which has the mission to "develop, advocate for, and support the use of Semantic Web technologies for biological science, translational medicine, and health care" [137]. Another example is Linking Open Drug Data (LODD) [138], a task force of the HCLS initiative that focuses on linking the various sources of drug data available on the Web, to answer scientific questions and demonstrate how physicians and patients can take advantage of the connected data sets.
The Bio2RDF project [139] aims to transform silos of bioinformatics data into RDF/OWL, and to integrate and query across distributed knowledge bases to infer biological knowledge. The project has produced a semantic mashup of over 70 million triples built from 30 public bioinformatics data repositories, such as GO, NCBI, UniProt, KEGG, and PDB. This mashup was subsequently used to demonstrate the depiction of genes involved in dopamine neuron degeneration [140]. SADI (Semantic Automated Discovery and Integration) [141] is a framework where Web services and their output are exposed as Semantic Web resources, and where SADI can dynamically generate RDF graphs from these without the need to keep all data in a huge triple store. This exposes data that would not be found using conventional search approaches (referred to as the Deep Web [142]). The project's first implementation is an application called CardioSHARE [143], which has exposed data for cardiovascular health research. The previously mentioned BioMOBY does not make use of RDF/OWL, mainly because BioMOBY precedes the mainstream use of these technologies, but the project nevertheless exhibits behavior similar to Semantic Web applications. Several clients have been developed that take advantage of BioMOBY to discover services on the fly in exploratory biological analyses [144, 145], and it is also possible to include BioMOBY services in Taverna Workflows [146, 147].
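As a minimal sketch of these building blocks, the example below asserts the triple from Figure 3.4a in an in-memory RDF model and retrieves it with a SPARQL query. It assumes the Apache Jena library is on the classpath; the namespace and property name are invented for the example.

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public class TripleSketch {
        public static void main(String[] args) {
            String ns = "http://example.org/";
            Model model = ModelFactory.createDefaultModel();

            // The triple from Figure 3.4a: Hemoglobin -- is a --> Oxygen Transport Protein.
            Resource hemoglobin = model.createResource(ns + "Hemoglobin");
            Resource transportProtein = model.createResource(ns + "OxygenTransportProtein");
            Property isA = model.createProperty(ns, "isA");
            hemoglobin.addProperty(isA, transportProtein);

            // A SPARQL query retrieving every subject related to the object by the predicate.
            String query = "PREFIX ex: <" + ns + "> "
                    + "SELECT ?s WHERE { ?s ex:isA ex:OxygenTransportProtein }";
            QueryExecution qe = QueryExecutionFactory.create(query, model);
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.nextSolution().getResource("s"));
            }
            qe.close();
        }
    }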

3.2.5 Model-driven development
Software development is expensive and time-consuming, and specialists in a domain generally do not understand the technicalities of the process. Model-driven development (MDD) focuses on creating abstractions of a domain rather than of implementation details, with the aim of increasing interoperability and compatibility between systems. It is common to make use of the Unified Modeling Language (UML), a standardized language for creating data models [148]. The UML model can then be used to generate database schemas, implementations in most programming languages, and an XML Schema Definition [149] that defines an XML format for exchanging results according to the model. This top-down approach has several advantages: it is fast, it allows modelers to focus on modeling rather than on implementations, and if the model is updated the implementations can easily be regenerated. However, it can be difficult to make the model detailed enough to generate complete implementations. Some developers also argue that generating too much code makes people careless, and that logical errors might slip through unnoticed.
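To make the top-down idea concrete, the following sketch (not code from any of the papers) derives an XML serialization directly from a single annotated model class; JAXB annotations here stand in for the UML-to-code generation step described above, and the class and element names are hypothetical. It assumes Java 8, where JAXB is bundled with the JDK; later Java versions need an external JAXB dependency.

    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Marshaller;
    import javax.xml.bind.annotation.XmlElement;
    import javax.xml.bind.annotation.XmlRootElement;

    @XmlRootElement(name = "compound")
    public class CompoundModel {
        @XmlElement public String id;
        @XmlElement public double response;

        public static void main(String[] args) throws Exception {
            CompoundModel c = new CompoundModel();
            c.id = "mol1";
            c.response = 1.2;

            // The XML representation is derived from the annotated model, so the model
            // and its serialization cannot drift apart; regenerating after a model
            // change is trivial, which is the main selling point of the MDD approach.
            Marshaller marshaller = JAXBContext.newInstance(CompoundModel.class).createMarshaller();
            marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
            marshaller.marshal(c, System.out);
        }
    }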

3.3 Drug Discovery
3.3.1 Structure-Activity Relationship
An important task when modeling chemical and biological systems is to understand the effects of small molecules interacting with protein targets. When the structure of the target is known, structure-based techniques, such as docking and molecular dynamics simulations, can be applied. However, reliable 3D structures are not available for many protein families. In such cases a ligand-based approach, which aims to correlate structural and physico-chemical features of chemical entities with an observed biological activity, may be used.
Quantitative Structure-Activity Relationship (QSAR) is a ligand-based approach to quantitatively correlate chemical structure with a response, such as biological activity or chemical reactivity. The approach is widely adopted and has for example been used to model carcinogenicity [150, 151], toxicity [152, 153], and solubility [154, 155], and the literature is replete with QSAR studies covering problems from lead optimization [156] to fragrance design and the detection of doping in sports [157]. In QSAR, chemical structures are expressed as descriptors, which are numerical representations such as calculated properties or enumerated structural fragments. Descriptors are concatenated into a data set (see Figure 3.5), which can be subjected to statistical analysis in order to build predictive models.

Figure 3.5: The contents of a QSAR data set. Chemical structures are described mathematically using descriptors, which for example can be calculated properties (Descriptor 1) or an enumeration of the presence of structural fragments (Descriptor 2). The information about the measured effect is also added for each chemical structure (Response), and all values are concatenated into a data matrix (data set) which can be subjected to statistical analysis to build e.g. predictive models.
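A minimal sketch of how such a data matrix can be assembled is shown below; the compounds, descriptor functions, and response values are invented stand-ins, whereas a real analysis would use descriptor implementations from e.g. the CDK.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.function.ToDoubleFunction;

    public class QsarDataSetSketch {
        public static void main(String[] args) {
            // Hypothetical compounds (id, SMILES) and measured responses.
            String[][] compounds = {{"mol1", "CCO"}, {"mol2", "c1ccccc1"}};
            double[] responses = {1.2, 0.4};

            // Stand-ins for real descriptors (e.g. a calculated property or a fragment count).
            Map<String, ToDoubleFunction<String>> descriptors = new LinkedHashMap<>();
            descriptors.put("smilesLength", s -> s.length());
            descriptors.put("lowerCaseAtoms", s -> s.chars().filter(Character::isLowerCase).count());

            // Each row of the data matrix: descriptor values followed by the response.
            for (int i = 0; i < compounds.length; i++) {
                StringBuilder row = new StringBuilder(compounds[i][0]);
                for (ToDoubleFunction<String> d : descriptors.values())
                    row.append('\t').append(d.applyAsDouble(compounds[i][1]));
                row.append('\t').append(responses[i]);
                System.out.println(row);
            }
        }
    }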

QSAR is a widely used technology for safety assessment of novel substances, and has been advocated by authorities such as the OECD [158] and the FDA [159]. The pharmaceutical industry makes use of QSAR models in drug discovery, aiming to reduce the number of experiments that are performed in the wet lab [160].
Many QSAR data sets are produced with a combination of different software tools, mixed with in-house developed solutions, and the data sets are then manually set up in spreadsheets. There is currently no agreed-upon definition of descriptors, there are numerous descriptor implementations, and there is no standard for exchanging data sets in QSAR, making it virtually impossible to reproduce and validate analyses. This significantly hinders collaboration and re-use of data and models.

3.3.2 Circular fingerprints
Circular fingerprints (also known as spherical fingerprints) [161] can be used to represent the chemical environment of an atom in a form that can be processed by a computer. These fingerprints consist of atom occurrences in concentric layers radiating out from a central atom (see Figure 3.6). The first layer contains the atom currently centered upon, and each subsequent layer contains the atoms bound to those in the previous layer, continued up to an arbitrary depth. The atoms are commonly described by their atom type, which apart from the chemical element also encodes hybridization and aromaticity.
Circular fingerprints have been shown to perform well compared to other descriptors in several cheminformatics applications, such as similarity searching [162] and the prediction of absorption, distribution, metabolism, excretion, and toxicity properties [161]. An important advantage of circular fingerprints is the interpretability of predictions, since the contribution of descriptors can be graphically visualized in the original chemical structure.

Figure 3.6: An illustration of a circular fingerprint with a three-layer atom environment. The atom for which the environment is calculated is at the centre of the environment, labelled black in this figure, and the atoms in the second and third layers are labelled light and dark grey, respectively.
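A minimal sketch of the layered atom-environment idea follows; the toy molecule and atom-type labels are made up, and real implementations derive atom types with a full cheminformatics toolkit.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class CircularFingerprintSketch {
        public static void main(String[] args) {
            // Hypothetical toy molecule: atom types and an adjacency list (bonds).
            String[] atomTypes = {"C.sp3", "C.sp3", "O.sp3", "C.sp3"};
            List<List<Integer>> bonds = Arrays.asList(
                    Arrays.asList(1),        // atom 0 bonded to 1
                    Arrays.asList(0, 2, 3),  // atom 1 bonded to 0, 2 and 3
                    Arrays.asList(1),        // atom 2 bonded to 1
                    Arrays.asList(1));       // atom 3 bonded to 1

            System.out.println(environment(1, 3, atomTypes, bonds));
        }

        // Counts atom types in concentric layers around the central atom (breadth-first).
        static Map<Integer, Map<String, Integer>> environment(
                int center, int layers, String[] atomTypes, List<List<Integer>> bonds) {
            Map<Integer, Map<String, Integer>> result = new LinkedHashMap<>();
            boolean[] visited = new boolean[atomTypes.length];
            List<Integer> current = new ArrayList<>();
            current.add(center);
            visited[center] = true;
            for (int layer = 0; layer < layers && !current.isEmpty(); layer++) {
                Map<String, Integer> counts = new LinkedHashMap<>();
                List<Integer> next = new ArrayList<>();
                for (int atom : current) {
                    counts.merge(atomTypes[atom], 1, Integer::sum);
                    for (int neighbour : bonds.get(atom)) {
                        if (!visited[neighbour]) {
                            visited[neighbour] = true;
                            next.add(neighbour);
                        }
                    }
                }
                result.put(layer, counts);
                current = next;
            }
            return result;
        }
    }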

4. Results and discussion

4.1 A workbench for the life sciences (Paper I and V)
Bioclipse (Paper I) originated from the proteochemometric technology [163] developed in Professor Jarl Wikberg's research group, and the need to integrate cheminformatics with bioinformatics. Much time was spent on manual data conversions and switching between software applications, and this process did not scale. There was a need for a framework that integrated the necessary tools, and what started as a solution for proteochemometrics [164] soon grew to encompass other fields. A fruitful collaboration was established with Dr. Christoph Steinbeck's research group, at that time based at the Cologne University Bioinformatics Institute (CUBIC), and Bioclipse was officially announced in October 2005. The open source Eclipse Public License (EPL) [165] was chosen as it places no constraints on external plugin licensing; it is open to both open source and commercial plugins. Many external collaborators joined the project, and Bioclipse expanded rapidly.

Architecture and features
Bioclipse is a workbench for the life sciences implemented as a Rich Client based on Eclipse [119]. The plugin-based architecture adheres to the OSGi standard and allows the workbench to be extended in virtually any direction, and also enables the reuse of plugins from other fields in the vibrant Eclipse ecosystem. Bioclipse was equipped early on with features for importing, converting, editing, analyzing, and visualizing small molecules, proteins, sequences, and spectra (see Figure 4.1), and several mature libraries and frameworks were integrated to provide functionality, such as the Chemistry Development Kit (CDK) [94, 166] for cheminformatics and spectrum analysis, JChemPaint [95] for 2D editing of chemical structures, Jmol [167] for interactive 3D visualization (see Figure 4.2(b)), the CML file format for chemistry [168], and BioJava [90] for sequence management. As a Rich Client, Bioclipse is also capable of interacting with remote Web services (see Figure 4.1), such as the WSDbfetch service at EBI [169, 63], to download bioinformatics data from public repositories. A major factor in the success of Bioclipse was, however, its focus on the user interface, where all technical details are hidden behind graphical tools in a user-friendly workbench (see Figure 4.2(a)).

Figure 4.1: Overview of the Bioclipse architecture. Bioclipse is implemented as a Rich Client and inherits much functionality from the widely used Eclipse Rich Client Platform (RCP). Bioclipse is then extended by other plugins, which makes it possible to customize the workbench for a certain domain or use case. If a user, for example, is not interested in sequence analyses, this feature can be omitted upon installation. Bioclipse also interacts with remote Web/Cloud services, which are software available over a network (commonly the Internet).

Bioclipse 2 (Paper V) constituted a complete rewrite which provided the project with a strong foundation for integrating more components, and turned Bioclipse into a stable, scalable platform for the life sciences. The core was remodeled into a more flexible architecture, still based on Eclipse RCP but with Spring [170] added to provide dependency injection and aspect-oriented programming (AOP). These architectural changes equipped Bioclipse with a new scripting language (see Section 4.5) and made recording of GUI interaction possible. Several new graphical components were also developed, including a brand new version of the chemical 2D editor JChemPaint, a MoleculesTable for visualizing multiple molecules (see Figure 4.2(a)), an editor for working with biological sequences (see Figure 4.2(c)), and Balloon for 3D conformer generation [171]. A client for the PubChem eUtils [26] was added to enable querying for and downloading of chemical structures, and several XMPP cloud services [172] were also integrated.

Scope
The scope of Bioclipse is manifold. For developers, Bioclipse provides a framework in which to integrate and develop new functionality, and the results can then be made available together with other developers' contributions. This brings people from different fields together and promotes interdisciplinary collaborations, and an encouraging experience was when developers found points of interaction between plugins that had not been thought of from the start. For ordinary users, Bioclipse is a workbench that provides much of the functionality required on a daily basis. Advanced users can make use of the scripting functionality to automate tasks and create reproducible analyses that can be shared between parties.
Bioclipse finds its place in several of the drug discovery phases due to its inherent capability to be tailored for different use cases. Scripting and database searches can for example be applied in the lead identification phase, while interactive graphical tools together with computational models for decision support are more suitable in the lead optimization phase.


Figure 4.2: Screenshots from Bioclipse in different applications. a) Multiple chemical structures visualized in the Molecules Table (top middle panel), with the selected structure rendered in a separate 2D view (bottom right panel) and computed properties (bottom left panel). Also shown is a 3D structure where monomers are selected in the outline (top rightmost panel) and highlighted in the interactive 3D visualization component Jmol (top right panel). The JavaScript Console (bottom middle panel) can be used to execute scripting commands. b) 3D visualization of the interaction between a protein and DNA rendered in Jmol. c) A multiple sequence alignment open in the Sequence Editor.

Community
The Bioclipse community is active and constantly growing. The Bioclipse Wiki [173] and the mailing lists are central places where development is discussed and documented, and the Bioclipse Blog [174] is primarily used for announcements. Bioclipse Planet [175] is a website that aggregates several blogs on the topic of using and developing Bioclipse. A portal is available where people can report bugs or request features [176]. During the development of Bioclipse the version control system for source code was upgraded from the original Concurrent Versions System (CVS) [177] to Subversion (SVN) [178], and finally to the distributed version control system Git [179]. The latter gives a high level of flexibility and simplifies keeping multiple versions of Bioclipse in the release pipeline. It is a clearly stated goal of the Bioclipse project to further develop the infrastructure for its community, where use cases and solutions are openly discussed, and where end users and developers can share experiences and knowledge.
Bioclipse has been well received by the scientific community since its initial release, and in 2006 the project was awarded two innovation prizes: the JAX Innovation Award 3rd prize and the JAX Innovation Audience Award. Further, in November 2007 Bioclipse won the 'special prize of the jury' in the international software contest Trophées du Libre held in Soissons, France. The total number of downloads amounted in September 2009 to over 30,000, increasing by about 800 downloads each month.
Integrating the management and analysis of small molecules (e.g. drugs) with biological entities (e.g. genomic and proteomic sequences) is an important task in many fields, and Bioclipse is the first open source project to provide this functionality in a graphical workbench. We envision the need for such integrated approaches to increase as fields like proteochemometrics and genomic medicine gain in popularity [180, 181]. For the purposes of this thesis, Bioclipse laid the foundation for the other studies and came to be the central part where all functionality was integrated.

4.2 Discoverable and asynchronous services (Paper II)
The two predominant technologies for Web services, SOAP [59] and REST [60], have several drawbacks, including a lack of discoverability and the inability of services to send status notifications asynchronously. Much of this is attributable to the fact that both technologies are based on the Hypertext Transfer Protocol (HTTP), which is intrinsically synchronous (a 'pull protocol'), meaning that clients must send a request to a server in order for it to reply. This is problematic for long-running jobs, because network connections can be terminated if there is no activity within a certain time. A common workaround for SOAP services is to implement a ticketing system (also known as 'polling') where the client receives a ticket upon a request, and then repeatedly asks whether the job is finished using this ticket. This procedure not only leads to communication overhead; the most important drawback is that such ticketing systems are not standardized, and hence it is not possible to build a general client for SOAP services with such implementations. As REST services are by design stateless, they cannot maintain job state and are better suited to rapidly responding services such as data provisioning.
Extensible Messaging and Presence Protocol (XMPP) [182] is a decentralized XML routing technology that allows any entity to actively send XMPP messages to another entity. In Paper II we presented a novel approach for Cloud services based on XMPP, consisting of an XMPP Extension Protocol (XEP) that covers discovery, asynchronous communication, and the definition of data types in the service [183]. The XEP proposed by us, named IO Data, was subsequently accepted by the XMPP Council, and implementations were developed for Bioclipse and Taverna. An XMPP server, ws1.bmc.uu.se¹, was established, and services for the calculation of chemical properties using CDK [166] and for the prediction of drug susceptibility of mutated HIV proteases [184] were set up.

Figure 4.3: XMPP is well suited to providing the backbone of a 'Service Cloud', where clients like Bioclipse can invoke services on demand. Two advantages over traditional technologies are that XMPP Cloud services can be discovered, and that communication between the server and the client can be asynchronous. This is ideal for long-running operations as it enables clients like Bioclipse to continue with other tasks.

Services based on XMPP + IO Data demonstrate several advantages over traditional W3C Web services:

1. Services can be discovered.
2. Asynchronous communication eliminates the need for ad-hoc solutions like polling.
3. Input and output types defined in the service allow clients to be generated on the fly, and simplify inclusion in pipelines and workflows.
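For contrast with point 2, the ticket-based polling workaround described above could look roughly like the sketch below; JobService is a hypothetical interface, not an API from Paper II or from any particular SOAP toolkit.

    public class PollingClientSketch {
        // Hypothetical service contract for a ticket-based ("polling") SOAP service.
        interface JobService {
            String submit(String input);          // returns a ticket
            boolean isFinished(String ticket);
            String getResult(String ticket);
        }

        static String runJob(JobService service, String input) throws InterruptedException {
            String ticket = service.submit(input);
            // Each iteration is an extra round trip: the communication overhead
            // that asynchronous XMPP notifications avoid.
            while (!service.isFinished(ticket)) {
                Thread.sleep(5_000);
            }
            return service.getResult(ticket);
        }
    }

Every iteration of the loop is an extra round trip, and because the ticketing convention is not standardized, the loop has to be re-implemented for each service.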

The use of the standardized XMPP protocol opens up for integration with other technologies, for example SMS services that send notices upon service completion, or protocols specifically designed for the transfer of large files. XMPP also comes with features which are not available in SOAP/REST, such as communication between servers to form federated networks. It is straightforward to constrain servers and form a 'private cloud', but equally easy to open them up to the global XMPP network and make services available to the public (see Figure 4.3). XMPP is an authenticated network, which means that users are required to log in with a username and password. This makes it possible to apply restrictions for services on a per-user basis, as well as to monitor and log service utilization. While the above use cases are technically possible to achieve with SOAP, they are standardized in XMPP and do not require additional extensions. The many advantages over existing technologies make XMPP + IO Data an interesting candidate for the next generation of Web services in bioinformatics.

¹ The server ws1.bmc.uu.se is hosted and administered by the computing department at Uppsala Biomedical Center (BMC).

4.3 Near-real time metabolic site predictions (Paper III)
In Paper III we presented a fast and interactive implementation of site-of-metabolism (SOM) prediction, based on an existing and validated algorithm for analyzing ligands and visualizing putative metabolic sites [185]. The method is called MetaPrint2D and is based on historic metabolic reactions which are preprocessed using circular fingerprints [161] and stored in a database. When a new compound is presented to the algorithm, the probability of metabolism occurring can be calculated for each atom as a ratio of the number of times an atom with this environment is present in the products and substrates parts of the database. In this work we produced four databases, comprising the species Human, Dog, Rat, and All (the three species combined).
A plugin for Bioclipse was developed which allows near-real time MetaPrint2D calculations directly in a chemical editor. The results are visualized graphically in single molecules or in collections of molecules, with colors representing the probability of metabolism occurring at the atoms (see Figure 4.4). The time to predict new compounds is significantly lower than with existing software, making Bioclipse-MetaPrint2D the fastest available SOM tool, and the only one with near-real time updating when editing chemical structures. No access to networked computers is required, and the accuracy of the predictions has been shown to be on par with other tools in the field [186]. The results open up for using SOM in more areas than at present, and for its inclusion in earlier stages of drug discovery such as virtual screening.

Figure 4.4: Above: The structure of Valproic acid with MetaPrint2D results visualized. Atoms colored in red indicate a high probability of metabolism occurring at that site, yellow indicates medium, green low, and no coloring very low or no probability of metabolism occurring at that atom. Right: Screenshot from Bioclipse with the results from a MetaPrint2D calculation on a file comprising 5 well-known drugs.
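A minimal sketch of the occurrence-ratio idea, loosely following the description above: the environment keys and counts are invented, whereas MetaPrint2D derives them from circular fingerprints over the full reaction database.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OccurrenceRatioSketch {
        public static void main(String[] args) {
            // Hypothetical precomputed counts per atom environment:
            // {times seen at a metabolized site, times seen in substrates overall}.
            Map<String, int[]> counts = new LinkedHashMap<>();
            counts.put("envA", new int[]{120, 150});
            counts.put("envB", new int[]{5, 400});

            // The ratio serves as a simple estimate of how likely metabolism is at an
            // atom with that environment; the colors in Figure 4.4 reflect such values.
            for (Map.Entry<String, int[]> e : counts.entrySet()) {
                double ratio = (double) e.getValue()[0] / e.getValue()[1];
                System.out.printf("%s: occurrence ratio = %.2f%n", e.getKey(), ratio);
            }
        }
    }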

4.4 A new standard for QSAR (Paper IV)
In Paper IV we presented a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR data sets, comprising an open XML format (QSAR-ML) and a descriptor ontology (the Blue Obelisk Descriptor Ontology). The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a data set described by QSAR-ML has a completely reproducible setup. We also provided an implementation for Bioclipse that allows QSAR data sets to be set up and results to be exported in QSAR-ML as well as in traditional comma-separated formats. Model-driven development was used for the Bioclipse implementation: a model was crafted and transformed into an XML Schema [149], implementation code was generated with the Eclipse Modeling Framework [187], and the generated model code was extended with OSGi functionality to facilitate the addition of new descriptor implementations, both from locally installed software and via remote Web services. A GUI was also developed to hide all technical details from end users, allowing for importing of chemical structures, selecting descriptors from the Blue Obelisk Descriptor Ontology, cherry-picking local and remote descriptor providers, adding responses and metadata, and finally performing all calculations and exporting the complete data set in QSAR-ML (see Figure 4.5).

Figure 4.5: Illustration of the QSAR-ML implementation in Bioclipse. The architecture allows for the integrated formation of data sets for QSAR by importing chemical structures, selecting descriptors from the Blue Obelisk Descriptor Ontology, cherry-picking local and remote descriptor providers, adding responses and metadata, and finally calculating and exporting the complete data set.

Standardized QSAR opens up new ways to store, query, and exchange analyses, and makes it easy to join, extend, combine, and work collectively with data. The Bioclipse-based implementation greatly simplifies the integration of different descriptor calculation software, and makes it straightforward to produce QSAR-ML-compliant data sets.

4.5 A scripting language for the life sciences (Paper V)
In Paper V we introduced a novel scripting language for the life sciences called the Bioclipse Scripting Language (BSL). Technically, BSL consists of an extensible set of functions providing functionality that can be integrated into other programming languages, and Bioclipse version 2 included a reference implementation based on JavaScript.
In Bioclipse 2, all functional code contributed by plugins is structured in Bioclipse Managers; e.g. code that provides access to 3D conformer generation using Balloon [171] is available in the BalloonManager. The Manager objects are published into the scripting environment, and hence the same objects that are called from the GUI are reachable from scripts (see Figures 4.6 and 4.7). Plugins can contribute new commands to BSL, and this is the recommended way of adding functionality to Bioclipse. An example is Jmol [167], which has its own scripting language exposed in BSL by a JmolManager. Another example is the BioWS plugin, which makes the sequence alignment tool Kalign [188] available in BSL via a Web service provided by EBI [63].

Figure 4.6: Overview of the Bioclipse architecture describing the use of Managers to collect functional code. This architecture makes all functionality available from both the GUI and the Bioclipse Scripting Language.

The JavaScript Console is a canvas where users can enter commands in BSL and JavaScript to execute functionality in Bioclipse (see Figure 4.2(a)). The GUI can also be used to interact with the Console, and it is possible to drag and drop resources into the JavaScript Console to facilitate the setup of commands. The JavaScript Editor can be used for writing scripts, that is, one or more lines of BSL code collected in an executable program, which are ideal for sharing between scientists (see Figure 4.7b). The use of open and collaborative services like Gist [189] and myExperiment [86] for sharing scripts is encouraged.

Figure 4.7: a) Three different commands to query public repositories for sequences using BSL. b) BSL script to query EMBL for two DNA sequences, align them using Kalign, and write the alignment to a FASTA file. c) The first page of a graphical wizard in Bioclipse for executing the same queries as in (a).

Cross-domain scripts are powerful tools for integrating functionality in the life sciences. They provide functionality similar to Workflows, but are much faster to create. The flexibility is also higher in scripts, as they allow for conditionals and loops, and straightforward conversion between different textual data types. Looking at Taverna Workflows today, many of them contain nodes at various places comprising small scripts to handle such data conversion. The downside of scripts is that they require basic knowledge of programming, but they are very powerful for users who possess this skill. In the future we envision the possibility to convert BSL scripts to Workflows, and also to produce Workflow nodes that are able to execute BSL scripts.
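The Manager pattern can be illustrated with the generic sketch below (this is not Bioclipse's actual code): the same Java object is exposed both to code that a GUI would call and to a JSR-223 JavaScript engine, so scripts reach exactly the same functionality. It assumes a JavaScript engine such as Nashorn or GraalJS is available at runtime; the manager class and its method are hypothetical.

    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class ManagerSketch {
        // A hypothetical manager exposing one piece of domain functionality.
        public static class CdkLikeManager {
            public int heavyAtomCount(String smiles) {
                // Stub: a real manager would parse the SMILES with a cheminformatics library.
                return (int) smiles.chars().filter(Character::isLetter).count();
            }
        }

        public static void main(String[] args) throws Exception {
            CdkLikeManager manager = new CdkLikeManager();

            // GUI code would call the manager directly:
            System.out.println("From Java: " + manager.heavyAtomCount("CCO"));

            // The scripting console calls the very same object:
            ScriptEngine js = new ScriptEngineManager().getEngineByName("JavaScript");
            if (js == null) {
                System.err.println("No JavaScript engine available on this JVM");
                return;
            }
            js.put("cdk", manager);
            js.eval("print('From script: ' + cdk.heavyAtomCount('CCO'))");
        }
    }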


5. Concluding remarks

The emergence of eScience holds many promises, but it also requires substantial development to reach its full potential. It is important to structure data and information according to well-defined standards so that analysis tools can make use of them, but due to the iterative nature of science it is also important to visually inspect, validate, and interact with the information during the process. Rich Clients are ideal for these kinds of operations, and also for more exploratory ventures where the process is not known beforehand but discovered along the way. An illustrative example is the setup of QSAR data sets, which was demonstrated in Paper IV. Using the graphical tools of Bioclipse it is straightforward to experiment with different descriptors in a data set, and when satisfied the results can be exported in a standardized format (QSAR-ML). A second example is the integration of components in the Bioclipse Scripting Language, as demonstrated in Paper V. After successful data exploration, the protocol can easily be summarized in a reproducible script and exchanged between parties.
The adoption of Web services in the scientific community is an important step towards software interoperability, but is not the answer to everything. In fact, I believe that local implementations are superior in many cases, provided that the problems of cross-platform deployment and application updates are resolved. One example is MetaPrint2D (Paper III), where a local implementation was chosen over a Web service, delivering unprecedented speed in site-of-metabolism predictions and hence opening up for new applications of the algorithm. A service-based implementation would in this case have reduced performance due to communication overhead, and would also have required a constant network connection with the associated security concerns. However, Web services do have many advantages, and are sometimes the only option in cases where the underlying data or implementations are not distributable. This is common when services are based on large or proprietary data sources, are implemented in a platform-specific programming language, or require HPC resources. A mix of local and remote functionality maximizes the usefulness of end user tools, and Rich Clients are excellent for dealing with these tasks. Paper II showed how the shortcomings of today's Web services can be overcome with the XMPP protocol, and the implementation in Bioclipse demonstrated how well suited Rich Clients are for handling service discovery and concurrent, long-running jobs.
As the founder of the Bioclipse workbench (Paper I and V), I am proud of how the project has developed, and optimistic about the future. The decision to base the application on the Eclipse Rich Client Platform, and hence the OSGi framework, was very successful, and several other prominent software projects in the life sciences, such as Taverna [105] and Cytoscape [91], are now rewriting their applications to be based on this technology. The adoption of OSGi as a common architecture among software tools in the life sciences is promising, as it effectively allows for the reuse and integration of software components between applications.

An important and neglected issue in eScience is long-term sustainability. Relying on external parties for data and service provision is associated with a certain degree of risk. For example, data and functionality might be unavailable at the desired time of analysis due to e.g. network congestion or, more severely, changed or removed services. Versioning of data and services is very important in order to be able to audit and reproduce analyses at a later time, when data and services have been updated. Service mirroring, where services are duplicated over different systems, is another overlooked topic. Mirroring not only increases the speed of data transfers (as servers usually provide a faster connection if located geographically closer to the client), but also acts as a fail-safe system so that jobs can be sent to one of the mirrors if the main service goes down. Such an infrastructure is currently not commonplace in eScience, but is important for long-term sustainability. When using XMPP services, several server components running on different platforms can offer the same service, effectively mirroring it. Mirroring and load balancing with XMPP need further research and development, but show high potential for the future.

6. Acknowledgements

This work was carried out at the Department of Pharmaceutical Biosciences, Division of Pharmacology, Faculty of Pharmacy, Uppsala University.

I would like to thank the following people who have supported me during the years towards my thesis:

Jarl Wikberg, my supervisor and Professor. Thanks for all your help and support on my way towards becoming a scientist.

Dr. Christoph Steinbeck, European Bioinformatics Institute, Hinxton, UK.

Dr. Lars Carlsson, Ernst Ahlberg Helgee, and Dr. Scott Boyer, Global Safety As- sessment, AstraZeneca R&D, Mölndal, Sweden.

Professor Robert Glen and Samuel Adams, Unilever Center for Molecular Informatics, Cambridge University, UK.

Dr. Johannes Wagener, Max von Pettenkofer-Institut, Ludwig-Maximilians- Universität, Munich, Germany.

Dr. Rajarshi Guha, NIH Chemical Genomics Center, MD, USA.

Professor Peter Murray-Rust, Cambridge University, UK.

Dr. Erik Bongcam-Rudloff and Alvaro Martinez Barrio, Swedish University of Agricultural Sciences (SLU), Sweden.

Nils-Einar Eriksson, Anders Lövgren, Emil Lundberg, and Gustavo Gonzáles-Wall, BMC Computing Department, Uppsala University.

Professor Johan Åqvist, Dr. Martin Nervall, and Göran Wallin, Department of Cell and Molecular Biology, Uppsala University.

Dr. Egon Willighagen for all collaborations on Bioclipse, CDK, QSAR, and XMPP.

My fellow Bioclipse developers and group members Jonathan Alvarsson, Carl Mäsak, Arvid Berg, as well as Sviatlana Yahorava, Aleh Yahorau, Maris Lapins, Peteris Prusis, and Muhammad Junaid.

All other people working at the Department of Pharmaceutical Biosciences at BMC, especially Erica Johansson, Magnus Jansson, and Agneta Hortlund.

Many other people have made substantial contributions to the Bioclipse project: Stefan Kuhn, Dr. Mark Rijnbeek, Dr. Gilleain Torrance, Dr. Tobias Helmus, Miguel Rojas, Dr. Thomas Kuhn, Eskil Andersson, Annsofi Andersson, Bjarni Juliusson, Sofia Burvall, Dr. Jerome Pansanel, and Samuel Lampa.

Hanna, Kim, and Darta for lightening up Wednesdays.

Martin "Puff" Eklund. It's always Christmas.

My parents Ebba and Lennart and my brothers, Per and Kalle.

All my friends.

And finally, Nora, for helping me with the completion of this thesis and for bearing with me during this, quite busy, period. Looking forward to spending more quality time with you!

The research presented in this thesis was supported by grants from Swedish VR-M (04X-05957) and Uppsala University (KoF 07).

Bibliography

[1] Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409(6822), 860–921 Feb (2001).

[2] Venter, J.C. et al. The sequence of the human genome. Science 291(5507), 1304–1351 Feb (2001).

[3] Wheeler, David A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452(7189), 872–876 Apr (2008).

[4] NCBI. The NCBI Short Read Archive (SRA) of Next-Generation Sequencing Data. http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=newsncbi&part=Aug09.

[5] Complete Genomics Inc. vision. http://www.completegenomics.com/corporate/vision.aspx.

[6] 1000 Genomes Project. http://www.1000genomes.org.

[7] Brenton, Louie, Peter, Mork, Fernando, Martin-Sanchez, Alon, Halevy & Peter, Tarczy-Hornoch. Data integration and genomic medicine. J Biomed Inform 40(1), 5–16 Feb (2007).

[8] John P, Overington, Bissan, Al-Lazikani & Andrew L, Hopkins. How many drug targets are there? Nat Rev Drug Discov 5(12), 993–996 Dec (2006).

[9] Lars, Bertram & Rudolph E, Tanzi. Genome-wide association studies in Alzheimer’s disease. Hum Mol Genet 18(R2), R137–45 Oct (2009).

[10] Natalie G, Ahn & Katheryn A, Resing. Cell biology. Lessons in rational drug design for protein kinases. Science 308(5726), 1266–1267 May (2005).

[11] Michael Y, Galperin & Guy R, Cochrane. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Res 37(Database issue), D1–4 Jan (2009).

[12] D G, Higgins & P M, Sharp. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73(1), 237–244 Dec (1988).

[13] Julie D, Thompson, Toby J, Gibson & Des G, Higgins. Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics Chapter 2, Unit 2.3 Aug (2002).

[14] Christine E, Gray & Craig J, Coates. Cloning and characterization of cDNAs encoding putative CTCFs in the mosquitoes, Aedes aegypti and Anopheles gambiae. BMC Mol Biol 6(1), 16 (2005).

[15] Dennis A, Benson, Ilene, Karsch-Mizrachi, David J, Lipman, James, Ostell & Eric W, Sayers. GenBank. Nucleic Acids Res 37(Database issue), D26–31 Jan (2009).

[16] Kulikova, Tamara et al. EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res 35(Database issue), D16–20 Jan (2007).

[17] H, Sugawara, O, Ogasawara, K, Okubo, T, Gojobori & Y, Tateno. DDBJ with new system and face. Nucleic Acids Res 36(Database issue), D22–4 Jan (2008).

[18] Consortium, UniProt. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res 37(Database issue), D169–74 Jan (2009).

[19] NCBI Protein database. http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein.

[20] H M, Berman, J, Westbrook, Z, Feng, G, Gilliland, T N, Bhat, H, Weissig, I N, Shindyalov & P E, Bourne. The Protein Data Bank. Nucleic Acids Res 28(1), 235–242 Jan (2000).

[21] M, Kanehisa & S, Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28(1), 27–30 Jan (2000).

[22] G, Joshi-Tope, M, Gillespie, I, Vastrik, P, D’Eustachio, E, Schmidt, B, de Bono, B, Jassal, G R, Gopinath, G R, Wu, L, Matthews, S, Lewis, E, Birney & L, Stein. Reactome: A knowledgebase of biological pathways. Nucleic Acids Res 33(Database issue), D428–32 Jan (2005).

[23] Brazma, A et al. ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 31(1), 68–71 Jan (2003).

[24] Robert, Clark. DPRESS: Localizing estimates of predictive uncertainty. Journal of Cheminformatics 1(1), 11 (2009).

[25] Megan, Peach & Marc, Nicklaus. Combining docking with pharmacophore filtering for improved virtual screening. Journal of Cheminformatics 1(1), 6 (2009).

[26] Yanli, Wang, Jewen, Xiao, Tugba O, Suzek, Jian, Zhang, Jiyao, Wang & Stephen H, Bryant. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(Web Server issue), W623–33 Jul (2009).

[27] ChemSpider. http://www.chemspider.com.

[28] David S, Wishart, Craig, Knox, An Chi, Guo, Savita, Shrivastava, Murtaza, Hassanali, Paul, Stothard, Zhan, Chang & Jennifer, Woolsey. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34(Database issue), D668–72 Jan (2006).

[29] John, Overington. ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). J Comput Aided Mol Des 23(4), 195–198 Apr (2009).

[30] Kirill, Degtyarenko, Janna, Hastings, Paula, de Matos & Marcus, Ennis. ChEBI: an open bioinformatics and cheminformatics resource. Curr Protoc Bioinformatics Chapter 14, Unit 14.9 Jun (2009).

[31] Roger, Collier. Drug development cost estimates hard to swallow. CMAJ 180(3), 279–280 Feb (2009).

[32] J., Drews. Drug discovery: a historical perspective. Science 287(5460), 1960–1964 March (2000).

[33] Michael D, Rawlins. Cutting the cost of drug development? Nat Rev Drug Discov 3(4), 360–364 Apr (2004).

[34] Brian L, Claus & Dennis J, Underwood. Discovery informatics: its evolving role in drug discovery. Drug Discov Today 7(18), 957–966 Sep (2002).

[35] Jeffrey, Augen. The evolving role of information technology in the drug discovery process. Drug Discov Today 7(5), 315–323 Mar (2002).

[36] S A, Shaikh, T, Jain, G, Sandhu, N, Latha & B, Jayaram. From drug target to leads–sketching a physicochemical pathway for lead molecule design in silico. Curr Pharm Des 13(34), 3454–3470 (2007).

[37] Aleksejs, Kontijevskis, Jan, Komorowski & Jarl E S, Wikberg. Generalized proteochemometric model of multiple cytochrome p450 enzymes and their inhibitors. J Chem Inf Model 48(9), 1840–1850 Sep (2008).

[38] C., Zins. Conceptual approaches for defining data, information, and knowledge. Journal of the American Society for Information Science and Technology 58(4) (2007).

[39] Data, Information, Knowledge. Nature Biotechnology 10(752) (1992).

[40] Erick, Antezana, Martin, Kuiper & Vladimir, Mironov. Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform 10(4), 392–407 Jul (2009).

[41] A, Brazma. On the importance of standardisation in life sciences. Bioinformatics 17(2), 113–114 Feb (2001).

[42] MGED Society. http://www.mged.org/.

[43] Jones, Andrew R et al. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol 25(10), 1127–1133 Oct (2007).

[44] Taylor, Chris F et al. Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol 26(8), 889–896 Aug (2008).

[45] Chris F, Taylor, Henning, Hermjakob, Randall K Jr, Julian, John S, Garavelli, Ruedi, Aebersold & Rolf, Apweiler. The work of the Human Proteome Organisation’s Proteomics Standards Initiative (HUPO PSI). OMICS 10(2), 145–151 Summer (2006).

[46] M, Hann & R, Green. Chemoinformatics–a new name for an old problem? Curr Opin Chem Biol 3(4), 379–383 Aug (1999).

[47] Sharing publication-related data and materials: responsibilities of authorship in the life sciences. National Academies Press, Washington, D.C., (2003).

[48] Sharing data from large-scale biological research projects: a system of tripartite responsibility, (2003).

[49] NIH roadmap for bioinformatics and computational biology. http://nihroadmap.nih.gov/bioinformatics/index.asp.

[50] P., Arzberger, P., Schroeder, A., Beaulieu, K., Bowker, L., Laaksonen, D., Moorman, P., Uhlir & P., Wouters. Promoting Access to Public Research Data for Scientific, Economic, and Social Development. Data Science Journal 3 (2004).

[51] T, Etzold, A, Ulyanov & P, Argos. SRS: information retrieval system for molecular biology data banks. Methods Enzymol 266, 114–128 (1996).

[52] M L, Hekkelman & G, Vriend. MRS: a fast and compact retrieval system for biological data. Nucleic Acids Res 33(Web Server issue), W766–9 Jul (2005).

[53] Damian, Smedley, Syed, Haider, Benoit, Ballester, Richard, Holland, Darin, London, Gudmundur, Thorisson & Arek, Kasprzyk. BioMart–biological queries made easy. BMC Genomics 10, 22 (2009).

[54] Hubbard, T.J.P. et al. Ensembl 2009. Nucleic Acids Res 37(Database issue), D690–7 Jan (2009).

[55] R D, Dowell, R M, Jokerst, A, Day, S R, Eddy & L, Stein. The distributed annotation system. BMC Bioinformatics 2, 7 (2001).

[56] Dennis, Quan. Improving life sciences information retrieval using Semantic Web technology. Brief Bioinform 8(3), 172–182 May (2007).

[57] Lincoln, Stein. Creating a bioinformatics nation. Nature 417(6885), 119–120 May (2002).

[58] Debra L, Banville. Mining chemical and biological information from the drug literature. Curr Opin Drug Discov Devel 12(3), 376–387 May (2009).

[59] W3C. Simple Object Access Protocol (SOAP). http://www.w3.org/TR/soap/.

[60] Roy, Fielding. Architectural Styles and the Design of Network-based Software Architectures. PhD thesis, University of California, Irvine, (2000).

[61] Pieter B T, Neerincx & Jack A M, Leunissen. Evolution of web services in bioinformatics. Brief Bioinform 6(2), 178–188 Jun (2005).

[62] Vasa, Curcin, Moustafa, Ghanem & Yike, Guo. Web services in the life sciences. Drug Discov Today 10(12), 865–871 Jun (2005).

[63] Alberto, Labarga, Franck, Valentin, Mikael, Anderson & Rodrigo, Lopez. Web services at the European Bioinfor- matics Institute. Nucleic Acids Res 35(Web Server issue), W6–11 Jul (2007).

[64] P, Rice, I, Longden & A, Bleasby. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16(6), 276–277 Jun (2000).

[65] Entrez. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, (1994).

[66] G D, Schuler, J A, Epstein, H, Ohkawa & J A, Kans. Entrez: molecular biology database and retrieval system. Methods Enzymol 266, 141–162 (1996).

[67] E., Sayers & D, Wheeler. Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils). NCBI, (2004).

[68] EMBRACE Network of Excellence. http://www.embracegrid.info/.

[69] Kerry K, Kakazu, Leo W K, Cheung & Lynne, Wilkens. The Cancer Biomedical Informatics Grid (caBIG): pioneering an expansive network of information and tools for collaborative cancer research. Hawaii Med J 63(9), 273–275 Sep (2004).

[70] Xiao, Dong, Kevin E, Gilbert, Rajarshi, Guha, Randy, Heiland, Jungkee, Kim, Marlon E, Pierce, Geoffrey C, Fox & David J, Wild. Web service infrastructure for chemoinformatics. J Chem Inf Model 47(4), 1303–1307 Jul-Aug (2007).

[71] Cameron, Kiddle, Mark, Fox & Rob, Simmonds. Running Simulations in a Grid Environment. High Performance Computing Systems and Applications, Annual International Symposium on 0, 25 (2006).

[72] W Graham, Richards. Virtual screening using Grid computing: The screensaver project. Nat Rev Drug Discov 1(7), 551–555 Jul (2002).

[73] Karol, Estrada, Anis, Abuseiris, Frank G, Grosveld, Andre G, Uitterlinden, Tobias A, Knoch & Fernando, Rivadeneira. GRIMP: a web- and grid-based tool for high-speed analysis of large-scale genome-wide association using imputed data. Bioinformatics 25(20), 2750–2752 Oct (2009).

[74] Soojin, Lee, Min-Kyu, Cho, Jin-Won, Jung, Jai-Hoon, Kim & Weontae, Lee. Exploring protein fold space by secondary structure prediction using data distribution method on Grid platform. Bioinformatics 20(18), 3500–3507 Dec (2004).

[75] Veera, Talukdar, Amit, Konar, Ayan, Datta & Anamika Roy, Choudhury. Changing from computing grid to knowledge grid in life-science grid. Biotechnol J 4(9), 1244–1252 Sep (2009).

[76] Enabling Grids for E-sciencE (EGEE). http://www.chemspider.com.

[77] The BioinfoGRID Project. http://www.bioinfogrid.eu/.

[78] Robert D, Stevens, Alan J, Robinson & Carole A, Goble. myGrid: personalised bioinformatics on the information grid. Bioinformatics 19 Suppl 1, i302–4 (2003).

[79] Hiroaki, Kitano. Computational systems biology. Nature 420(6912), 206–210 Nov (2002).

[80] Michael J, Ackerman. Open science: the coming paradigm shift. J Med Pract Manage 22(4), 212–213 Jan-Feb (2007).

[81] Cameron, Neylon & Shirley, Wu. Open Science: tools, approaches, and implications. Pac Symp Biocomput , 540–544 (2009).

[82] The OpenScience Project. http://www.openscience.org/.

[83] Aled, Edwards. Open-source science to enable drug discovery. Drug Discov Today 13(17-18), 731–733 Sep (2008).

[84] Rajarshi, Guha, Michael T, Howard, Geoffrey R, Hutchison, Peter, Murray-Rust, Henry, Rzepa, Christoph, Steinbeck, Jorg, Wegner & Egon L, Willighagen. The Blue Obelisk: interoperability in chemical informatics. J Chem Inf Model 46(3), 991–998 May-Jun (2006).

[85] Antony J, Williams. Internet-based tools for communication and collaboration in chemistry. Drug Discov Today 13(11-12), 502–506 Jun (2008).

[86] David, De Roure, Carole, Goble & Robert, Stevens. The design and realisation of the myExperiment Virtual Research Environment for social sharing of workflows. Future Generation Computer Systems 25(5), 561–567 May (2009).

[87] Open Source Initiative. The Open Source Definition. http://opensource.org/docs/osd.

[88] Harry, Mangalam. The Bio* toolkits–a brief overview. Brief Bioinform 3(3), 296–302 Sep (2002).

[89] Stajich, Jason E et al. The Bioperl toolkit: modules for the life sciences. Genome Res 12(10), 1611–1618 Oct (2002).

[90] R C G, Holland, T A, Down, M, Pocock, A, Prlic, D, Huen, K, James, S, Foisy, A, Drager, A, Yates, M, Heuer & M J, Schreiber. BioJava: an open-source framework for bioinformatics. Bioinformatics 24(18), 2096–2097 Sep (2008).

[91] Paul, Shannon, Andrew, Markiel, Owen, Ozier, Nitin S, Baliga, Jonathan T, Wang, Daniel, Ramage, Nada, Amin, Benno, Schwikowski & Trey, Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11), 2498–2504 Nov (2003).

[92] C., Gille & C., Frommel. STRAP: editor for STRuctural Alignments of Proteins. Bioinformatics 17(4), 377–8 (2001).

[93] Michele, Clamp, James, Cuff, Stephen M, Searle & Geoffrey J, Barton. The Jalview Java alignment editor. Bioinformatics 20(3), 426–427 Feb (2004).

[94] Christoph, Steinbeck, Yongquan, Han, Stefan, Kuhn, Oliver, Horlacher, Edgar, Luttmann & Egon, Willighagen. The Chemistry Development Kit (CDK): an open-source Java library for Chemo- and Bioinformatics. J Chem Inf Comput Sci 43(2), 493–500 Mar-Apr (2003).

[95] S., Krause, E.L., Willighagen & C., Steinbeck. JChemPaint - Using the Collaborative Forces of the Internet to Develop a Free Editor for 2D Chemical Structures. Molecules 5, 93–98 (2000).

[96] Mark, Reimers & Vincent J, Carey. Bioconductor: an open source framework for bioinformatics and computational biology. Methods Enzymol 411, 119–134 (2006).

[97] Jason E, Stajich & Hilmar, Lapp. Open source tools and toolkits for bioinformatics: significance, and where are we? Brief Bioinform 7(3), 287–296 Sep (2006).

[98] John, Quackenbush. Open-source software accelerates bioinformatics. Genome Biol 4(9), 336 (2003).

[99] Warren L, DeLano. The case for open-source software in drug discovery. Drug Discov Today 10(3), 213–217 Feb (2005).

[100] Werner J, Geldenhuys, Kevin E, Gaasch, Mark, Watson, David D, Allen & Cornelis J, Van der Schyf. Optimizing the use of open-source software applications in drug discovery. Drug Discov Today 11(3-4), 127–132 Feb (2006).

[101] Tim, Carver & Alan, Bleasby. The design of Jemboss: a graphical user interface to EMBOSS. Bioinformatics 19(14), 1837–1843 Sep (2003).

[102] A., Siepel, A., Farmer, A., Tolopko, M., Zhuang, P., Mendes, W., Beavis & B., Sobral. ISYS: a decentralized, component-based approach to the integration of heterogeneous bioinformatics resources. Bioinformatics 17(1), 83–94 (2001).

[103] Paul T, Shannon, David J, Reiss, Richard, Bonneau & Nitin S, Baliga. The Gaggle: an open-source software system for integrating bioinformatics software and data sources. BMC Bioinformatics 7, 176 (2006).

[104] C., Gille & P. N., Robinson. HotSwap for bioinformatics: a STRAP tutorial. BMC Bioinformatics 7, 64 (2006).

[105] Tom, Oinn, Matthew, Addis, Justin, Ferris, Darren, Marvin, Martin, Senger, Mark, Greenwood, Tim, Carver, Kevin, Glover, Matthew R, Pocock, Anil, Wipat & Peter, Li. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17), 3045–3054 Nov (2004).

[106] Maven. http://maven.apache.org/.

[107] OSGi Alliance. http://www.osgi.org.

[108] Ildefonso, Cases, David G, Pisano, Eduardo, Andres, Angel, Carro, Jose M, Fernandez, Gonzalo, Gomez-Lopez, Jose M, Rodriguez, Jaime F, Vera, Alfonso, Valencia & Ana M, Rojas. CARGO: a web portal to integrate customized biological information. Nucleic Acids Res 35(Web Server issue), W16–20 Jul (2007).

[109] E., Badidi, C., De Sousa, B. F., Lang & G., Burger. AnaBench: a Web/CORBA-based workbench for biomolecular sequence analysis. BMC Bioinformatics 4, 63 (2003).

[110] Mark W E J, Fiers, Ate, van der Burgt, Erwin, Datema, Joost C W, de Groot & Roeland C H J, van Ham. High-throughput bioinformatics with the Cyrille2 pipeline system. BMC Bioinformatics 9, 96 (2008).

[111] Michael R., Berthold, Nicolas, Cebron, Fabian, Dill, Thomas R., Gabriel, Tobias, Kötter, Thorsten, Meinl, Peter, Ohl, Christoph, Sieb, Kilian, Thiel & Bernd, Wiswedel. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007). Springer, (2007).

[112] Paul, Fisher, Cornelia, Hedeler, Katherine, Wolstencroft, Helen, Hulme, Harry, Noyes, Stephen, Kemp, Robert, Stevens & Andrew, Brass. A systematic strategy for large-scale analysis of genotype phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res 35(16), 5625–5633 (2007).

[113] Mark D, Wilkinson & Matthew, Links. BioMOBY: an open source biological web services proposal. Brief Bioinform 3(4), 331–341 Dec (2002).

[114] Mark, Wilkinson, Heiko, Schoof, Rebecca, Ernst & Dirk, Haase. BioMOBY successfully integrates distributed heterogeneous bioinformatics Web Services. The PlaNet exemplar case. Plant Physiol 138(1), 5–17 May (2005).

[115] S, Pettifer, D, Thorne, P, McDermott, T, Attwood, J, Baran, J C, Bryne, T, Hupponen, D, Mowbray & G, Vriend. An active registry for bioinformatics web services. Bioinformatics 25(16), 2090–2091 Aug (2009).

[116] BioCatalogue. http://www.biocatalogue.org/.

[117] Eclipse Rich Client Platform. http://www.eclipse.org/home/categories/rcp.php.

[118] NetBeans Platform. http://platform.netbeans.org/.

[119] Eclipse. http://www.eclipse.org.

[120] Brazma, A et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 29(4), 365–371 Dec (2001).

[121] Lennart, Martens, Luisa Montecchi, Palazzi & Henning, Hermjakob. Data standards and controlled vocabularies for proteomics. Methods Mol Biol 484, 279–286 (2008).

[122] Taylor, Chris F et al. The minimum information about a proteomics experiment (MIAPE). Nat Biotechnol 25(8), 887–893 Aug (2007).

[123] W3C. Extensible Markup Language (XML). http://www.w3.org/XML/.

[124] Philipp N, Seibel, Jan, Kruger, Sven, Hartmeier, Knut, Schwarzer, Kai, Lowenthal, Henning, Mersch, Thomas, Dandekar & Robert, Giegerich. XML schemas for common bioinformatic data types and their application in workflow systems. BMC Bioinformatics 7, 490 (2006).

[125] P., Murray-Rust & H. S., Rzepa. Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. Journal of Chemical Informatics and Computer Sciences 39, 928–942 (1999).

[126] Hucka, M et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4), 524–531 Mar (2003).

[127] Spellman, Paul T et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 3(9), RESEARCH0046 Aug (2002).

[128] Hermjakob, Henning et al. The HUPO PSI's molecular interaction format–a community standard for the representation of protein interaction data. Nat Biotechnol 22(2), 177–183 Feb (2004).

[129] Microarray standards at last. Nature 419(6905), 323 Sep (2002).

[130] Ashburner, M et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1), 25–29 May (2000).

[131] R., McEntire. Ontologies in the life sciences. The Knowledge Engineering Review 17(1), 77–80 (2002).

[132] Smith, Barry et al. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25(11), 1251–1255 Nov (2007).

[133] Tim, Berners-Lee, James, Hendler & Ora, Lassila. The Semantic Web. Scientific American May (2001).

[134] W3C. Resource Description Framework (RDF) Model and Syntax Specification. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/.

[135] W3C. OWL Web Ontology Language Reference. http://www.w3.org/TR/owl-ref/.

[136] W3C. SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/.

[137] Semantic Web Health Care and Life Sciences (HCLS) Interest Group. http://www.w3.org/2001/sw/hcls/.

[138] Linking Open Drug Data (LODD). http://esw.w3.org/topic/HCLSIG/LODD.

[139] Bio2RDF. http://bio2rdf.org/.

[140] Francois, Belleau, Marc-Alexandre, Nolin, Nicole, Tourigny, Philippe, Rigault & Jean, Morissette. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform 41(5), 706–716 Oct (2008).

[141] Semantic Automated Discovery and Integration (SADI). http://sadiframework.org/.

[142] Bin, He, Mitesh, Patel, Zhen, Zhang & Kevin Chen-Chuan, Chang. Accessing the deep web. Commun. ACM 50(5), 94–101 (2007).

[143] CardioSHARE. http://cardioshare.icapture.ubc.ca/.

[144] Michael, DiBernardo, Rachel, Pottinger & Mark, Wilkinson. Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework. J Biomed Inform 41(5), 837–847 Oct (2008).

[145] Paul M K, Gordon & Christoph W, Sensen. Seahawk: moving beyond HTML in Web-based bioinformatics analysis. BMC Bioinformatics 8, 208 (2007).

[146] Edward, Kawas, Martin, Senger & Mark D, Wilkinson. BioMoby extensions to the Taverna workflow management and enactment software. BMC Bioinformatics 7, 523 (2006).

[147] Mark, Wilkinson. Gbrowse Moby: a Web-based browser for BioMoby Services. Source Code Biol Med 1, 4 (2006).

[148] Unified Modeling Language. http://www.uml.org/.

[149] XML Schema language. http://www.w3.org/XML/Schema.

[150] Helma, Christoph. Lazy Structure-Activity Relationships (LAZAR) for the Prediction of Rodent Carcinogenicity and Salmonella Mutagenicity. Molecular Diversity 10, 147–158 (2006).

[151] Helguera, Aliuska Morales, Gonzalez, Maykel Perez, Dias Soeiro Cordeiro, Maria Natalia & Cabrera Perez, Miguel Angel. Quantitative Structure - Carcinogenicity Relationship for Detecting Structural Alerts in Nitroso Compounds: Species, Rat; Sex, Female; Route of Administration, Gavage. Chem. Res. Toxicol. 21(3), 633–642 (2008).

[152] Spycher, Simon, Smejtek, Pavel, Netzeva, Tatiana I. & Escher, Beate I. Toward a Class-Independent Quantitative Structure-Activity Relationship Model for Uncouplers of Oxidative Phosphorylation. Chem. Res. Toxicol. 21(4), 911–927 (2008).

[153] Guha, R. & Schürer, S.C. Utilizing High Throughput Screening Data for Predictive Toxicology Models: Protocols and Application to MLSCN Assays. J. Comp. Aid. Molec. Des. 22(6–7), 367–384 (2008).

[154] Johnson, S.R., Chen, X.Q., Murphy, D. & Gudmundsson, O. A Computational Model for the Prediction of Aqueous Solubility That Includes Crystal Packing, Intrinsic Solubility, and Ionization Effects. Mol. Pharmaceutics 4(4), 513–523 (2007).

[155] Yan, A. & Gasteiger, J. Prediction of Aqueous Solubility of Organic Compounds Based on 3D Structure Representation. J. Chem. Inf. Comput. Sci. 43, 429–434 (2003).

[156] Gedeck, Peter & Lewis, Richard A. Exploiting QSAR models in lead optimization. Curr Opin Drug Discov Devel 11(4), 569–575 Jul (2008).

[157] Cannon, E.O., Bender, A., Palmer, D.S. & Mitchell, J.B.O. Chemoinformatics-Based Classification of Prohibited Substances Employed for Doping in Sport. J. Chem. Inf. Model. 46(6), 2369–2380 (2006).

[158] OECD. Decision-Recommendation of the Council on the Systematic Investigation of Existing Chemicals. http://webdomino1.oecd.org/horizontal/oecdacts.nsf/linkto/C(87)90.

[159] FDA. Guidance for Industry: Genotoxic and Carcinogenic Impurities in Drug Substances and Products: Recommended Approaches. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm079235.pdf.

[160] Valerio, L.G., Jr. In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol Aug (2009).

[161] Glen, Robert C, Bender, Andreas, Arnby, Catrin H, Carlsson, Lars, Boyer, Scott & Smith, James. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME. IDrugs 9(3), 199–204 Mar (2006).

[162] Bender, Andreas, Mussa, Hamse Y, Glen, Robert C & Reiling, Stephan. Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance. J Chem Inf Comput Sci 44(5), 1708–1718 Sep-Oct (2004).

[163] Wikberg, Jarl E S & Mutulis, Felikss. Targeting melanocortin receptors: an approach to treat weight disorders and sexual dysfunction. Nat Rev Drug Discov 7(4), 307–323 Apr (2008).

[164] Spjuth, Ola, Eklund, Martin & Wikberg, Jarl E S. An Information System for Proteochemometrics. CDK News 2(2) (2005).

[165] Eclipse Public License. http://www.eclipse.org/legal/epl-v10.html.

[166] Steinbeck, C., Hoppe, C., Kuhn, S., Floris, M., Guha, R. & Willighagen, E.L. Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics. Curr Pharm Des 12(17), 2111–20 (2006).

[167] Jmol. http://jmol.sourceforge.net/.

[168] Willighagen, E.L. Processing CML Conventions in Java. Molecules 4 (2001).

[169] Pillai, S., Silventoinen, V., Kallio, K., Senger, M., Sobhany, S., Tate, J., Velankar, S., Golovin, A., Henrick, K., Rice, P., Stoehr, P. & Lopez, R. SOAP-based services provided by the European Bioinformatics Institute. Nucleic Acids Res 33(Web Server issue), W25–8 (2005).

[170] Spring. http://www.springsource.org.

[171] Vainio, Mikko J & Johnson, Mark S. Generating conformer ensembles using a multiobjective genetic algorithm. J Chem Inf Model 47(6), 2462–2474 Nov-Dec (2007).

[172] Wagener, Johannes, Spjuth, Ola, Willighagen, Egon L & Wikberg, Jarl E S. XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous Web services. BMC Bioinformatics 10, 279 (2009).

[173] Bioclipse wiki. http://wiki.bioclipse.net.

[174] Bioclipse blog. http://bioclipse.blogspot.com/.

[175] Planet Bioclipse. http://planet.bioclipse.net/.

[176] Bioclipse Bugzilla. http://bugs.bioclipse.net/.

[177] CVS - Concurrent Versions System. http://www.nongnu.org/cvs/.

[178] Subversion. http://subversion.tigris.org/.

[179] Git. http://git-scm.com/.

[180] Rognan, D. Chemogenomic approaches to rational drug design. Br J Pharmacol 152(1), 38–52 Sep (2007).

[181] Hopkins, Andrew L & Groom, Colin R. The druggable genome. Nat Rev Drug Discov 1(9), 727–730 Sep (2002).

[182] XMPP Standards Foundation. http://xmpp.org/.

[183] XEP-0244: IO Data. http://xmpp.org/extensions/xep-0244.html.

[184] Lapins, Maris, Eklund, Martin, Spjuth, Ola, Prusis, Peteris & Wikberg, Jarl E S. Proteochemometric modeling of HIV protease susceptibility. BMC Bioinformatics 9, 181 (2008).

[185] Boyer, Scott, Arnby, Catrin Hasselgren, Carlsson, Lars, Smith, James, Stein, Viktor & Glen, Robert C. Reaction site mapping of xenobiotic biotransformations. J Chem Inf Model 47(2), 583–590 Mar-Apr (2007).

[186] Afzelius, Lovisa, Arnby, Catrin Hasselgren, Broo, Anders, Carlsson, Lars, Isaksson, Christine, Jurva, Ulrik, Kjellander, Britta, Kolmodin, Karin, Nilsson, Kristina, Raubacher, Florian & Weidolf, Lars. State-of-the-art tools for computational site of metabolism predictions: comparative analysis, mechanistical insights, and future applications. Drug Metab Rev 39(1), 61–86 (2007).

[187] Eclipse Modeling Framework project. http://www.eclipse.org/modeling/emf/.

[188] Lassmann, Timo & Sonnhammer, Erik L L. Kalign–an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6, 298 (2005).

[189] Gist. http://gist.github.com/.
