COVER FEATURE From Molecule to Man: Decision Support in Individualized E-Health

Peter M.A. Sloot and Alfredo Tirado-Ramos, University of Amsterdam
Ilkay Altintas, University of California, San Diego, and University of Amsterdam
Marian Bubak, AGH University of Science and Technology
Charles Boucher, Utrecht University

Computer science provides the language needed to study and understand complex multiscale, multiscience systems. ViroLab, a grid-based decision-support system, demonstrates how researchers can now study diseases from the DNA level all the way up to medical responses to treatment.

40 Computer Published by the IEEE Computer Society 0018-9162/06/$20.00 © 2006 IEEE

Complex human systems include unique and distinguishable components: from biological cells made of thousands of molecules, to immune systems built from billions of cells, to our society of more than 6 billion interacting individuals. Each gene in a cell, each cell in an immune system, and each individual in a society possesses characteristic behavior and provides unique contributions to the system.

The complete cascade (from genome, proteome, metabolome, and physiome to health) forms multiscale, multiscience systems and crosses many orders of magnitude in temporal and spatial scales [1], as Figure 1 shows. The interactions between these systems create exquisite multitiered networks, with each component in nonlinear contact with many interaction partners. These networks aren’t just complicated, they’re complex. Understanding, quantifying, and handling this complexity is one of the biggest scientific challenges of our time [2].

Computer science provides the language needed to study and understand these systems. Computer system architectures reflect the same laws and organizing principles used to build individualized biomedical systems, which can account for variations in physiology, treatment, and drug response.

Figure 1. Time and space. Studying drug response in infectious diseases requires multiscale, multiscience models and techniques to cover the huge spatial and temporal scales. (Panels: genomics, proteomics, immunology, and medical treatment, studied through in vivo, in vitro, and in silico experimentation. Time axis: 10^-14 sec to years. Space axis: 10^-10 m to meters.)

PUSHING AND PULLING

An application pull has occurred in biomedicine with the move to in silico studies, which augment in vivo and in vitro studies by simulating more details of biomedical processes. Using these simulated processes helps medical doctors make decisions by exploring different scenarios. Preoperative simulation and visualization of vascular surgery [3] and expert systems for drug ranking [4] are two examples of such processes.

At the same time, a technology push is occurring in computing resources and data availability [5]. In the field of high-performance computing, as computing advanced from sequential to parallel to distributed, killer applications moved from mathematics to physics, chemistry, biology, and now to medicine. In addition, advances in Internet technology and grid computing [6] have made huge amounts of data available from sensors, experiments, and simulations.

Still, significant computational, integration, collaboration, and interaction gaps exist between the observed application pull and the technology push.

Bridging the gaps

Closing the computational gap in systems biology requires constructing, integrating, and managing a plethora of models. A bottom-up, data-driven approach won’t work for this. Integrating often incompatible applications and tools for data acquisition, registration, storage, organization, analysis, and presentation requires using Web and grid services.

Even if we can solve the computational and integration challenges, we still need a system-level approach to close the collaboration and interaction gap. Such an approach would involve sharing processes, data, information, and knowledge across geographic and organizational boundaries within the context of distributed, multidisciplinary, and multiorganizational collaborative teams, or virtual organizations.

Finally, we need intuitive methods to dynamically streamline these processes depending on their availability, their reliability, and the specific interests of medical doctors, surgeons, clinical experts, researchers, and other end users. Scientific workflows, in which a workflow language expresses the flow of data and action from one step to another, provide one option for capturing such methods [7,8]. Figure 2 illustrates a general scheme for conducting e-science research.
ViroLab (www.virolab.org), a grid-based decision-support system (DSS) for infectious diseases, consists of modules, such as those that Figure 2 shows, for individualized drug ranking in human immunodeficiency virus (HIV) diseases. We used the complex HIV drug-resistance problem as a prototype for our system-level approach for two reasons. First, HIV drug resistance is a growing problem worldwide, with combination therapy with antiretroviral drugs failing to completely suppress the virus in a considerable number of HIV-infected patients. Second, HIV drug resistance is one of the few areas in medicine where genetic information is widely available and has been used for many years. As a consequence, large numbers of complex genetic sequences are available, in addition to clinical data.

Figure 2. General architecture for e-science research. Information systems integrate available data with data from specialized instruments and sensors into distributed repositories. The systems then execute computational models using the integrated data, providing large quantities of model output data, which is mined and processed to extract useful knowledge. (Diagram labels: instruments and sensors; information systems; databases; integrated system; design and execute model; feed model data and information; validate model; provide model output; simulation data; mine data; patterns, information, knowledge.)

November 2006 41

COLLABORATIVE DSS

During the past decade, researchers have made significant progress in treating patients with viral diseases. For example, pharmaceutical companies now offer nearly 20 antiretroviral drugs for HIV treatment. However, to completely suppress the virus, patients must take a combination of at least two of the four different classes of antiretroviral drugs [9].

In a significant proportion of patients, however, the drugs fail to completely suppress the viral disease, resulting in the rapid selection of drug-resistant viruses and loss of drug effectiveness. This complicates the clinician’s decision process, since clinical interpretation is based on data sets relating mutations to changes in drug sensitivity and relating mutations present in the virus to clinical responses to specific treatment regimens.

Interpretation tools

In recent years, researchers have developed several genotypic resistance-interpretation tools that help clinicians and virologists choose effective therapeutic alternatives. However, there’s significant discordance among available systems for interpreting HIV genotypic resistance. There’s an urgent need for a joint effort to develop, validate, and publish standardized rules, as well as definition criteria for genotypic-resistance interpretation, and to provide accessible interpretation tools that help make genotypic assay results more clinically useful.

Applying artificial intelligence and computational techniques to biomedicine has resulted in the development of computer-based DSSs. Recent developments in distributed computing further allow the virtualization of the massive data, computational, and software resources that complex DSSs require.

ViroLab’s goal is to provide a virtual laboratory where researchers and medical doctors have easy access to distributed simulations and can share, process, and analyze virological, immunological, clinical, and experimental infectious disease data. Currently, virologists browse journals, select results, compile them for discussion, and derive rules for ranking and making decisions. ViroLab advances the state of the art by offering clinicians a distributed virtual laboratory securely accessible from their hospitals and institutes throughout Europe.

Under a typical usage scenario for ViroLab:

• A scientist from a clinical and epidemiological virology laboratory in Utrecht, Netherlands, securely accesses virus sequence, amino acid, or mutations data from a hospital AIDS lab in Rome using grid technology components running in Stuttgart, Germany.
• The scientist applies quality indicators needed for data-provenance tracking using provenance-server components running in Krakow, Poland.
• Researchers use this data as input to (molecular dynamics) simulations and immune system simulations running on grid nodes that reside at University College London and the University of Amsterdam.
• The virtualized DSS automatically derives metarules.
• Intelligent system components from Amsterdam use first-order logic to clean rules, identify conflicts and redundancy, and check logical consistency.
• The scientist validates new rules that the system automatically uploads into the virtualized DSS.
• The system presents a new ranking.

Advanced environment

ViroLab facilitates medical knowledge discovery and decision support for drug resistance, providing medical doctors with a rule-based, distributed DSS to rank drugs targeted at patients. Its infrastructure provides virologists with an advanced environment to study trends on an individual, population, and epidemiological level. That is, by virtualizing the hardware, compute infrastructure, and databases, the virtual laboratory will offer a user-friendly environment, with tailored workflow templates to harness and automate such diverse tasks as data archiving, integration, mining, and analysis; modeling and simulation; and integrating biomedical information from viruses (proteins and mutations), patients (viral load), and the literature (drug-resistance experiments).

A DSS and data analysis tools are at the center of the ViroLab distributed virtual laboratory. One such interpretation tool, Retrogram, estimates the sensitivity for available drugs by interpreting a patient’s genotype using mutational algorithms that experts developed based on scientific literature, taking into account the published data relating genotype to phenotype. The ranking is also based on data from clinical studies of the relationship between the presence of particular mutations and the clinical or virological outcome.

For the system to support grid-based distributed data access and computation, virtualization of its components is important. ViroLab includes advanced tools for biostatistical analysis, visualization, modeling, and simulation that enable prediction of the temporal virological and immunological response of viruses with complex mutation patterns for drug therapy, as Figure 3 shows.

ViroLab architecture

In ViroLab, each experiment is a set of interconnected activities. The ViroLab system’s design guarantees the interaction between a user and running applications, similar to methods used in real experiments, so the user can change a selected set of input data or parameters at runtime.
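The rule-based ranking and the rule-cleaning step in the usage scenario can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the `Rule` structure, the additive penalty scores, and the drug and mutation names are hypothetical placeholders, not Retrogram's rules or any ViroLab interface, and the rule check is a plain key comparison rather than the first-order-logic reasoning mentioned in the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if every listed mutation is present in the
    genotype, add `penalty` to the resistance score of `drug`."""
    drug: str
    mutations: frozenset
    penalty: int


def resistance_scores(genotype, rules, drugs):
    """Sum the penalties of all rules whose mutations occur in the genotype."""
    genotype = frozenset(genotype)
    scores = {drug: 0 for drug in drugs}
    for rule in rules:
        if rule.mutations <= genotype:  # subset test: rule fires
            scores[rule.drug] += rule.penalty
    return scores


def rank_drugs(genotype, rules, drugs):
    """Order drugs from lowest to highest resistance score (ties by name)."""
    scores = resistance_scores(genotype, rules, drugs)
    return sorted(drugs, key=lambda drug: (scores[drug], drug))


def check_rules(rules):
    """Split repeated (drug, mutations) rules into redundant ones (same
    penalty as the first occurrence) and conflicting ones (different penalty)."""
    seen = {}
    redundant, conflicting = [], []
    for rule in rules:
        key = (rule.drug, rule.mutations)
        if key not in seen:
            seen[key] = rule.penalty
        elif seen[key] == rule.penalty:
            redundant.append(rule)
        else:
            conflicting.append(rule)
    return redundant, conflicting
```

With one rule that penalizes the placeholder `drug_A` when mutation `M1` is present, a genotype containing `M1` ranks `drug_A` last. Running `check_rules` before ranking matters in this sketch because a redundant rule would otherwise be counted twice.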

In addition to the DSS, patient databases, data analysis tools, and simulation software, ViroLab’s runtime system consists of:

• a distributed, fault-tolerant registry for storing, updating, and publishing semantic information about available resources and executed applications;
• a tool to compose new experiments or modify experiments already performed;
• an execution engine to enact workflows according to data and action flow; and
• a scheduler for dynamic selection of resources for efficiently running a given experiment.
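How a registry, a scheduler, and an execution engine fit together can be shown with a minimal sketch. Everything here is an assumption made for illustration: the registry is a plain dictionary, the scheduler simply picks the least-loaded resource that offers a service, and the step names are hypothetical. It is not ViroLab's runtime, and it needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter


def pick_resource(registry, service):
    """Scheduler: choose the least-loaded registered resource offering `service`."""
    candidates = [name for name, info in registry.items()
                  if service in info["services"]]
    if not candidates:
        raise LookupError(f"no registered resource offers {service!r}")
    return min(candidates, key=lambda name: (registry[name]["load"], name))


def run_workflow(steps, deps, registry, initial):
    """Execution engine: run steps in dependency order.

    steps: step name -> (required service, function)
    deps:  step name -> set of predecessor step names (every step is a key)
    A resource is selected only when a step is about to run (lazy scheduling).
    """
    results, log = {}, []
    for name in TopologicalSorter(deps).static_order():
        service, func = steps[name]
        resource = pick_resource(registry, service)
        # Root steps receive the initial input; others receive the
        # outputs of their predecessors, in sorted name order.
        inputs = [results[p] for p in sorted(deps.get(name, ()))] or [initial]
        results[name] = func(*inputs)
        registry[resource]["load"] += 1
        log.append((name, resource))
    return results, log
```

Because the resource is chosen per step at run time, a change in the registry (a node going away, or its load rising) affects only the steps that have not yet run.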

ViroLab workflows enable dynamic workflow execution, lazy scheduling, and runtime recomposition. They also support two levels of abstraction needed to operate separately on abstract workflows (workflow templates) and on concrete workflow instances (executables).

Figure 3. ViroLab data and control flow schematic. Manual wet lab is automated and virtualized, and the resulting data is fed to anonymizing components, as well as directly to the DSS to be ranked. Simulation components enhance output rankings, which are stored before being applied to rule-based algorithms and then fed back for prediction of the virus’s drug sensitivity. (Diagram labels include: lab work, automation and virtualization, single hospital, data anonymizer, central services, data archive, biostatistics, rule base, scientific literature, and Retrogram+.)

Development of a virtual laboratory faces some major challenges, however, including:

• the highly distributed and heterogeneous nature of virological, immunological, clinical, and experimental data;
• the high dimensionality and complexity of the genetic and patient data; and
• the inaccessibility and (lack of) interoperability of advanced modeling, simulation, and analysis tools.

Recent advances in grid computing tackle these problems by virtualizing the resources (data, instruments, compute nodes, tools, and users) and making them transparently available. In grid computing, the virtual organization is the basic unit. Such an organization is a set of grid entities (individuals, institutions, applications, services, or resources) that are related to each other by some level of trust. Figure 4 summarizes these ideas.

ViroLab users can verify and identify the data’s origin and rerun experiments when required. ViroLab extends this feature by categorizing the level of information, including the data and workflow process. The collected data-provenance information is archived in ViroLab’s portal and accessible through search and discovery methods. Examples of provenance information are:

• keeping track of the level of information to be saved,
• the format of information and where to save it,
• dynamic data and parameter changes during runtime and in time,
• saving workflow instances, and
• the information on how and by whom the run was made.

Technical requirements for building such a system include:

• efficient data management;
• integration and analysis;
• error detection;
• recovery from failures;
• logging information for each workflow;
• allowing status checks on running workflows;
• on-the-fly updates;
• detached execution of data- and compute-intensive tasks;
• visualization and image processing on the data flowing through the analysis steps;
• semantics; and
• metadata-based data access, authentication, and authorization.

Introducing different heterogeneous distributed network computing systems, data sources, and instruments creates additional technical challenges.
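The provenance items listed above (who made a run, with which inputs and parameters, and what changed at runtime) map naturally onto a small record type. The sketch below is one possible shape, not ViroLab's provenance server: the field names, the SHA-256 input digest, and the helper functions are all assumptions introduced for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def digest(data: bytes) -> str:
    """Fingerprint of the input data, so its origin can be verified later."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    workflow_id: str      # which workflow instance was run
    user: str             # by whom the run was made
    parameters: dict      # parameter values currently in effect
    input_digest: str     # fingerprint of the input data
    started: str          # when the run started
    changes: list = field(default_factory=list)  # runtime parameter changes


def start_run(workflow_id, user, parameters, input_data):
    """Create the record at the start of a workflow run."""
    return ProvenanceRecord(workflow_id, user, dict(parameters),
                            digest(input_data), _now())


def change_parameter(record, name, value):
    """Apply a runtime parameter change and log the old and new values."""
    old = record.parameters.get(name)
    record.parameters[name] = value
    record.changes.append({"time": _now(), "name": name,
                           "old": old, "new": value})


def runs_by_user(records, user):
    """Simple search over archived records."""
    return [r for r in records if r.user == user]
```

Keeping the change log append-only means a run can be replayed with the parameter values that were in effect at each point, which is what rerunning an experiment requires.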

Figure 4. ViroLab system architecture. Distributed resources (computing elements, data, and storage) that the biomedical applications use are coordinated with the grid middleware and a grid runtime system. (Diagram labels include: virtual lab management; presentation through PDA client and Web browser; session manager; runtime system; data access; rule-based system with rule engines, rule derivers, and rule elucidators; scientific tools; collaboration; provenance; grid middleware (GLOBUS, EDG, CrossGrid, other); and grid resources in the virtual organization, split into computational resources and data resources.)

ViroLab interactivity

In the ViroLab context, the availability of grid services and tools for interactive compute- and data-intensive applications presents an important research problem. Here we build on the European Union IST CrossGrid Project [10], which developed a unified approach for running interactive distributed applications on the grid by providing solutions to the following issues:

• automatic porting of applications to grid environments;
• user interaction services for interactive startup of applications, online output control, parameter study, and runtime steering;
• advanced user interfaces that enable easy plug-in of applications and tools, like interactive performance analysis combined with online monitoring;
• scheduling of distributed interactive applications;
• benchmarking and performance prediction; and
• optimization of data access to different storage systems.

We recently tested these functionalities in a system that supports grid-based vascular reconstruction through bypass surgery by automating the process flow of MRI scan data, 3D visualization, and bypass creation and evaluation [3]. The developed computational components were executed efficiently as a custom-built application using the CrossGrid infrastructure, thus helping scientists carry out their scientific processing flows and run their analyses on both local and distributed resources. Virtual organization members with access to the resources the tasks were distributed to can reuse, share, and modify a process flow once it has been developed.

Workflows as system science language

An increasing number of computational tools for distributed computing in science have become available in recent years. However, they’re mostly at an infrastructural level, making it difficult for the domain scientist to use them. Scientific workflow environments [11,12] improve this situation by allowing scientists to use different tools and technologies in a user-friendly, visual programming environment. These environments provide domain-independent, customizable GUIs for combining different e-science technologies along with efficient methods for using them, thus increasing efficiency and promoting scientific discovery.

A custom-built approach isn’t sufficient for increasingly complex applications. Service-based distributed applications are ideal for automating and generalizing scientific workflows. Researchers can use them to combine data integration, analysis, and visualization steps into larger, automated “knowledge discovery pipelines” and “grid workflows.”

One goal in building ViroLab’s interactive scientific workflow environment was to add flexibility and extensibility, providing service-oriented interfaces through a workbench-style collaborative portal so that those with the right privileges can use the set of applications and data sets. An important issue is for users to be able to register and publish derived data and processes and to keep track of the provenance of information flowing through the generated pipelines, as well as accessing existing (patient and scientific literature) data and acquiring new data from scientific instruments. These domain-independent features can then be customized by adding domain-specific components and semantic annotation of the components and data being used.

Semantic assistance

To automate the construction of workflow applications, the system needs to generate ontological descriptions of services, system components, and their infrastructure. The OntoGrid project (www.ontogrid.net) and the Knowledge-Based Workflow System for Grid Applications (www.kwfgrid.net) both demonstrate these abilities. Semantic data usually is stored as a registry that contains Web Ontology Language (OWL) descriptions of service class functionality, instance properties, and performance records. The user provides a set of initial requirements about the workflow use, then the system builds an abstract workflow using the knowledge about services’ functionality that service providers have supplied to the registry.

Subsequently, the system must apply semantic information on service properties, which results from analyzing the monitoring data of services and resources, to steer running workflows that still have multiple possibilities of concrete Web service operations. The system can select the preferable service class by comparing semantic descriptions of the available service classes and matching the classes’ features to the actual requirements.

PRELIMINARY RESULTS

ViroLab uses statistical and immunological models to study the dynamics of the HIV populations and molecular dynamics models to study drug affinities, in addition to rule-based and parameter-based decision support. To enhance the analysis of highly dimensional, complex data, we added cellular automata and molecular dynamics modeling of HIV infection and AIDS onset to ViroLab.

HIV simulation

ViroLab uses a mesoscopic model to study the HIV infection’s evolution and the onset of AIDS. The model takes into account the global features of the immune response to any pathogen, HIV’s fast mutation rate, and a fair amount of spatial localization, which can occur in the lymph nodes. Ordinary (or partial) differential equation models can’t sufficiently describe the two extreme timescales involved in HIV infection (days and decades) or the implicit spatial heterogeneity.

To study the dynamics of drug therapy for HIV infection, we developed a nonuniform cellular automata model that simulates four phases: acute, chronic, drug treatment response, and AIDS onset. Researchers also can use this model to study three different drug therapies: monotherapy, combined drug therapy, and highly active antiretroviral therapy. Our model for predicting the immune system’s temporal behavior to drug therapy qualitatively corresponds to clinical data [13].

Biostatistics

The biostatistical analysis of the HIV-1 genotype data sets aims to identify patterns of mutations (or naturally occurring polymorphisms) associated with resistance to antiviral drugs and to predict the degree of in vitro or in vivo sensitivity to available drugs from an HIV-1 genetic sequence. Analyzing this highly dimensional data presents a statistical challenge [14].

Directly applying well-known mathematical approaches to analyze the HIV-1 genotype results in many problems stemming from the fact that in HIV DNA analysis, relevant mutations (a set of mutations associated specifically with the drug resistance) are the main scope of interest. These mutations might exist in different positions over the amino-acid chains. Moreover, the sheer complexity of the disease and data requires development of a reliable statistical technique for its analysis and modeling [4].

DSS and presentation

The output of our initial ViroLab version consists of a prediction of the virus’s drug sensitivity generated by comparing the viral genotype to a relational database containing a large number of phenotype-genotype pairs. The decision software interprets a patient’s genotype by using rules developed by experts on the basis of the literature, taking into account the relationship of the genotype and phenotype. In addition, the output is based on available data from clinical studies and on the relationship between the presence of particular mutations and the clinical outcome.

Researchers can use a Proxy and Java 2 Micro Edition method to access ViroLab from mobile devices, thus lowering system access barriers. A mininavigator script communicates patient data with the remote server, where the ranking takes place.
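The cellular automata approach described under HIV simulation can be illustrated with a deliberately tiny example. The sketch below is a toy, not the authors' nonuniform model: the three cell states, the eight-neighbor infection rule, the periodic square grid, and the `drug_efficacy` and `p_regen` parameters are all assumptions chosen for brevity. It only shows the general mechanism of local update rules on a lattice, with drug therapy modeled as a probability of blocking a new infection.

```python
import random

HEALTHY, INFECTED, DEAD = 0, 1, 2


def step(grid, rng, drug_efficacy=0.0, p_regen=0.0):
    """Advance a square grid by one time step (periodic boundaries).

    - A healthy cell with at least one infected neighbor becomes infected,
      unless the drug blocks it (probability `drug_efficacy`).
    - An infected cell dies on the next step.
    - A dead cell is replaced by a healthy one with probability `p_regen`.
    """
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            cell = grid[i][j]
            if cell == HEALTHY:
                exposed = any(
                    grid[(i + di) % n][(j + dj) % n] == INFECTED
                    for di in (-1, 0, 1)
                    for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0)
                )
                if exposed and rng.random() >= drug_efficacy:
                    new[i][j] = INFECTED
            elif cell == INFECTED:
                new[i][j] = DEAD
            elif rng.random() < p_regen:
                new[i][j] = HEALTHY
    return new


def count(grid, state):
    """Number of cells currently in `state`."""
    return sum(row.count(state) for row in grid)
```

Starting from a single infected cell, one step without therapy infects its eight neighbors, whereas a fully effective drug (`drug_efficacy=1.0`) leaves them healthy. Passing an explicit `random.Random(seed)` keeps stochastic runs reproducible.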

With the increasing availability of genetic information and extensive patient records, researchers can now study diseases from the DNA level all the way up to medical responses. Resolving the long-standing challenges of individual-based, targeted treatments is coming within reach. It’s necessary to provide integrating technology to the medical doctors and researchers bridging the gaps in multiscale models, data fusion, and cross-disciplinary collaboration. Although the ViroLab research is still in its infancy, results indicate that our personalized drug ranking prototype is viable and extensible. The system remains under development, with new functionalities being added from usability studies in a network of European hospitals. ■

Acknowledgments

The authors thank the ViroLab consortium, in particular, Dr. D. van de Vijver from University Medical Center Utrecht. This research was supported by the Dutch Virtual Laboratory for e-Science project and the European Union ViroLab grant INFSO-IST-027446. We thank Carl Kesselman and Ian Foster for proofreading and offering many valuable suggestions to improve the overall quality of the article.

References

1. A. Finkelstein et al., “Computational Challenges of System Biology,” Computer, May 2004, pp. 26-33.
2. A-L. Barabási, “Taming Complexity,” Nature Physics, Nov. 2005, pp. 68-70.
3. A. Tirado-Ramos et al., “An Integrative Approach to High-Performance Biomedical Problem Solving Environments on the Grid,” Parallel Computing, Sept./Oct. 2004, pp. 1037-1055.
4. P.M.A. Sloot et al., “A Grid-Based HIV Expert System,” J. Clinical Monitoring and Computing, Oct. 2005, pp. 263-278.
5. A.J.G. Hey and A.E. Trefethen, “The Data Deluge: An e-Science Perspective,” F. Berman, G.C. Fox, and A.J.G. Hey, eds., Grid Computing: Making the Global Infrastructure a Reality, Wiley & Sons, 2003, pp. 809-824.
6. I. Foster, C. Kesselman, and S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations,” Int’l J. Supercomputer Applications, fall 2001, pp. 200-222; http://globus.org/alliance/publications/papers/anatomy.pdf.
7. I. Altintas et al., “A Framework for the Design and Reuse of Grid Workflows,” Proc. Scientific Applications of Grid Computing (SAG 04), LNCS 3458, Springer, 2005, pp. 119-132.
8. F. Neubauer, A. Hoheisel, and J. Geiler, “Workflow-Based Grid Applications,” Future Generation Computer Systems, Jan. 2006, pp. 6-15.
9. S.G. Deeks, “Treatment of Antiretroviral-Drug-Resistant HIV-1 Infection,” Lancet, 13 Dec. 2003, pp. 2002-2011.
10. M. Bubak, M. Malawski, and K. Zajac, “Architecture of the Grid for Interactive Applications,” Proc. Int’l Conf. Computational Science, LNCS 2657, Springer, 2003, pp. 207-213; www.crossGrid.org.
11. B. Ludäscher et al., “Scientific Workflow Management and the Kepler System,” Concurrency and Computation: Practice & Experience, Wiley & Sons, 2006, pp. 1039-1065.
12. M. Bubak et al., “Workflow Composer and Service Registry for Grid Applications,” Future Generation Computer Systems, Jan. 2005, pp. 79-86.
13. P.M.A. Sloot, F. Chen, and C.A. Boucher, “Cellular Automata Model of Drug Therapy for HIV Infection,” S. Bandini, B. Chopard, and M. Tomassini, eds., Proc. 5th Int’l Conf. Cellular Automata for Research and Industry (ACRI 02), LNCS 2493, Springer, 2002, pp. 282-293.
14. T.E. Scheetz et al., “Gene Transcript Clustering: A Comparison of Parallel Approaches,” Future Generation Computer Systems, May 2005, pp. 731-735.

Peter M.A. Sloot is a computational sciences professor at the University of Amsterdam’s Informatics Institute. He received a PhD in computer science from the University of Amsterdam. Sloot is a member of the IEEE. Contact him at [email protected].

Alfredo Tirado-Ramos is a PhD candidate at the University of Amsterdam. Tirado-Ramos is a member of the IEEE and the American Medical Informatics Association. Contact him at [email protected].

Ilkay Altintas leads the San Diego Supercomputer Center’s Scientific Workflow Automation Technologies Lab at the University of California, San Diego, and is assistant director of the National Laboratory for Advanced Data Research. Altintas is an external PhD student in computational sciences at the University of Amsterdam’s Informatics Institute. She is a member of the IEEE and the ACM. Contact her at [email protected].

Marian Bubak is an adjunct professor at the Institute of Computer Science and a staff member of the Academic Computer Centre Cyfronet at the AGH University of Science and Technology. He received a PhD in computer science from AGH. He is a member of the CoreGRID Integration Monitoring Committee. Contact him at bubak@agh.edu.pl.

Charles Boucher is an associate professor in the Department of Medical Microbiology at Utrecht University. Boucher received an MD and a PhD from the University of Amsterdam. He is a member of the National Institutes of Health, the International Virology AIDS Clinical Trials Groups, and the International AIDS Society. Contact him at [email protected].
