Number 82, July 2010 ERCIM NEWS European Research Consortium for Informatics and Mathematics www.ercim.eu

Special theme: Computational Biology

Also in this issue:

Keynote: Computational Biology – On the Verge of Widespread Impact by Dirk Evers

European Scene: Scientometry Leading us Astray

R&D and Technology Transfer: Continuous Evolutionary Automated Testing for the Future Internet

Editorial Information

ERCIM News is the magazine of ERCIM. Published quarterly, it reports on joint actions of the ERCIM partners, and aims to reflect the contribution made by ERCIM to the European Community in Information Technology and Applied Mathematics. Through short articles and news items, it provides a forum for the exchange of information between the institutes and also with the wider scientific community. This issue has a circulation of 9,000 copies. The printed version of ERCIM News has a production cost of €8 per copy. Subscription is currently available free of charge.

ERCIM News is published by ERCIM EEIG BP 93, F-06902 Sophia Antipolis Cedex, France Tel: +33 4 9238 5010, E-mail: [email protected] Director: Jérôme Chailloux ISSN 0926-4981

Editorial Board:
Central editor: Peter Kunz, ERCIM office ([email protected])
Local Editors:
Austria: Erwin Schoitsch ([email protected])
Belgium: Benoît Michel ([email protected])
Denmark: Jiri Srba ([email protected])
Czech Republic: Michal Haindl ([email protected])
France: Bernard Hidoine ([email protected])
Germany: Michael Krapp ([email protected])
Greece: Eleni Orphanoudakis ([email protected])
Hungary: Erzsébet Csuhaj-Varjú ([email protected])
Ireland: Ray Walsh ([email protected])
Italy: Carol Peters ([email protected])
Luxembourg: Patrik Hitzelberger ([email protected])
Norway: Truls Gjestland ([email protected])
Poland: Hung Son Nguyen ([email protected])
Portugal: Paulo Ferreira ([email protected])
Spain: Christophe Joubert ([email protected])
Sweden: Kersti Hedman ([email protected])
Switzerland: Harry Rudin ([email protected])
The Netherlands: Annette Kik ([email protected])
United Kingdom: Martin Prime ([email protected])
W3C: Marie-Claire Forgue ([email protected])

Contributions
Contributions must be submitted to the local editor of your country.

Copyright Notice
All authors, as identified in each article, retain copyright of their work.

Advertising
For current rates and conditions, see http://ercim-news.ercim.eu/ or contact [email protected]

ERCIM News online edition
The online edition is published at http://ercim-news.ercim.eu/

Subscription
Subscribe to ERCIM News by sending email to [email protected] or by filling out the form at the ERCIM News website: http://ercim-news.ercim.eu/

Next issue
October 2010, Special theme: “Cloud Computing”

Cover image: Three-dimensional homology model of the human Kinesin-like protein KIF3A (see the article "Protein Homology Modelling - Providing Three-dimensional Models for Proteins where Experimental Data is Missing" by Jürgen Haas and Torsten Schwede on page 20).

Keynote

Dr. Dirk J. Evers
Director, Computational Biology
Illumina, Inc.

Computational Biology – On the Verge of Widespread Impact

Computational Biology is defined as the science of understanding complex biological phenomena by the analysis of multi-sample and multi-variate quantitative data. With the advent of high-throughput measuring devices, particularly the new generation of DNA sequencers, the cost of generating data has dropped several orders of magnitude during the last 10 years. There is no sign of this trend slowing down. What will be the effect of cheap, fast, simple, robust, high-quality DNA sequencing? Computational Biology applications will become ubiquitous. It is well known that Biotech and Pharma have been utilizing Computational Biology approaches successfully for some time, but now we see the technology and research spreading into industries such as Environment Management, Agrotech, Biofuels, Personalized Medicine and Consumer Genomics, to name just a few.

In preparing this keynote I perused keynote contributions in previous editions of ERCIM News. Here are some of the recent topics: Simulating & Modelling, Digital Preservation, Towards Green ICT, Future Internet Technology, The Sensor Web, Safety-Critical Software, and Supercomputing at Work. It is very clear that the advances described therein are of fundamental importance if Computational Biology and Molecular Biology are to achieve their full impact in the coming years. We will succeed only if we can acquire samples ubiquitously, simulate and model precisely, and compute efficiently and safely. We must also archive the high volumes of data efficiently while providing authorized and fast access, and present the data via the Internet in a meaningful way to all interested parties. Computational Biology is an interdisciplinary science by definition. Nonetheless, it is clear that the future complexity of this scientific area and its technology goes well beyond the core definition given above.

There is a strong European tradition of Computational Biology and Bioinformatics going back to the inception of the first Bioinformatics curriculum by Prof. Robert Giegerich (Bielefeld University, Germany) in the late 1980s, which I was lucky enough to attend. Of course, research in Mathematics and Informatics for Biology and Biochemistry goes back much further than that, with too many prominent examples to single out any one specifically. EU funding organizations will need to take great care and oversight to compete successfully with North America and China in this field.

To be clear, in the next decade the impact of this science on government decisions as well as personal lifestyle and medical decisions will be profound. The EU, governments and societies will need to ensure that they are prepared for future advances in high-throughput instrumentation and Computational Biology. Research organisations and industry will have to provide safe technology and robust data interpretation. Educational institutions will need to update curricula to enable graduates, medical doctors, and scientists to harness the new technology and scientific findings at their disposal.

Computational Biology Scientists beware! Your science is leaving the safe haven of Research-Use-Only for applications of widespread public use.

Dirk J. Evers

Contents

2 Editorial Information

KEYNOTE

3 Computational Biology – On the Verge of Widespread Impact
by Dirk J. Evers, Director, Computational Biology, Illumina, Inc.

JOINT ERCIM ACTIONS

6 Fourth ERCIM Workshop on eMobility
by Torsten Braun

7 SERENE 2010 - Second International Workshop on Software for Resilient Systems
by Nicolas Guelfi

EUROPEAN SCENE

8 Scientometry Leading us Astray
by Michal Haindl

9 ERCIM “Alain Bensoussan” Fellowship Announcement

SPECIAL THEME

Computational Biology
coordinated by Gunnar Klau, CWI, The Netherlands, and Jacques Nicolas, INRIA, France

Introduction to the Special Theme

10 Computational Biology
by Gunnar Klau and Jacques Nicolas

Invited articles

12 Pathway Signatures
by Harvey J. Greenberg

13 Oops I Did it Again... Sustainability in Sequence Analysis via Software Libraries
by Knut Reinert

Sequences and next generation sequencing

14 Application of Graphic Processors for Analysis of Biological Molecules
by Witold R. Rudnicki and Łukasz Ligowski

16 Analyzing the Whole Transcriptome by RNA-Seq Data: The Tip of the Iceberg
by Claudia Angelini, Alfredo Ciccodicola, Valerio Costa and Italia De Feis

17 Reliable Bacterial Genome Comparison Tools
by Eric Rivals, Alban Mancheron and Raluca Uricaru

Structural biology

19 ModeRNA builds RNA 3D Models from Template Structures
by Magdalena Musielak, Kristian Rother, Tomasz Puton and Janusz M. Bujnicki

20 Protein Homology Modelling - Providing Three-dimensional Models for Proteins where Experimental Data is Missing
by Jürgen Haas and Torsten Schwede

Drug discovery

22 Modelling of Rapid and Slow Transmission Using the Theory of Reaction Kinetic Networks
by Dávid Csercsik, Gábor Szederkényi and Katalin M. Hangos

23 SCAI-VHTS - A Fully Automated Virtual High Throughput Screening Framework
by Mohammad Shahid, Torbjoern Klatt, Hassan Rasheed, Oliver Wäldrich and Wolfgang Ziegler

25 Searching for Anti-Amyloid Drugs with the Help of Citizens: the “AMILOIDE” Project on the IBERCIVIS Platform
by Carlos J. V. Simões, Alejandro Rivero and Rui M. M. Brito

Understanding gene expression

26 Swarm Intelligence Approach for Accurate Gene Selection in DNA Microarrays
by José García-Nieto and Enrique Alba

28 From Global Expression Patterns to Gene Co-regulation in Brain Pathologies
by Michal Dabrowski, Jakub Mieczkowski, Jakub Lenart and Bozena Kaminska

Gene networks

30 Testing, Diagnosing, Repairing, and Predicting from Regulatory Networks and Datasets
by Torsten Schaub and Anne Siegel

31 MCMC Network: Graphical Interface for Bayesian Analysis of Metabolic Networks
by Eszter Friedman, István Miklós and Jotun Hein

Modelling and simulation of biological systems

33 Drug Dissolution Modelling
by Niall M. McMahon, Martin Crane and Heather J. Ruskin

34 Computational Modelling and the Pro-Drug Approach to Treating Antibiotic-Resistant Bacteria
by James T. Murphy, Ray Walshe and Marc Devocelle

36 Computational Systems Biology in BIOCHAM
by François Fages, Grégory Batt, Elisabetta De Maria, Dragana Jovanovska, Aurélien Rizk and Sylvain Soliman

37 Prediction of the Evolution of Thyroidal Lung Nodules Using a Mathematical Model
by Thierry Colin, Angelo Iollo, Damiano Lombardi and Olivier Saut

39 Dynamical and Electronic Simulation of Genetic Networks: Modelling and Synchronization
by Alexandre Wagemakers and Miguel A. F. Sanjuán

40 Formal Synthetic Immunology
by Marco Aldinucci, Andrea Bracciali and Pietro Lio'

Epidemiology

42 An Ontological Quantum Mechanics Model of Influenza
by Wah-Sui Almberg and Magnus Boman

43 Epidemic Marketplace: An e-Science Platform for Epidemic Modelling and Analysis
by Fabrício A. B. da Silva, Mário J. Silva and Francisco M. Couto

45 Cross-Project Uptake of Biomedical Text Mining Results for Candidate Gene Searches
by Christoph M. Friedrich, Christian Ebeling and David Manset

Text mining, semantic technologies, databases

47 Ontologies and Vector-Borne Diseases: New Tools for Old Illnesses
by Pantelis Topalis, Emmanuel Dialynas and Christos Louis

48 Unraveling Hypertrophic Cardiomyopathy Variability
by Catia M. Machado, Francisco Couto, Alexandra R. Fernandes, Susana Santos, Nuno Cardim and Ana T. Freitas

R&D AND TECHNOLOGY TRANSFER

50 Continuous Evolutionary Automated Testing for the Future Internet
by Tanja E. J. Vos

52 Robust Image Analysis in the Evaluation of Gene Expression Studies
by Jan Kalina

53 Autonomous Production of Sport Video Summaries
by Christophe De Vleeschouwer

54 IOLanes: Advancing the Scalability and Performance of I/O Subsystems in Multicore Platforms
by Angelos Bilas

55 InfoGuard: A Process-Centric Rule-Based Approach for Managing Information Quality
by Thanh Thoa Pham Thi, Juan Yao, Fakir Hossain and Markus Helfert

57 Integration of an Electrical Vehicle Fleet into the Power Grid
by Carl Binding, Olle Sundström, Dieter Gantenbein and Bernhard Jansen

59 Graph Transformation for Consolidation of Sessions Results
by Peter Dolog

60 Extracting Information from Free-text Mammography Reports
by Andrea Esuli, Diego Marcheggiani and Fabrizio Sebastiani

61 Parental Control for Mobile Devices
by Gabriele Costa, Francesco la Torre, Fabio Martinelli and Paolo Mori

63 Preschool Information When and Where you Need it
by Stina Nylander

EVENTS

64 Announcements

IN BRIEF

66 Spanish Pilot Project for Training and Development of Semantic Technology Applied to Systems Biology

67 VISITO Tuscany

67 Van Dantzigprijs for Peter Grünwald and Harry van Zanten

67 MoTeCo - A New European Partnership for Development in Mobile Technology Competence
Joint ERCIM Actions

Fourth ERCIM Workshop on eMobility

by Torsten Braun

The 4th ERCIM Workshop on eMobility was held at Luleå University of Technology, Sweden on 31 May 2010, the day prior to the Wired/Wireless Internet Communications (WWIC) 2010 conference. The twenty-six participants engaged in very intense discussions, making the workshop a lively event.

As in previous years, the program included both invited presentations and presentations of papers that had been selected after a thorough review process. The focus of the invited presentations was on current EU FP7 projects, in which ERCIM eMobility members are actively involved.

Marc Brogle (SAP Zürich) opened the workshop with a presentation of the ELVIRE (ELectric Vehicle communication to Infrastructure, Road services and Electricity supply) project, which aims to utilize ICT to allow a smooth introduction of electric vehicles. Drivers will be supported in mobility planning and during travelling with an electric car. While ELVIRE started recently, the EU-Mesh project (Enhanced, Ubiquitous, and Dependable Broadband Access using MESH Networks), presented by Vasilios Siris (FORTH), will finish soon. Its goal is to develop, evaluate, and trial software modules for building dependable multi-radio multi-channel mesh networks that provide ubiquitous and high-speed broadband access. Another project in the area of wireless networks has been SOCRATES (Self-Optimisation and self-ConfiguRATion in wirelEss networkS), presented by Hans van den Berg (TNO). This project aims to use self-optimisation, self-configuration and self-healing to decrease operational expenditure by reducing human involvement in network operation tasks, while optimising network efficiency and service quality. Two of the presented projects focussed on wireless sensor networks. The first, called GINSENG and presented by Marilia Curado (U Coimbra), investigates technological challenges of wireless sensor networks in industrial environments such as oil refineries. The second, called WISEBED (Wireless Sensor Network Testbeds) and presented by Torsten Braun (U Bern), developed a pan-European wireless sensor network, currently consisting of several hundred sensor nodes, which can be accessed by sensor network researchers via the Internet. Finally, Yevgeni Koucheryavy (Tampere UT) gave a talk on the recently established COST Action IC 0906 on “Wireless Networking for Moving Objects”, an initiative launched by ERCIM eMobility members.

In addition to the invited talks, three technical sessions were held in the afternoon. Twelve out of 16 submitted long, short, and abstract papers were selected. The first session on “Mobile Networks” looked into various aspects of vehicle-to-vehicle/business communication and scheduling on LTE uplinks. Another talk presented a testbed to evaluate IMS (IP Multimedia Subsystem) implementations, whilst a separate session was devoted to IMS. One presentation investigated issues of the integration of IMS and digital IP-TV services, whilst another focussed on the integration of IMS with Web 2.0 technologies in order to support communications between family members. The final session focused on “Wireless Networks and Geocasting”. Topics included: routing and reliability in mobile ad-hoc networks, geocasting for vehicular ad-hoc networks, deployment support for wireless mesh networks, realistic traffic models for WiMAX or cellular networks, as well as resource discovery in cognitive wireless networks.

A nice social event attended by nearly all participants under the bright evening sun in North Sweden concluded the event, which was perfectly organized by the local hosts under the guidance of Evgeny Osipov. Again, printed workshop proceedings could be handed out to all participants.

Photo: Workshop participants.

Link: http://wiki.ercim.eu/wg/eMobility

Please contact:
Torsten Braun
ERCIM e-Mobility Working Group coordinator
Bern University / SARIT, Switzerland
E-mail: [email protected]

SERENE 2010 - Second International Workshop on Software Engineering for Resilient Systems

by Nicolas Guelfi

The SERENE 2010 workshop, organized by the ERCIM Working Group on Software Engineering for Resilient Systems, was held in London at Birkbeck College on 13-16 April 2010. The workshop was co-chaired by Dr. Giovanna Di Marzo Serugendo, Computer Science and Information Systems, Birkbeck, University of London, and Dr. John Fitzgerald, School of Computing Science, Newcastle University. A two-day Spring School on resilience and self-* was organized for the first time ahead of the workshop. An additional event, an evening lecture given by Robin Bloomfield of City University, London, held jointly with BCS-FACS and the UK Safety-Critical Systems Club, was added to the workshop programme and addressed modelling of infrastructure interdependencies and their impact on resilience.

By category, we had twelve technical papers, one industrial/experience paper, and three project papers. No papers were submitted in the student and tools categories. We offered four reviews per paper and were met with very high compliance by the programme committee members. We accepted eleven submissions. This permitted a programme of one full day on the technical papers and a half day on the shorter papers.

Two keynotes were organized, both from "non-traditional" fields - one from business information systems by Andreas Roth and one from swarm robotics by Alan F. T. Winfield.

SERENE 2010 had 34 registrations and was a successful event. For SERENE 2011 we received two proposals: one from Professor Vyacheslav Kharchenko, National Aerospace University "KhAI", department of computer systems and communications, to host the next SERENE workshop in Kirovograd (Ukraine); and one from Professor Buchs, University of Geneva, Switzerland. Eventually it was decided that SERENE 2011 will be held in Geneva in September 2011, co-chaired by Professor Didier Buchs and Dr. Elena A. Troubitsyna, Åbo Akademi University, Finland.

Spring school
Thirty participants registered for the Spring School. The four speakers - Paola Inverardi, University of L'Aquila, Italy; Jeff Magee, Imperial College London, UK; Aad van Moorsel, University of Newcastle, UK; and Mauro Pezzè, University of Lugano, Switzerland, and University of Milano Bicocca, Italy - were uniformly excellent, and the level of interaction from participants was good. Several of the invited speakers remarked on the collaborative atmosphere. The event has helped to position SERENE better with the invited speakers and with student and early-career scientists. This event shows a real engagement of the ERCIM Working Group with the discipline. It should be renewed for future editions.

Link: http://serene2010.uni.lu/

Please contact:
Nicolas Guelfi
SERENE Working Group coordinator
Luxembourg University
E-mail: [email protected]

ERCIM-sponsored Conferences

Joint MFCS & CSL 2010
The 35th International Symposium on Mathematical Foundations of Computer Science (MFCS 2010) is organized in parallel with the 19th EACSL Annual Conference on Computer Science Logic (CSL 2010), 23-27 August 2010. The federated MFCS & CSL 2010 conference and its satellite workshops will be held at the Faculty of Informatics of Masaryk University, which is part of the Czech ERCIM member CRCIM.
More information: http://mfcsl2010.fi.muni.cz/mfcs

ECCV 2010
The 11th European Conference on Computer Vision (ECCV 2010) will be hosted by the Greek ERCIM member FORTH, Crete, Greece, from 5-11 September 2010. ECCV is a selective single-track conference on computer vision where high-quality previously unpublished research contributions on any aspect of computer vision will be presented.
More information: http://www.ics.forth.gr/eccv2010/

SAFECOMP 2010
SAFECOMP, the International Conference on Safety, Reliability and Security of Computer Systems, is held this year from 14-17 September in Vienna, at the Schönbrunn Palace Conference Center, a very attractive venue. Co-organizer is the ERCIM Working Group on Dependable Embedded Software-Intensive Systems. ERCIM and the Austrian ERCIM member AARIT are co-sponsors of the event and represented with a booth.
More information: http://www.ocg.at/safecomp2010

European Scene

Opinion

Scientometry Leading us Astray

by Michal Haindl

The entire spectrum of recent scientific research is too complex to be fully and qualitatively understood and assessed by any one person; hence, scientometry may have its place, provided it be applied with common sense, being constantly aware of its many limits. However, scientometry, when applied as the sole basis for evaluating performance and funding of research institutions, is our open admission that we are unable to recognize an excellent scientist even if he is working nearby.

In recent years Czech scientists have been evaluated by the powerful Research and Development Council (RDC), an advisory body to the Government of the Czech Republic. The RDC processes regular annual analyses and assessments of research and development work and proposes financial budgets for single Czech research entities. The RDC's overambitious aim was to produce a single unique universal numerical evaluation criterion, a magic number used to identify scientific excellence and to determine the distribution of research funding. The resulting simplistic scientometric criterion is shown in Figure 1.

         10
    ϑ =  Σ  n_i [c_i ω_i + d_i]
        i=1

     i  result type                                  ω_i  d_i  c_i
     1  impacted journal                             295  10   c_i ∈ <0; 1>
     2  EU/US/JP patent | Nature, Science,
        Proc.Nat.Acad.Sci.                           500  0    1
     3  acknowledged journal (Scopus / ERIH)         8    0    1.5 (h) | 1
     4  reviewed journal                             4    0    2.5 (h) | 1
     5  book (world lang.) | F,G,H,N,L,R |
        other patent                                 40   0    1
     6  book                                         20   0    2 (h) | 1
     7  proceeding article                           8    0    1
     8  CZ/national patent                           200  0    1
     9  pilot plant                                  100  0    1
    10  classified research report                   50   0    1

Figure 1: RDC scientometric criterion, where n_i is the number of results of type i; h denotes the humanities; and F, G, H, N, L and R denote design, prototype, legal norm, methodology, map and software results, respectively. The value of the coefficient c_i depends on a journal's impact rank inside its discipline.

The RDC noticed difficulties in the mutual comparison of humanities, natural, and technical sciences, so they introduced a corrective coefficient (c_i) favouring less prestigious journals and books in the humanities. Apart from the fact that it is impossible to compare a range of diverse disciplines using a single evaluation criterion, this system ignores some common assessment criteria (eg citations, grant acquisition, prestigious keynotes, comparison with the state-of-the-art, editorial board memberships, PhDs). Furthermore, the single weights (ω_i) are unsubstantiated, and acquired scores cannot be independently verified.

Negative consequences have been revealed over the last three years. Significant rank changes indicate that scientific excellence can easily be simulated with a cunning tactic. For example, eight unverified software pieces (category R) will outweigh a scientific breakthrough published in the most prestigious high-impact journal within the research area. Some research entities have already opted for such an easy route to the government research pouch.

Publication cultures vary between disciplines; while some prefer top conferences (computer graphics), elsewhere prestige goes to journals (pattern recognition) or books (history and the arts). Some disciplines have one or two authors per article (mathematics) while others typically have large author teams (medicine). These factors are not taken into consideration by the current system.

Finally, this system recklessly ignores significant differences between the financing systems of different types of research entities. While universities have a major source of financing from teaching, the 54 research institutes of the Academy of Sciences of the Czech Republic (ASCR) rely solely on research. Thus last year's application of the scientometric criterion resulted in anticipated negative consequences for the Czech research community in general, but especially for its most effective contributor - the Academy of Sciences. ASCR, with only 12% of the country's research capacity, produces 38% of the country's high-impact journal papers, 43% of all citations, and 30% of all patents. Nevertheless the criterion led to a 45% drop in funding for ASCR in 2012. It is little wonder, then, that our research community has become destabilized, that researchers, for the first time in history, have participated in public rallies, and that young researchers are losing motivation to pursue scientific careers. Over 70 international and national research bodies have issued strong protests against the government's policies (see details on http://ohrozeni.avcr.cz/podpora/), and several governmental round tables have been held in an attempt to solve this crisis.

What is the moral from the described scientometric experiment? Any numerical criterion requires humble application, careful feedback on its parameters, and wide adoption by the scientific community. We must never forget that any criterion is only a very approximate, and considerably limited, model of scientific reality and will never be an adequate substitute for peer review by true experts. In order to avoid throwing out the baby with the bath water we should consult any criterion on a discipline-specific basis, and only as an auxiliary piece of information. A criterion may be useful to compare extreme scientific performance, but is hardly suitable for assessing subtle differences or breakthrough scientific achievements. Whilst a single number has alluring simplicity, it also carries pricey long-lasting consequences. Excellent research teams are easy to damage by ill-conceived scientometry, but much harder to rebuild.

Links:
http://www.vyzkum.cz/
http://ohrozeni.avcr.cz/podpora/

Please contact:
Michal Haindl - CRCIM (UTIA)
Tel: +420-266052350
E-mail: [email protected]

ERCIM "Alain Bensoussan" Fellowship Programme

Application deadlines: twice per year, 30 April and 30 September

Photo by courtesy of INRIA.

Who can apply?
The Fellowships are available for PhD holders from all over the world. ERCIM encourages not only researchers from academic institutions to apply, but also scientists working in industry.

Why apply?
The Fellowship Programme enables bright young scientists from all over the world to work on a challenging problem as Fellows of leading European research centres. The programme offers ERCIM Fellows the opportunity:
- to work with internationally recognized experts
- to improve their knowledge about European research structures and networks
- to become familiarized with working conditions in leading European research centres
- to promote cross-fertilisation and cooperation, through the Fellowships, between research groups working in similar areas in different laboratories.

What is the duration?
Fellowships are generally of 18 month duration, spent in two of the ERCIM institutes. In particular cases a Fellowship of 12 month duration spent in one institute might be offered.

How to apply?
Applications have to be submitted online. The application form will be available one month prior to the application deadline at http://www.ercim.eu/activity/fellows/

Which topics/disciplines?
The Fellowship Programme focuses on topics defined by the ERCIM working groups and projects managed by ERCIM, such as Biomedical Informatics, Computing and Statistics, Constraints Technology and Applications, Embedded Systems, Digital Libraries, Environmental Modelling, E-Mobility, Formal Methods, Grids, Security and Trust Management, and many other areas in which ERCIM institutes are active.

http://www.ercim.eu/activity/fellows/

Special Theme: Computational Biology

Computational Biology

Introduction to the Special Theme

by Gunnar W. Klau and Jacques Nicolas

In the life sciences, conventional wet lab experimentation is being increasingly accompanied by mathematical and computational techniques in order to deal with the overwhelming complexity of living systems. Computational biology is an interdisciplinary field that aims to further our understanding of living systems at all scales. New technologies lead to massive amounts of data as well as to novel and challenging research questions, and more and more biological processes are analyzed and modelled with the help of mathematics and can be simulated in silico.

Computational biology is one side of a two-sided domain and generally refers to the fundamental research field. The term bioinformatics is frequently used for the more engineering-oriented side of the domain and deals with the production of practical software enabling original analysis of biological data. This ERCIM News special theme starts with two invited articles that reflect this complementarity. Harvey Greenberg presents his view on the positive contribution of Operations Research (OR) to biology. OR is in itself a powerful, interdisciplinary domain that led, among many other important contributions to computational biology, to the recent concept of pathway signatures. Knut Reinert describes the importance of good practices in bioinformatics by means of library design for sequence analysis. In fact, the next European infrastructure in bioinformatics, which is currently discussed in the project ELIXIR (European Life Sciences Infrastructure for Biological Information), has put emphasis on the necessity of shared efforts for setting common standards and tools for data management.

As already mentioned in the keynote by Dirk Evers, we currently observe a neat evolution of the research field that results from the maturation and large diffusion of high-throughput technologies in biological laboratories. New types of data have to be taken into account: researchers are becoming increasingly aware of the importance of the RNA world and epigenomics in explaining cellular behaviour that cannot be understood solely by purely genetic aspects. Metabolomics is the first method developed that is capable of obtaining high-throughput data at the phenotypic level. Examples of joint European research efforts in this direction include the European Research Council Advanced Grants “RIBOGENES” (The role of noncoding RNA in sense and antisense orientation in epigenetic control of rRNA genes) and the FP7 project “METAcancer” (Identification and validation of new breast cancer biomarkers based on integrated metabolomics).

These recent developments make more systemic approaches possible, where several methods and sources of data have to be combined. This has an impact on the importance of developing data integration environments, adding semantics to observations (ontologies, text mining) and offering sophisticated navigation utilities through the web. It has also fostered a number of studies in high-performance computing, such as distributed or grid computing or the use of graphical processing units. These techniques are no longer restricted to demanding applications in structural modelling but have become necessary in many other domains of computational biology.

The table relates the biological topics of the special theme (High-throughput technologies; Systems biology, dynamical networks; Diseases; Drugs and gene mining; Structural biology) to the mathematical and computational techniques involved:

Mathematical modelling and simulation
- Systems biology, dynamical networks: Csercsik et al, p. 22; Friedman et al, p. 31; Wagemakers et al, p. 39
- Diseases: Almberg et al, p. 42; Colin et al, p. 37
- Drugs and gene mining: McMahon et al, p. 33
- Structural biology: Bujnicki et al, p. 19

Statistics and optimization
- High-throughput technologies: Angelini et al, p. 16
- Systems biology, dynamical networks: Greenberg, p. 12
- Diseases: Aldinucci et al, p. 40
- Structural biology: Bujnicki et al, p. 19

High-performance computing
- High-throughput technologies: Rudnicki et al, p. 14
- Diseases: Simões et al, p. 25
- Drugs and gene mining: Shahid et al, p. 23
- Structural biology: Simões et al, p. 25

Data and metadata integration
- High-throughput technologies: Dabrowski et al, p. 28
- Systems biology, dynamical networks: Dabrowski et al, p. 28
- Diseases: Topalis et al, p. 47; Freitas et al, p. 48; da Silva et al, p. 43
- Drugs and gene mining: Friedrich et al, p. 45

Sequence and graph algorithms
- High-throughput technologies: Rivals, p. 17; Reinert, p. 13
- Drugs and gene mining: Simões et al, p. 25
- Structural biology: Bujnicki et al, p. 19; Haas et al, p. 20; Simões et al, p. 25

Artificial intelligence
- High-throughput technologies: García-Nieto et al, p. 27
- Systems biology, dynamical networks: Fages et al, p. 36; Schaub et al, p. 30
- Diseases: Aldinucci et al, p. 40
- Drugs and gene mining: Murphy et al, p. 34

Many innovative methods in the field come from health applications. A number of cohort studies are currently being carried out, for example within the 1000 genomes project or the Apo-sys (Apoptosis Systems Biology Applied to Cancer and AIDS) project. For the first time, they will give access to an in-depth study of the correlations between a variety of health-related factors like human individual genomic variations, nutrition, environment and diseases. Deciphering the relationships between genotype and phenotype is a major challenge for the coming years that researchers are just starting to explore. Advances will be obtained by more connections between distant research fields, and techniques from image analysis and data mining can be expected to play increasing roles in computational biology in the future.

In addition, such global studies will more generally be beneficial for the fields of population genetics and ecology. Sets of cells in a tissue and sets of bacteria in a selected habitat can now be studied at the finest level of molecular interaction, creating pressure to develop research on multi-scale modelling and model reduction techniques. Here, examples of European projects are MetaHIT and MetaExplore for the metagenomics of the human intestinal tract and the study of the enzymatic potential in cryptic biota, respectively.

This special theme features a selection of articles that covers a number of areas of Computational Biology and provides a nice snapshot of the research variety carried out in this area in Europe. In addition to the two invited contributions, it contains 23 short articles on new approaches, frameworks and applications in the domain of computational biology. From the perspective of mathematics and computer science, the articles cover topics such as "mathematical modelling and simulation", "statistics and optimization", "high-performance computing", "data and metadata integration", "sequence and graph algorithms" and "artificial intelligence". From the perspective of biology, the covered topics include: "high-throughput technologies", "systems biology, dynamical networks", "diseases", "drugs and gene mining", and "structural biology". The overview above presents this two-dimensional scheme.

Links:
1000 genomes: http://www.1000genomes.org
ELIXIR: http://www.elixir-europe.org
METAcancer: http://www.metacancer-fp7.eu/
Apo-sys: http://www.apo-sys.eu/
MetaHIT: http://www.metahit.eu/
MetaExplore: http://www.rug.nl/metaexplore/

Please contact:
Gunnar Klau
CWI, The Netherlands
E-mail: [email protected]

Jacques Nicolas
INRIA, France
E-mail: [email protected]

ERCIM NEWS 82 July 2010 11 Special Theme: Computational Biology

Pathway Signatures

by Harvey J. Greenberg

How can operations research (OR), traditionally applied to management problems, help us to understand biological systems? OR emerged from WW II as not only a grab-bag of methods, but more importantly as a multi-disciplinary approach to problem-solving. Modern systems biology shares those same problem attributes, and the OR community is increasingly contributing to this frontier of medical research.

I entered computational biology when I visited Sandia National Laboratories (SNL) in 2000. Seeing the work of SNL researchers and learning about the problems showed me how operations research applies, particularly mathematical programming. This became the focus, not only of my new research, but also of my teaching and service. The following year I created the University of Colorado Center for Computational Biology and initiated a series of workshops and new courses in the departments of Mathematics, Computer Science and Engineering, and Biology. I was the main beneficiary! I continued to learn through research, teaching, and many new collaborations, particularly within the CU medical research community. In 2003 I visited Bernhard Palsson's Systems Biology Research Institute at San Diego and later visited Leroy Hood's Institute for Systems Biology. That is when I learned about systems biology and how much more OR can contribute. It is perhaps fortuitous that when I was a student, OR was almost synonymous with systems engineering (which has a different meaning in today's world of computers).

One trick I learned from OR practice is to turn questions around: "Why is this true?" We do not fully understand why certain pathways are used, and research has focused on estimating parameters of biochemical interactions among molecules. Turning this around, I ask, "Why is this particular pathway used?" This gave rise to what I call pathway signatures – the natural circumstances that trigger one particular pathway but not others.

I set up a mathematical program and identified pathways that are optimal for a particular objective (defined by parameters). An example of an objective is to maximize ATP production in a metabolic network; another is to maximize reliability (probability of successful regulation) in a cell-signaling network. A simplified version is to choose the cone of linear coefficients associated with an extreme pathway for which that pathway is uniquely optimal, hence its signature. Among the infinitely many, I favored certain properties (from biological principles) and chose signature values that minimize total similarity between pairs of extreme pathways. This is key – the dissimilarity strengthens the idea of a signature, separating one pathway from the others.

Non-extreme pathways have more than one representation as combinations of extremes, and I studied approaches to see what signature offers the most insight into natural choices. With help (and data) from Palsson's Lab, I enumerated extreme metabolic pathways (viz., red blood cell), on which I based my original studies. Another pursuit is to consider nonlinear objectives, such as quadratic growth, for which each non-extreme pathway is optimal. Understanding why it is better than any of the extreme pathways that define it might shed light on the underlying biology. From the few experiments I ran, redundancy was a paramount signature criterion. For flux balance analysis this means that the defining input-output matrix of the reactions had metabolite generation/consumption dependent upon others. At the very least, we can use this inference as a measure of approximation error. When the linear approximation is reasonable, we can investigate how the system is affected by the imbalance that occurs without fixed system-outputs.

Another part of the story is the Pathway Inference Problem: infer knowledge about pathways with incomplete information about their parts and interactions. In particular, there has been research into assembly – turning genes into pathways. I turn this into building a highway system and ask, "What junctions and connections provide the most efficient map for travel?" In this sense, pathway inference, or construction, is about optimal routing. Again, the idea of a signature is to find a biologically meaningful objective for which the network is designed (or revised) optimally. Much has been done in road and computer network design using OR with multiple objectives: minimize cost, minimize travel time, and maximize reliability, to name a few. There are many optimization principles at work in the design of life, so the natural laws of economics fit well into the OR descriptive motto: "the science of better."

In conclusion, I see the future of systems biology broadening its network constructions to mixed-scale. The immune system, for example, needs to be represented by interactions among molecules, cells, tissues, and organs. A drug designed to block some pathways or enhance others can assume multiple targets where its behavior depends on the environmental context of the organism. We have seen such [re-]design problems in OR!

Figure 1: Metabolic pathways. From KEGG database http://www.genome.jp/kegg/

Links:
http://gcrg.ucsd.edu/
http://www.systemsbiology.org/

Please contact:
Harvey J. Greenberg, Professor Emeritus
University of Colorado Denver
E-mail: [email protected]
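Greenberg's signature idea — an objective under which one extreme pathway is uniquely optimal — can be illustrated with a toy calculation. The flux vectors and objective weights below are invented purely for illustration; they are not taken from the article or from real metabolic data:

```python
# Toy illustration of the pathway-signature idea (all numbers invented):
# an objective vector c "signs" an extreme pathway p if p is uniquely
# optimal among all extreme pathways under c.

def dot(c, p):
    """Inner product of an objective vector with a pathway flux vector."""
    return sum(ci * pi for ci, pi in zip(c, p))

def signed_pathway(c, pathways):
    """Return the index of the uniquely optimal pathway under c, or None."""
    scores = [dot(c, p) for p in pathways]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None

# Three hypothetical extreme pathways over four reactions (flux values).
pathways = [
    (1.0, 0.0, 2.0, 0.0),
    (0.0, 1.0, 0.0, 2.0),
    (1.0, 1.0, 1.0, 1.0),
]

# An objective weighting reactions 1 and 3 selects pathway 0 uniquely...
print(signed_pathway((1.0, 0.0, 1.0, 0.0), pathways))  # -> 0
# ...while this objective produces a tie, hence no signature.
print(signed_pathway((0.0, 0.0, 1.0, 1.0), pathways))  # -> None
```

A tie under an objective means that the objective carries no signature — which is why the article favours signature values that minimize total similarity between pairs of extreme pathways.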

Oops I Did it Again..... Sustainability in Sequence Analysis via Software Libraries by Knut Reinert

Maybe you like Britney Spears. Maybe even her music. Maybe you are a Britney Spears fan working as part of a group on sequence analysis algorithms in computational biology. But am I right in assuming that you don't like to hear the above song title quoted by your coworkers or programmers when they could have been spending their time doing something productive or creative? If I am right, then you might want to read on, because I will tell you how you, or your coworkers, can avoid reinventing the wheel or writing a lot of inefficient scripts in sequence analysis.

Next generation sequencing (NGS) is a term coined to describe recent technological advances in DNA sequencing. Next generation sequencing allows us to sequence about 200,000,000,000 base pairs within approximately one week. That's a two with many zeros. To explain it in different terms, whilst the human genome project spent many years and billions of dollars to sequence about 30 billion base pairs, we can now perform the equivalent amount of work within a day for a handful of dollars.

DNA as mass data has wonderful properties. While the sequencing machines churn out terabytes of data, a single byte of this data can be very important. It can decide whether you have a disease or not, whether the drugs you take do their job or not, whether you live or die. So we have to treat the data carefully. It is important to scientists working in the life sciences. At the same time, however, we need to process it fast and efficiently; after all, it fills the racks of terabyte disks quickly.

In order to analyse the ever-growing volume of sequencing data it is essential that scientists in the life sciences and bioinformaticians work closely together with scientists in computer science. This can often be problematic since both communities approach the problems at hand quite differently (see Figure 1). While the experimentalists have a holistic top-down view on what they want to achieve in a particular analysis, computer scientists usually work on a specific, small part of a larger analysis problem. On the computer science side this often results in highly efficient, but specialized algorithms that may not necessarily reflect the reality of real world data. Efforts from the life sciences side, in contrast, may result in analysis pipelines that compute a solution to the problem but are not state-of-the-art in run time or memory consumption, and hence cannot be applied to the large data volumes. The goal is obviously to use fast implementations of efficient algorithms to be able to cope with the volume of sequence data that NGS can produce. This can be achieved through algorithm libraries that collect efficient implementations of state-of-the-art algorithms, the work of algorithm designers and skilled programmers, and make them available to the data analysts and bioinformaticians. Apart from the obvious benefit of being able to efficiently compute solutions, the use of software libraries also avoids the Britney Spears effect, namely doing


many things all over again, because many algorithms needed are already available in the library.

For the past seven years the Algorithmic Bioinformatics group at the FU Berlin has been working on a comprehensive algorithm library for sequence analysis. The SeqAn project fills the gap between the experimentalists and algorithm designers. SeqAn is a C++ library and has a unique generic design based on:
• the generic programming paradigm
• a new technique for defining type hierarchies called template subclassing
• global interfaces
• metafunctions, which provide constants and dependent types at compile time.

The design of SeqAn differs from common programming practice; in particular, SeqAn does not use object-oriented programming. The main consequence of this is that SeqAn implements features like polymorphism at compile time, thus avoiding costly run time operations like consulting a lookup table to find the appropriate virtual function, as is necessary in object-oriented programming. This also sets it apart from other frameworks in the field like Galaxy (http://bitbucket.org/galaxy/galaxy-central/wiki/Home), BioJava (http://biojava.org/wiki/Main_Page) and BioPerl (http://www.bioperl.org/wiki/Main_Page). SeqAn is intended to embrace high-performance algorithms from the computer science field and to cover a wide range of topics of sequence analysis. It offers a variety of practical state-of-the-art algorithmic components that provide a sound basis for the development of sequence analysis software. These include:
• data types for storing strings, segments of strings and string sets
• functions for all common string manipulation tasks
• data types for storing gapped sequences and alignments in memory and on disk
• algorithms for computing optimal sequence alignments
• algorithms for exact and approximate (multiple) pattern matching
• algorithms for finding common matches and motifs in sequences
• string index data structures
• graph types for many purposes like automata, alignment graphs, or HMMs
• standard algorithms on graphs.

SeqAn has a growing user community throughout the EU and US and is actively developed by 4-6 scientists, mainly at the FU Berlin. Potential users might be interested in the SeqAn book (see link below). SeqAn is partially supported by the DFG priority program 1307 "Algorithm Engineering".

Figure 1: Top-down versus bottom-up approach. (Figure labels: Experimentalists – analysis pipelines, maintainable tool; Computer scientists – theoretical considerations, algorithm design, prototype implementation, algorithm libraries.)

Links:
SeqAn project: http://www.seqan.de
SeqAn book: http://crcpress.com/product/isbn/9781420076233

Please contact:
Knut Reinert
Algorithmic Bioinformatics, Freie Universität Berlin, Germany
Tel: +49 30 838 75 222
E-mail: [email protected]
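To give a flavour of the kind of reusable component such a library collects, here is a minimal string index — a suffix array with binary-search lookup. This is a toy stand-in in Python only; SeqAn's actual index structures are C++ and far more sophisticated:

```python
# Minimal string index: a suffix array plus binary-search pattern lookup.
# A toy stand-in for the index data structures a sequence-analysis library
# provides; real implementations build the array in O(n log n) or better.
import bisect

def build_suffix_array(text):
    """Indices of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of `pattern` in `text`, via binary search on sa."""
    suffixes = [text[i:] for i in sa]   # O(n^2) space: fine for a toy
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])

dna = "ACGTACGTGACG"
sa = build_suffix_array(dna)
print(find_occurrences(dna, sa, "ACG"))  # -> [0, 4, 9]
```

The point of packaging such components in a library is exactly the one the article makes: every group needs pattern lookup over huge sequences, and almost none should write it themselves.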

Application of Graphic Processors for Analysis of Biological Molecules

by Witold R. Rudnicki and Łukasz Ligowski

Graphic processors are used to improve the efficiency of a computational process in molecular biology in a project carried out by our team at the Interdisciplinary Centre for Mathematical and Computational Modelling of the University of Warsaw, in cooperation with Professor Bertil Schmidt's team from the School of Computer Engineering of the Nanyang Technological University in Singapore. The project aims to increase the efficiency of one of the most important algorithms in bioinformatics, the Smith Waterman algorithm.

The Smith Waterman (SW) algorithm, an algorithm for performing local sequence alignment, is used to establish similarity between biological macromolecules, such as proteins and nucleic acids. This is an exact algorithm giving the best possible result, but is much slower than BLAST (Basic Local Alignment Search Tool) - an alternative heuristic algorithm giving less precise answers but in a fraction of the time. Availability of a fast version of the SW algorithm with efficiency comparable to that of BLAST would enable researchers to take advantage of the greater sensitivity of the SW algorithm in the tasks currently performed with heuristic algorithms. This may be particularly important, for example, for gene assembly from short fragments in the new generation sequencing experiments or high sensitivity search for homologous proteins.

The evolution of graphics processing units (GPU) has resulted in transformation from chips specialised in the limited number of operations required for graphics, to massively parallel processors capable of delivering trillions of floating point operations per second (teraflops). The GPU's very high computational power is best utilised when all threads perform identical tasks requiring many operations on a relatively small amount of data, which can be stored on chip (in registers and a small pool of very fast shared memory). The best practice in GPU programming is therefore to use data parallelism to formulate the problem, load the data to registers and shared memory, and perform all computations utilising registers and shared memory.

The GPU has the potential to be particularly effective for applications such as scanning of large databases of sequences of biological macromolecules, because the GPU works best for data-parallel algorithms with a small but computationally intensive kernel.

Optimisation of the SW algorithm on a central processing unit (CPU) is based on vectorization and improving efficiency of the alignment of a single pair of sequences using some tricks, the efficacy of which is data-dependent. In particular these optimisations work poorly on short sequences, as well as on sets of sequences which are similar to one another. A GPU version of the algorithm, in contrast, performs a more efficient database scan by performing thousands of individual scans in parallel. No data-dependent optimisations are employed, hence the performance of the algorithm is stable both for similar and short sequences.

Our algorithm scans several thousand sequences in parallel. The smallest chunk of sequences which are processed together comprises 256 sequences. These are processed on a single multiprocessor. The chips of current high-end GPUs contain at least 16 identical multiprocessors, therefore a GPU performs alignment of at least 4096 sequences in parallel.

Protein sequences are represented as a string of 20 characters corresponding to 20 types of amino acid. Homology between proteins is inferred from the similarity of their sequences - if sequence similarity is higher than expected by chance then the proteins are homologous. The alignments can contain gaps – these represent mutations which correspond to deletions or insertions in the proteins. The score of the alignment is the sum of the similarity scores for amino acid pairs along the alignment after subtracting the penalties for introducing gaps. The similarity score for a pair of amino acids is proportional to the logarithm of the probability of mutation between these amino acids in proteins.

The Smith Waterman algorithm finds the best local alignment by identifying all possible alignments. From among these it selects the optimal alignment. This is achieved using a dynamic programming approach. First a rectangular matrix is built, then each cell with index [i,j] is filled with a similarity score for the i-th residue of the query sequence and the j-th residue of the target sequence. Then, starting from the upper left corner corresponding to the start of both sequences, one determines the best possible alignment ending in each cell. An appropriate score is then assigned to the cell. To process a cell one needs information on the score and auxiliary variable in three adjacent cells - upper, left and upper-left.

In our algorithm the SW matrix is processed in 12-cell high bands, as shown in Figure 1. Each band is processed column by column. Global memory access is only required twice, once at the entrance to the main loop and once at the exit. The score and auxiliary variable from the preceding column are stored in the shared memory. The algorithm achieves an approximately 100-fold increase in speed compared with a single core of a CPU, and its overall efficiency is on par with BLAST - the heuristic algorithm executed on a CPU.

Figure 1: Processing of the dynamic programming matrix in horizontal bands (the query sequence runs along the vertical axis, the database sequence along the horizontal axis). Cells which are already processed, active cells and cells waiting for processing are displayed in dark gray, red and light gray, respectively. Data in orange cells to the left of active cells is stored in shared memory. Cells containing data accessed at the beginning of the loop are highlighted in deep dark gray.

The sw-gpu algorithm is used as an engine of our similarity search server accessible at http://bioinfo.icm.edu.pl/services/sw-gpu/

We are currently working on using the SW algorithm in an iterative way, analogous to the PSI-BLAST extension of the BLAST algorithm. In this approach the results of the search are used to develop an improved position-specific similarity function, which is then used for a refined search. The procedure is repeated several times, until no new sequences are found. Another application field is the assembling of DNA/RNA sequences from short fragments which are generated in the new sequencing methods (next generation sequencing).

Link:
http://bioinfo.icm.edu.pl/services/sw-gpu/

Please contact:
Witold R. Rudnicki
University of Warsaw, Poland
Tel: +48 22 5540 817
E-mail: [email protected]
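The dynamic programming recurrence described in the article — each cell [i,j] combines its upper, left and upper-left neighbours, with scores clamped at zero so an alignment may start anywhere — can be sketched for a single pair of sequences as below. This toy version uses a simple match/mismatch score and a linear gap penalty, whereas the real algorithm uses a log-odds substitution matrix, and the GPU implementation processes thousands of such matrices in parallel:

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score via the Smith-Waterman recurrence.
    Cell H[i][j] depends on its upper, left and upper-left neighbours;
    clamping at zero lets a local alignment begin at any cell."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if query[i-1] == target[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "TACG"))  # -> 6 (local match "ACG")
```

The filling order makes the data dependencies explicit: each cell needs only three neighbours, which is what allows the banded, shared-memory traversal on the GPU described above.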


Analyzing the Whole Transcriptome by RNA-Seq Data: The Tip of the Iceberg

by Claudia Angelini, Alfredo Ciccodicola, Valerio Costa and Italia De Feis

The recent introduction of Next-Generation Sequencing (NGS) platforms, able to simultaneously sequence hundreds of thousands of DNA fragments, has dramatically changed the landscape of genetics and genomic studies. In particular, RNA-Seq data provides interesting challenges both from the laboratory and the computational perspectives.

Gene transcription represents a key step in the biology of living organisms. Several recent studies have shown that, at least in eukaryotes, virtually the entire length of non-repeat regions of a genome is transcribed. The discovery of the pervasive nature of eukaryotic transcription and its unexpected level of complexity - particularly in humans – is helping to shed new light on the molecular mechanisms underlying inherited disorders, both mendelian and multifactorial, in humans.

Prior to 2004, hybridization and tag-based technologies, such as microarray and Serial/Cap Analysis of Gene Expression, offered researchers intriguing insights into human genetics. Microarray techniques, however, suffered from background cross-hybridization issues and a narrow detection range, whilst tag-based approaches required laborious, time-consuming and costly steps for the cloning of fragments prior to sequencing. Hence, the recent introduction of massively parallel sequencing on NGS platforms has completely revolutionized molecular biology.

RNA-Seq is probably one of the most complex of the various "Seq" protocols developed so far. Quantifying gene expression levels within a sample or detecting differential expressions among samples, with the possibility of simultaneously analysing alternative splicing, allele-specific expression, RNA editing, fusion transcripts and expressed single nucleotide polymorphisms, is crucial in order to study human disease-related traits.

To handle this novel sequencing technology, molecular biology expertise must be combined with a strong multi-disciplinary background. In addition, since the output of an RNA-Seq experiment consists of a huge number of short sequence reads - up to one billion per sequencing run - together with their base-call quality values, terabytes of storage and at least a cluster of computers are required to manage the computational bottleneck.

Recently, the Institute of Genetics and Biophysics (IGB) and the Istituto per le Applicazioni del Calcolo (IAC), both located in the Biotechnological Campus of the Italian National Research Council in Naples, have started a close collaboration on RNA-Seq data which aims to fill the gap between data acquisition and statistical analysis. In 2009 IGB, a former participant in the Human Genome project, acquired the SOLiD system 3.0, one of the first and most innovative platforms for massively parallel sequencing installed in Italy. IAC has great experience in developing statistical and computational methods in bioinformatics and is equipped with two powerful clusters of workstations capable of handling massive computational tasks.

The collaboration started with two pilot whole-transcriptome studies on human cells via massively parallel sequencing, aimed at providing a better molecular picture of the biological system under study and at setting up an efficient computational open-source pipeline for downstream data analysis.

The computational effort focuses on the use of efficient software, the implementation of novel algorithms and the development of innovative statistical techniques for the following tasks:

a) alignment of the short reads to the corresponding reference genome
b) quantifying gene expressions
c) procedures of normalization to compare samples obtained in different runs and assessment of the quality of the experiments
d) identification of novel transcribed regions and refinement of previously annotated ones
e) identification of alternatively spliced isoforms and assessment of their abundance
f) detection of differential gene/isoform expression under two or more experimental conditions
g) implementation of user-friendly interfaces for data and analysis.

Figure 1: An example of the computational pipeline for the analysis of RNA-Seq data.

Each of these tasks requires the integration of currently available tools with the development of new methodologies and computational tools. Despite the unprecedented level of sensitivity and the large amount of data available to provide a better understanding of the human transcriptional landscape, the useful genetic information generated in a single experiment clearly represents only "the tip of the iceberg". Much more research will be needed to complete the picture.

For steps a) and f), we are integrating some open source software into our pipeline. These are well-consolidated phases and there are several methods available in the literature. Points b) – e) are far more difficult as statistical methodologies are still lacking. The mathematical translation of the concept of "gene expression" and its modelling needs to be reassessed since we are now faced with discrete variables; we are now applying innovative methods.

The results we obtain from the computational analysis of the two pilot projects will be validated by quantitative real-time polymerase chain reaction (PCR) and, where deemed crucial for the analysis, the related protein products will be assessed by Western Blot. Biological validation will provide fundamental feed-back for optimizing the parameters of the computational analysis.

Please contact:
Claudia Angelini or Italia De Feis
IAC-CNR, Italy
E-mail: c.angelini|[email protected]

Alfredo Ciccodicola or Valerio Costa
IGB-CNR, Italy
E-mail: ciccodic|[email protected]

Reliable Bacterial Genome Comparison Tools by Eric Rivals, Alban Mancheron and Raluca Uricaru

Some bacterial species live within the human body and its immediate environment, and have various effects on humans, such as causing illness or assisting with digestion, while others colonize a range of other ecological niches. The availability of whole genome sequences for many bacteria opens the way to a better understanding of their metabolism and interactions, provided these genome sequences can be compared. In collaboration with biologists and mathematicians, a bioinformatics team at Montpellier Laboratory of Informatics, Robotics, and Microelectronics (LIRMM) has been developing novel tools to improve bacterial genome comparisons.

With the successful sequencing of the entire genome sequence of some species, the 1990s heralded a post-genomic era for biology: for the first time, the complete gene set of a species was amenable to - mostly in silico - investigations. Through comparative genomics approaches, seminal works have, for instance, determined the core gene set across nine bacteria, ie those genes indispensable to the life of any of these species. The availability of whole genome sequences has consequently opened up new research avenues. The current revolution of sequencing techniques provides such an improvement in yield that it brings within reach the sequencing of thousands of bacterial, or even longer, genomes within a few years. However, the exploitation of the information encoded in these sequences requires an increased capacity to compare whole genome sequences. Indeed, even finding the genes within the genome starts by comparing it to that of the evolutionary most closely related species, if available. Today, genome comparisons are necessary in order to survey the biodiversity of genotypes within populations, and to design vaccines.

Given that genomes are DNA sequences evolving under both point mutations, which alter a base at a time, and block rearrangements, ie duplication, translocation, inversion, or deletion of a DNA segment, comparing genomes is formulated as a string algorithm problem.

Two lines of approach have been followed: (i) extend gene sequence alignment algorithms by making them more efficient and coupling them with a procedure to deal with some types of rearrangements, (ii) consider only block rearrangements on a sequence of ordered markers (typically genes), but disregard the DNA sequence. Type two approaches adequately model changes in the gene order, but require well annotated genes and the knowledge of their evolutionary relationships across species, and are unable to report functionally relevant conservation or evolution in non-gene coding DNA regions. Note that handling multiple comparisons makes most of these formulations NP-hard, and, as with most evolutionary questions, benchmarks are missing since evolution occurs once and cannot be replayed. Thus, currently available solutions are time consuming, require fine parameter tuning, deliver results that are hard to visualise and interpret, and furthermore, do not fit the whole spectrum of applications.

In the framework of the project "Comparisons of Complete Genomes (CoCoGen)" supported by the French National Research Agency, we designed novel algorithms for comparing complete bacterial genomes.

Genome alignment algorithms first search for anchors, which are pairs of matching genomic regions, and chain them to maximize the coverage on both genomes. A chain is an ordered subset of non-overlapping collinear anchors,


Figure 1: Different anchors for whole genome alignment. Alignment of fragments of Brucella suis and B. microtii genomes (local similarity region of sequences AE014291 and CP001578) showing on one hand Maximal Exact Matches (MEM) of length > 5 and a (much longer) Local Alignment. Identical bases are shown in blue background color.

Figure 2: Output of QOD, when used to compare pairwise the Brucella suis and B. microtii genomes. The top diagram represents the B. suis genome as a central horizontal black and blue bar; each other colored horizontal bar depicts a maximal region of similarity with the B. microtii genome. The high genome coverage by such bars indicates the proximity of the two species. The lower parts of the window allow the user to browse the segmentation together with intersecting annotations of the target genome.

meaning that the succession of regions appears in the same order on both genomes. Current tools use pairs of short exact or approximate matches as anchors (MUMmer, Mauve, MGA), to be sure to find secure anchors in quite divergent regions. Instead, we focus on local alignments (LA) found using sensitive spaced-seed filtration. The length of local alignments adapts to the divergence level of the compared regions; moreover, LA are selected and ranked according to their statistical significance (see Figure 1). However, an important disadvantage is that the regions of distinct anchors may overlap within a genome. By adapting a Maximum Independent Set algorithm on trapezoid graphs, we derived a chaining algorithm that allows overlaps between anchors, with the permitted overlap proportional to the anchor lengths. This significantly improved the coverage obtained with LA, and overcomes the overlaps created by prevalent genomic structures like tandem repeats.

Unravelling the pathogenicity mechanisms of some bacterial strains requires knowledge of which genomic regions encode the corresponding genes, compared to less virulent or non-pathogenic strains. These genes may be relevant in the diagnosis, prognosis and treatment of disease. For instance, a recent approach to vaccine design relies strongly on this type of genome comparison. As many applications of comparative genomics aim at finding such distinguishing regions between strains or species, we conceived methods that can directly pinpoint such regions. In existing algorithms, genome comparison is formulated as a maximization problem, and the heuristics used tend to incorporate unreliably aligned region pairs in the final output. Consequently, distinguishing regions may be missed due to misalignment, since an anchor can be found by chance in or around such regions. We propose another genome segmentation algorithm that automatically partitions the target genome into regions that are shared or not shared with k reference genomes (see the software QOD in Figure 2). Performing several comparisons with distinct sets of reference genomes allows us to rapidly determine which genomic regions distinguish a subset of pathogenic strains/species.

Application of our methods to the comparison of multiple bacterial strains that are pathogenic for ruminants has enabled the discovery of new genes that were overlooked by several previous genome annotation projects. It has also facilitated the identification of a gene whose sequence varies among strains and was known to code for a protein that generates important immunological reactions in related bacterial species (the counterpart human and dog pathogens).

Link:
http://www.lirmm.fr/~rivals/CoCoGEN/

Please contact:
Eric Rivals
The Montpellier Laboratory of Informatics, Robotics, and Microelectronics (LIRMM), France
E-mail: [email protected]
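The chaining step can be illustrated with a small sketch. This is hypothetical illustration code, not the authors' QOD implementation: it chains strictly non-overlapping collinear anchors with a simple quadratic dynamic program, whereas the method described above additionally tolerates length-proportional overlaps via a Maximum Independent Set formulation on trapezoid graphs.

```python
# Illustrative sketch of collinear anchor chaining (hypothetical code,
# not the QOD implementation). An anchor is a tuple (s1, e1, s2, e2):
# a matching region spanning [s1, e1) on genome 1 and [s2, e2) on
# genome 2. We select an ordered subset of non-overlapping anchors
# appearing in the same order on both genomes, maximizing the total
# anchored length ("coverage") on genome 1.

def chain_anchors(anchors):
    if not anchors:
        return []
    anchors = sorted(anchors)           # by start position on genome 1
    n = len(anchors)
    length = [e1 - s1 for (s1, e1, _, _) in anchors]
    best = length[:]                    # best chain score ending at anchor i
    prev = [-1] * n                     # back-pointers for traceback
    for i in range(n):
        s1i, _, s2i, _ = anchors[i]
        for j in range(i):
            _, e1j, _, e2j = anchors[j]
            # anchor j must precede anchor i on both genomes (collinearity)
            if e1j <= s1i and e2j <= s2i and best[j] + length[i] > best[i]:
                best[i] = best[j] + length[i]
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]
```

For example, chain_anchors([(0, 10, 0, 12), (12, 20, 5, 9), (15, 30, 20, 35)]) keeps the first and third anchors and discards the second, whose genome-2 interval is out of order.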

ModeRNA builds RNA 3D Models from Template Structures
by Magdalena Musielak, Kristian Rother, Tomasz Puton and Janusz M. Bujnicki

Biological functions of many ribonucleic acid (RNA) molecules depend on their three-dimensional (3D) structure, which in turn is encoded in the RNA sequence. We have developed ModeRNA, a program that constructs 3D models of RNAs based on experimentally determined “template” structures of other, related RNAs. This approach is less time and cost intensive than experimental methods.

RNAs are linear polymers comprising tens to thousands of nucleotide residues that function as building blocks. There are four basic building blocks: adenosine, guanosine, cytidine and uridine, but they are often enzymatically modified in the cell to form more than one hundred derivatives with different chemical structures. RNA molecules and their complexes are main players in the production of proteins and other processes in cells. One prominent example is transfer RNA (tRNA), a molecule with a complex 3D structure, used to decipher the genetic code and translate the genetic information in messenger RNA (mRNA) into protein sequences. Determining 3D structures by experimental methods is time-consuming and expensive, compared to the methods used for obtaining linear RNA sequences. Thus far only 150 tRNA structures have been solved experimentally, as opposed to more than 300,000 known nucleotide sequences.

Because all tRNAs are evolutionarily related, their structures are similar to one another. We can use experimental 3D data of one tRNA as a template to create a model of another related tRNA with a known, different sequence. In addition to the template structure, information about the correspondences of nucleotide residues between the target sequence and the template sequence (a sequence alignment) is required. The alignment is interpreted as a set of instructions regarding which nucleotide residues of the template are to be replaced by which residues of the target. One can think of this concept as creating different pictures from a set of jigsaw puzzle pieces by editing an existing puzzle – for instance replacing a zebra by a lion, or adding a mountain in the background of a savannah landscape.

Figure 1: ModeRNA builds a 3D model of an RNA molecule based on a template structure and an alignment of two sequences.

This technique, called homology modelling or comparative modelling, has been implemented in the ModeRNA program. ModeRNA builds RNA structures starting with the easiest part: nucleotides identical between the target and the template are placed in exactly the same position. Then, single nucleotide substitutions are introduced; for example, an adenine can be exchanged for a guanine. Finally, parts of the RNA structure that are not present in the template are modelled. For that purpose, ModeRNA uses a library of over 100,000 structural fragments, from which the program chooses the one that geometrically fits best into the insertion site in the model. Because even a well-suited fragment may cause small distortions of bond lengths and angles, ModeRNA can optimize atom coordinates to achieve a stereochemically reasonable conformation. Referring to the puzzle analogy, this would be inserting a set of coherent pieces and applying a rasp to make their edges fit the remainder of the puzzle.

To manipulate the 3D coordinates of atoms, ModeRNA employs the Kabsch algorithm for superposition, the Full Cyclic Coordinate Descent algorithm for closing gaps, and the NeRF algorithm to construct atom coordinates. To identify nucleotides, a subgraph matching procedure has been implemented. The program also contains a multitude of functions, e.g. to analyse the geometry of existing structures, to find interatomic clashes, or simply to remove unwanted nucleotide residues. These functions are available via a scripting interface. ModeRNA has been written in the Python language, using the BioPython library for basic tasks like parsing structural data from the PDB format. ModeRNA is available under the GPL Open Source license. The program is being used by several lab members and collaborators, who provide constant feedback and suggestions for improvement.

To test ModeRNA, we constructed a series of 9801 tRNA models for 100 structures determined by X-ray crystallography. We calculated the root mean square deviation (RMSD) of the atomic coordinates of modelled versus experimental structures. The results showed that the RMSD between the models and the original structures correlates well with the RMSD among the experimentally solved structures. The best models reached an RMSD of up to 1 Å, with the majority around 4-5 Å. Obviously, the quality of a model depends on how similar the template is to the target, which highlights the importance of choosing the right template. However, it must be remembered that RNA structure can change depending on the functional state. For example, the anticodon loop and acceptor stem regions of tRNA can adopt different conformations depending on whether the molecule is bound to a protein or to the ribosome, or whether it is free from interactions. We can model these subtle differences by choosing a template structure that is in the desired state.

As the template has such an important influence on model quality, a striking question is: "Do we really need a template, or could we just connect nucleotides from scratch?" In fact, methods like the MC-Fold/MC-Sym pipeline have been used successfully to model small RNA structures (12-50 nucleotides) from nothing more than a nucleotide sequence. However, when the sequence is longer, building a structure without further knowledge becomes computationally unfeasible. For many larger RNA families, known 3D structures are available, among them tRNA, with around 75 nucleotides, and ribosomes, with more than 1000 nucleotides. Thus comparative modelling is a method of choice for 3D structure prediction of large structured RNAs.

In summary, ModeRNA is a tool for the construction of 3D models of RNA sequences using structures of another, related RNA as building blocks (templates). ModeRNA also facilitates RNA 3D structure analysis.

Link:
http://iimcb.genesilico.pl/moderna/

Please contact:
Janusz M. Bujnicki
International Institute of Molecular and Cell Biology, Poland
Tel: +48 22 597 07 50
E-mail: [email protected]
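The model-quality measure used in the benchmark above can be reproduced in a few lines. This is a generic sketch, not code from ModeRNA itself, and it assumes the two coordinate sets have already been optimally superposed (the step for which ModeRNA uses the Kabsch algorithm).

```python
import math

# Root-mean-square deviation between two conformations of the same
# molecule, given as equal-length lists of (x, y, z) atom coordinates.
# Superposition (e.g. via the Kabsch algorithm) must be done first;
# the RMSD is then the square root of the mean squared atomic displacement.

def rmsd(coords_a, coords_b):
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
                for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

Two structures whose atoms are all shifted by 1 Å along one axis give an RMSD of exactly 1.0, matching the scale on which the best models in the benchmark were judged.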

Protein Homology Modelling - Providing Three-dimensional Models for Proteins where Experimental Data is Missing

by Jürgen Haas and Torsten Schwede

A linear sequence of amino acid letters or a three-dimensional arrangement of atoms in a polypeptide chain? Most biologists and bioinformaticians will have their preferred view when imagining a “protein”. Although these views represent two sides of the same coin, they are often difficult to reconcile. There are two main reasons for this: a lack of experimental structural information for the majority of proteins, and a different “culture” in handling data between the two communities, which results in a number of technical hurdles for somebody trying to bridge the gap between the two paradigms. Protein homology modelling resources establish a natural interface between sequence-based and structure-based approaches within the life science research community.

Natural proteins are exciting molecules which - in contrast to the regular helical DNA molecule - have characteristic, well-defined three-dimensional structures. It is the individual structure of a protein, with its intricate network of atomic interactions formed by the backbone and side-chain atoms of the amino acids in a polypeptide chain, which allows it to perform highly specific functions in a living organism. These include mechanical force generated by motor proteins, enzymes degrading nutrients, antibodies recognizing foreign substances such as a pathogen, or olfactory receptors sensing the smell of perfume in one's nose. But the most exciting property of these molecular machines is that they don't need specific tools to be assembled: the linear sequence of amino acids alone contains sufficient information to define the three-dimensional structure when a protein is synthesized by a cell.

Mind the gap
Crucial to the understanding of the molecular functions of proteins are insights gained from their three-dimensional structures. However, while thousands of new DNA sequences coding for proteins are sequenced every day, experimental elucidation of a protein structure is still an expensive and laborious process, typically taking several weeks. Not surprisingly, direct experimental structural information is, to date, only available for a small fraction of all proteins, and this gap is widening rapidly.

Whilst theoretically a nearly unlimited number of amino acid sequence combinations are possible, the number of different three-dimensional protein structures ("folds") actually observed in nature is limited. This allows homology (or comparative) modelling methods to build computational models for proteins ("targets") based on evolutionarily related proteins for which experimental structures are known ("templates"). Hence experimental structure determination efforts and homology modelling complement each other in the exploration of the protein structure universe.

The SWISS-MODEL Expert System
For an experimental scientist, obtaining a three-dimensional structural model of a protein to study a biological question should not require becoming an expert in molecular modelling or programming the necessary software tools. The aim of the SWISS-MODEL expert system is to provide a user-friendly Web-based system for protein structure homology modelling which is usable from any PC equipped with an Internet connection – without the need to install complex software packages or huge databases. Sixteen years ago, SWISS-MODEL was the first fully automated protein modelling service on the Internet, aiming to make protein structure modelling easily accessible to life science research. Today SWISS-MODEL is one of the most widely used Web-based modelling services worldwide. SWISS-MODEL hides the complexity of a stack of specialized modelling software, mirrors public databases of protein sequences and structures, automates procedures for data updates, and has tools for result visualization. Users interact with a set of partially or fully automated workflows in a personalized Web-based workspace. SWISS-MODEL is developed by the Computational Structural Biology Group at the Swiss Institute of Bioinformatics (SIB) and the Biozentrum of the University of Basel, Switzerland.

Figure 1: Three-dimensional homology model of the human Kinesin-like protein KIF3A, a protein involved in microtubule-based translocation. The model was generated using as template the experimental structure of a related protein, the motor domain of the human Kinesin-like protein KIF3B, sharing 66% sequence identity.

The Protein Model Portal
Diversity is essential to the success of any biological entity. However, there is reason to doubt that the same is true for the technical diversity observed among typical bioinformatics resources, especially in the field of molecular modelling. A simple question like "What is known about the three-dimensional structure of a given protein?" can easily end in an hour-long odyssey through the Internet for proteins which are not fully experimentally characterized. While all experimental data is collected centrally in the wwPDB – currently approximately 60,000 experimental structures are deposited in this database – computational structure models are generated by many specialized computational modelling groups spread out across the academic community. The heterogeneous user interfaces and formats used at different sites, together with various incompatible accession code systems, are an additional challenge in accessing structure model information.

In order to provide a single interface to access both experimental structures and computational models for a protein, we have thus developed the Protein Model Portal (PMP) by federating the available distributed resources. PMP is part of the Nature – PSI Structural Biology Knowledgebase (PSI-SBKB) project, and is developed by the Computational Structural Biology Group at the Swiss Institute of Bioinformatics (SIB) and the Biozentrum of the University of Basel, Switzerland, in collaboration with the PSI-SBKB partner sites, specifically Rutgers, The State University of New Jersey. The PSI-SBKB reports on advances in structural biology and structural genomics in an integral manner, including not only newly determined three-dimensional structures, but also the latest protocols, novel materials and technologies. The current release of the portal (May 2010) consists of 12.7 million model structures provided by different partner resources for 3.4 million distinct protein sequences. PMP is available at http://www.proteinmodelportal.org and from the PSI Structural Biology Knowledgebase (http://kb.psi-structuralgenomics.org).

In order to integrate the meta-information on protein models available at the different partner sites – which use heterogeneous data structures and incompatible naming conventions and accession code systems – we created a common independent reference system based on cryptographic md5 hashes of all currently known, naturally occurring amino acid sequences. This database is continuously updated as new sequences are discovered, and allows us to dynamically federate information from different providers without the need for the distributed resources to change their mode of operation. Using portal technologies, queries are transparently mapped to the various database accession code systems, providing dynamic functional annotation for the target sequences.

For the first time it is now possible to query all participating federated structure resources - both experimental structures and computational models - simultaneously, and to compare the available structural information in a single interface. The dynamic mapping of the growing universe of all protein sequences even allows searching for features which were not implemented in the original data sources. Sometimes, the whole is greater than the sum of its parts.

Links:
http://www.proteinmodelportal.org
http://kb.psi-structuralgenomics.org
http://www.wwpdb.org

Please contact:
Jürgen Haas and Torsten Schwede
Biozentrum University of Basel
SIB Swiss Institute of Bioinformatics
Tel: +41 61 2671581
E-mail: [email protected], [email protected]
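The md5-based reference system described above can be sketched as follows. The normalization details here (whitespace stripping, uppercasing) are assumptions made for illustration, not PMP's documented canonicalization rules.

```python
import hashlib

# Sketch of a provider-independent sequence key in the spirit of PMP's
# md5-based reference system (the normalization is an assumption, not
# PMP's specified behaviour): identical amino acid sequences map to the
# same 32-character key regardless of formatting or source database.

def sequence_key(sequence):
    canonical = "".join(sequence.split()).upper()
    return hashlib.md5(canonical.encode("ascii")).hexdigest()
```

Two differently formatted copies of the same sequence, e.g. sequence_key("MKT AYIAKQR") and sequence_key("mktayiakqr"), yield the same key, so records from different providers can be joined on it without agreeing on a shared accession code system.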


Modelling of Rapid and Slow Transmission Using the Theory of Reaction Kinetic Networks

by Dávid Csercsik, Gábor Szederkényi and Katalin M. Hangos

With their interdisciplinary background and interest in nonlinear process systems, the Process Control Research Group at the Computer and Automation Research Institute of the Hungarian Academy of Sciences (SZTAKI) carries out research in the modelling, analysis, representation and control of reaction kinetic systems, and their application in systems biology.

The modelling and analysis of signalling pathways in living organisms is one of the most challenging problems in systems biology that can be handled using the theory of chemical reaction networks (CRNs) obeying the mass-action law. In addition to being used to describe pure chemical reactions, such networks are also widely used to model the dynamics of intracellular processes, and of metabolic or cell signalling pathways. This model class enables the use of the deficiency-based, multi-stability-related results of Martin Feinberg et al., which provide very strong theorems about the qualitative behaviour of the modelled system, based only on the structure of the reaction network, independently of its parameters.

Figure 1: The reaction scheme of the kinetic model describing fast (G protein coupled) and slow (β-arrestin coupled) transmission.

In the case of reaction kinetic models, we consider a system of n chemical species participating in a reaction network of r reversible steps. For the graphical representation of the kinetic system, reaction schemes can be used, which describe the structure of the enzymatic and non-enzymatic reactions in a compressed way (not depicting every single reaction). Reaction schemes can be depicted in mathematical terms using hypergraphs, where the edges may be adjacent to more than two vertices. The vertices of a reaction scheme correspond to the non-enzymatic, complex-type components, while the hyper-edges describe chemical reactions (not necessarily reaction steps). An enzyme-catalytic reaction corresponds to a pair of hyper-edges with different directions, both adjacent to the three components S, P and E (substrate, product and enzyme, respectively). An example of a reaction scheme is shown in Figure 1.

One recent project in which CRNs have been applied to biological systems is the modelling of rapid (G protein coupled) and slow (β-arrestin coupled) transmission. Until the 2000s, the most widely accepted classic paradigm of signalling related to G protein coupled receptors hypothesized that the most important elements contributing to information transfer to the internal system of the cell are the α and βγ subunits of G proteins. In recent years, it has been shown that, in addition to taking part in receptor desensitization and attenuation of G protein coupled signalling, β-arrestins also form an endocytic protein complex. This complex initiates a G protein independent transmission and regulation of extracellular signal-regulated kinase (ERK), an important kinase that plays a central role in the intracellular signalling network (ERK is also activated by G protein coupled pathways). The recognition that a single receptor acts as a multiple source of signalling pathways, and that various drugs binding to this receptor might differentially influence each pathway (in contrast to pathway-specific drugs), led to a reassessment of the efficacy concept. These recent biological findings led to the concept of a dynamical model capable of describing the interactions of the two convergent, but qualitatively different, signalling mechanisms.

In cooperation with the Neuromorphological and Neuroendocrine Research Laboratory of the Department of Human Morphology and Developmental Biology (Hungarian Academy of Sciences and Semmelweis University), we have developed a simplified dynamic model that describes the dynamic behaviour of G protein signalling. The model takes into account the effect of slow transmission, RGS-mediated feedback regulation and ERK-phosphatase-mediated feedback regulation. The parameters of the model have been determined via numerical optimization.

The proposed reaction kinetic model, depicted in Figure 1 as a reaction scheme, gives rise to an acceptable qualitative approximation of the G protein dependent and independent ERK activation dynamics that is in good agreement with the experimentally observed behaviour (see Figure 2).

Figure 2: The simulation results of the model describing fast (G protein coupled) and slow (β-arrestin coupled) transmission.

The developed and validated model could potentially be applied to disorders of the reproductive neuroendocrine system. In cases of polycystic ovary syndrome, for instance, treatment may include administration of the key hormone GnRH (which acts via G protein coupled receptors) or its analogues. The importance of slow transmission is also becoming evident in other areas of physiology and medicine; for example, recent studies suggest that β-arrestins may play a central role in diabetes mellitus and insulin resistance.

Link:
http://daedalus.scl.sztaki.hu/PCRG/

Please contact:
Dávid Csercsik
SZTAKI, Hungary
Tel: +36 1 279 6163
E-mail: [email protected]

Gábor Szederkényi
SZTAKI, Hungary
Tel: +36 1 279 6163
E-mail: [email protected]

Katalin M. Hangos
SZTAKI, Hungary
Tel: +36 1 279 6101
E-mail: [email protected]

SCAI-VHTS - A Fully Automated Virtual High Throughput Screening Framework
by Mohammad Shahid, Torbjoern Klatt, Hassan Rasheed, Oliver Wäldrich and Wolfgang Ziegler

One of the major challenges of in-silico virtual screening pipelines is dealing with increasing complexity of large scale data management along with efficiently fulfilling the high throughput computing demands. Despite the new workflow tools and technologies available in the area of computational chemistry, efforts in effective data management and efficient post-processing strategies are still ongoing. SCAI-VHTS fully automates virtual screening tasks on distributed computing resources to achieve maximum efficiency and to reduce the complexities involved in pre- and post-processing of large volumes of virtual screening data.

Virtual screening is an important and complementary step in the modern drug discovery process. Small molecules in virtual compound databases are computationally screened against specific biological protein targets by computing the binding energy of these molecules inside the protein active site. Scoring and ranking are performed to filter and select drug-like molecules that are active against the biological targets of interest. The three-dimensional structures of both the protein targets and the small molecules are required to perform virtual screening in the structure-based drug discovery process. There are more than 60,000 protein 3D structures available in the Protein Data Bank (PDB) and millions of small molecules in publicly available compound databases. Furthermore, there are billions of virtual compounds that could be obtained from combinatorial chemistry space. Even simulating a few million of such compounds increases the demand for computing resources, as well as the effort involved in managing large datasets.

Without a framework that fully automates the virtual screening workflow, manual execution and management of the workflow is very cumbersome. First, there is the subtle issue of handling huge amounts of data in the form of large numbers of input/output files. Data management during the pre- and post-processing stages is the most tedious, laborious and time-consuming work. Other important issues include the manual distribution of the workload over the available computing resources, and the tracking and monitoring of running tasks. Furthermore, failure management is an important issue that requires great effort, including the identification and resubmission of tasks that have failed or been lost for various reasons.


Figure 1: Architecture of the automated virtual screening framework.

Many recent research projects have demonstrated the role and relevance of Grid technologies for virtual screening. Grid technologies can help accelerate the screening process, as well as provide the required resources, such as computing power and large-scale data storage, that meet the demands of CPU- and data-intensive biomedical applications. In research environments, Grids are established as a useful technology that allows sharing and on-demand utilisation of geographically dispersed heterogeneous resources, including high performance computing resources. Grid technologies are applied to perform high throughput computation in the computational chemistry domain. The use of this technology enables researchers to collaborate in a virtual laboratory that provides an integrated resource platform. A framework for integrating tools and methodologies for efficient deployment of virtual screening experiments forms the basis of such a virtual screening laboratory.

SCAI-VHTS, a fully automated virtual high throughput screening framework, enables researchers to perform large scale virtual screening experiments that include complex workflows with great ease of use. The framework is based on the functionalities provided by the UNICORE Grid middleware, together with SCAI's Meta Scheduling Service extended in the PHOSPHORUS project (see links below). A virtual screening workflow typically consists of pre-processing the input data, distribution of the data to the computing systems, execution of the screening applications that perform the simulations, and post-processing and collection of the results. The architecture of the automated virtual screening framework (see Figure 1) provides a single point of interaction with distributed computing resources while hiding the complexity of the underlying infrastructure.

Within this framework, two compute- and data-intensive applications, FlexX and AutoDock, have been integrated with components for performing pre-processing of the input data as well as post-processing of the results. Pre-processing includes formatting the input data for the respective application, while post-processing includes filtering of the virtual screening results by applying post-docking strategies. Post-processing mainly comprises ranking by the built-in scoring function of the application, as well as re-ranking by a modified protein-ligand interaction fingerprints approach.

The researcher simply has to submit a single virtual screening job on the Grid through a graphical UNICORE client plug-in, after which the Meta Scheduling Service performs workload distribution and manages execution and monitoring on the available computing resources. Wrapper programs configured locally on the compute clusters further facilitate maximum utilisation of the compute resources, as well as the management of the distributed virtual screening data. The modular workflow is fully extensible to allow easy integration of other applications such as molecular dynamics simulations.

The SCAI-VHTS framework relieves the researcher of the challenges of deploying large scale virtual screening experiments and the tedious and laborious work involved in managing large volumes of input/output data. As a result, the researcher can concentrate on his primary goals: finding and evaluating new drugs. SCAI-VHTS can assist with data management, analysis, and automated knowledge discovery by facilitating remote collaborations across distributed and heterogeneous Grid resources, thus creating a virtual laboratory for biological research and development.

Links:
UNICORE: http://www.unicore.eu
PHOSPHORUS: http://www.ist-phosphorus.eu/

Please contact:
Mohammad Shahid
Fraunhofer SCAI, Germany
Tel: +49 2241 14 2777
E-mail: [email protected]

Wolfgang Ziegler
Fraunhofer SCAI, Germany
Tel: +49 2241 14 2248
E-mail: [email protected]

Searching for Anti-Amyloid Drugs with the Help of Citizens: the “AMILOIDE” Project on the IBERCIVIS Platform
by Carlos J. V. Simões, Alejandro Rivero, Rui M. M. Brito

The current “target-rich and lead-poor” scenario in drug discovery, together with the massive financial resources required to develop a new drug, mean that new approaches and renewed efforts by researchers in academia are required, in particular in the area of neglected and rare diseases. Virtual screening and volunteer computing are extremely useful tools in this fight, and together have the potential to play a crucial role in the early stages of drug development. From a social and economic point of view, amyloid neurodegenerative diseases, including Alzheimer's, Parkinson's, familial amyloid polyneuropathy and several others, currently represent important targets for drug discovery. Here, we provide a short account of our current efforts to develop new compounds with anti-amyloid potential using a volunteer computing network.

Transthyretin Amyloid
Despite recent advances in our understanding of biological systems, the discovery and development of a new therapeutic agent is still a slow, difficult and very expensive process. This is mainly due to the low success rates associated with finding the “magic bullet” for a given biological target. Computational methodologies play an increasing role in drug discovery but, due to limitations of the approximations used, it is clear that a “one recipe fits all” approach is far from suitable.

At the Structural and Computational Biology group of the Center for Neuroscience and Cell Biology, University of Coimbra, we devised a research program to find new lead compounds for the treatment of amyloid diseases, in particular familial amyloid polyneuropathy (FAP), a genetic neurodegenerative disease that is prevalent in Portugal, with other foci around the world. Known as a highly impairing disease, it is characterized by loss of sensation to temperature and pain in the lower limbs during its early stages, later evolving to muscular weakness and general autonomic dysfunction. The only effective treatment known to date is liver transplantation, since the liver is the main site of synthesis of the protein transthyretin, the causative agent of FAP.

Transthyretin (TTR) is a protein of the blood plasma known to bind and transport the hormone thyroxine. The cytotoxic role of TTR on the peripheral nerves involves the formation of amyloid aggregates and amyloid fibrils through a series of molecular events, whereby the protein dissociates, partially unfolds and aggregates. These events can be modulated by the binding of thyroxine-like molecules to the two binding pockets of TTR. Several of these small molecules have been shown to stabilize the protein and thereby decrease or even prevent amyloid formation, but in general they are associated with undesirable side effects, low solubility, and/or an inability to bind TTR strongly in the plasma. However, a Phase III clinical trial has recently been concluded with one of these molecules, with very promising results.

Virtual Screening and Volunteer Computing: the AMILOIDE-Ibercivis project
In order to search for new lead compounds capable of stabilizing the native form of TTR, we envisaged a rigorous computationally-driven search to minimize the occurrence of false positives. Multiple combinations of docking algorithms and scoring functions were validated by assessing the software’s ability to reproduce the binding mode of known TTR binders in experimental X-ray structures. Docking algorithms such as AutoDock4, FRED, eHiTS and GOLD were tested, along with their in-built and additional scoring functions. For TTR, the combination of AutoDock4 with a scoring function called DrugScore-CSD produced the most reliable results.

In parallel, restricting “chemical space” to a biologically relevant and synthetically accessible set of molecules is an essential step in a Virtual Screening (VS) campaign. We have generated a tailored library of 2,259,573 compounds by filtering an initial collection of approximately 11 million molecules deposited in the ZINC database. The filtering process involved combining rules for bioavailability with our knowledge of the physico-chemical properties of the known TTR stabilizers. However, the docking of approximately 2.3 million compounds to TTR on a single desktop computer would take about 43 years of CPU time.

The AMILOIDE project takes advantage of the expansion of the Ibercivis network to Portugal, implementing AutoDock4 on a large-scale volunteer computing platform. Powered by the BOINC middleware, Ibercivis was initially developed at the Institute for Biocomputation and Physics of Complex Systems, University of Zaragoza, and it is guided by two main objectives: i) to provide an easily operable distributed computing platform available to multiple scientific projects; ii) to provide a science communication forum where citizens and scientists may interact.

In AMILOIDE-Ibercivis, each BOINC Work Unit is comprised of one TTR receptor structure with its respective 3D maps of atomic affinity, one ligand structure, and docking parameter files. The Work Units are sent to the network clients, where AutoDock4 carries out the calculations. When these are concluded, the results are sent back to one of the Ibercivis servers and later transferred to the researchers’ machines for analysis. Here, the predicted complexes of TTR are submitted to rescoring with DrugScore-CSD. The most promising molecules are purchased and tested experimentally for activity.

ERCIM NEWS 82 July 2010 25 Special Theme: Computational Biology

Figure 1: Physiopathological model for amyloid fibril formation by Transthyretin (TTR) (top panel, top row) and possible points of therapeutic intervention (top panel, bottom row): stabilization of the native tetramer; inhibition of aggregation; solubilization of aggregates and fibrils. Stabilization of the native tetramer is currently the strategy of choice, due to the large amount of experimental data available. Virtual high-throughput docking of small organic molecules to Transthyretin (bottom left panel) is being conducted on the large volunteer computing network Ibercivis (bottom right panel).

During the period between October 2009, when the first production Work Unit of the AMILOIDE-Ibercivis project was launched, and May 2010, all 2,259,573 Work Units were sent out to BOINC clients on Ibercivis. Of these, approximately 80% have been successfully returned. This represents an average validation rate of 8,600 Work Units per day, or 258,000 Work Units per month. Client errors are mostly due to aborted jobs (74%), followed by computation errors (23%). Download errors account for only 3% of the failures. The computing performance of this experiment is staggering: the Ibercivis platform provided the AMILOIDE project with over 350,000 hours of CPU time during the first seven months of execution.

The returned docking results are very promising, indicating that the top-ranked compounds hold features distinct from the known TTR binders, with better solubility, a lower fraction of potentially toxic halogen atoms and a better combination of binding pocket affinities.

Link:
http://www.ibercivis.net

Please contact:
Rui M. M. Brito
Center for Neuroscience and Cell Biology, University of Coimbra, Portugal
E-mail: [email protected]
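The throughput figures quoted in the article are internally consistent, as a quick back-of-the-envelope check shows. The ~210-day window is our approximation of the October 2009 to May 2010 period, and the per-docking time is derived from the article's 43-CPU-year single-desktop estimate; the script is purely illustrative arithmetic:

```python
# Cross-check of the AMILOIDE-Ibercivis throughput figures quoted in the text.
TOTAL_WORK_UNITS = 2_259_573      # one Work Unit per compound in the tailored library
RETURN_RATE = 0.80                # ~80% of Work Units successfully returned
DAYS = 210                        # approx. October 2009 to May 2010

returned = TOTAL_WORK_UNITS * RETURN_RATE
per_day = returned / DAYS
per_month = 30 * per_day

# Single-desktop estimate: ~43 CPU-years for ~2.3 million dockings.
seconds_per_docking = 43 * 365 * 24 * 3600 / 2_300_000

print(f"validated per day:   {per_day:,.0f}")      # text quotes ~8,600
print(f"validated per month: {per_month:,.0f}")    # text quotes ~258,000
print(f"~{seconds_per_docking / 60:.0f} minutes of CPU time per docking")
```

The roughly ten minutes of CPU time per ligand is what makes the problem such a natural fit for an embarrassingly parallel volunteer platform.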

Swarm Intelligence Approach for Accurate Gene Selection in DNA Microarrays
by José García-Nieto and Enrique Alba

DNA microarrays have emerged as powerful tools in current genomic projects since they allow scientists to simultaneously analyse thousands of genes, providing important insights into the functioning of cells. Owing to the large volume of information that can be generated by a microarray experiment, the challenge of extracting the specific genes responsible for a given illness can only be solved by using automatic means. This challenge has driven our research group at the University of Málaga to design swarm intelligence approaches with the aim of performing accurate biological selection from gene expression datasets (AML-ALL leukemia, colon tumour, lung cancer, etc.).

A DNA microarray consists of a series of thousands of DNA molecules located in different positions within a matrix structure. These DNA molecules are mixed with cellular cDNA molecules during a hybridization process, after which abundant sequences generate strong signals while rare sequences generate weak signals. Microarrays are normally used to compare gene expression intensity within a sample, and to look at differences in the expression of specific genes among samples. A sample is a test focussing on one disease or on healthy-unhealthy tissues. This is especially appropriate in cancer analysis, since it allows discrimination between tumour tissue and normal tissue. Several gene expression profiles obtained from tumours such as leukaemia, colon, and lung cancers are ready for research in computational biology, and we use them in our research.

The vast amount of data involved in a typical microarray experiment usually requires complex statistical analyses, with the goal of performing a supervised division of the dataset into correct classes. The key issue in this classification is to identify representative gene subsets that may be later used to predict class membership for new external samples. These subsets should be as small as possible in order to develop fast processes for the future class prediction done in an actual laboratory. The main difficulty in microarray classification versus classification of other datasets (found in other domains) is the availability of a relatively small number of samples in comparison to the huge number of genes in each sample (a typical microarray can have a few tens of tested samples for several thousands of genes). In addition, expression data are highly redundant and noisy, and most genes are believed to be uninformative, as only a small proportion of genes may present distinct profiles for different classes of samples.

In this context, machine learning techniques have been applied to handle large datasets, since they are capable of isolating useful information by rejecting redundancies. In practice, computational feature selection (gene selection, in biology) is often considered a necessary pre-processing step prior to analysing large datasets, in order to reduce the size of the dataset. Feature selection for gene expression analysis often uses supervised classification methods such as K-Nearest Neighbour (K-NN), Support Vector Machines (SVM), and Artificial Neural Networks (ANN) to discriminate a type of tumour. Nevertheless, optimal feature selection is a complex problem that has proved to be NP-hard, and therefore efficient, automated and intelligent approaches are needed to tackle it.

Our research addresses all these issues. We use swarm intelligence algorithms in order to perform an efficient gene selection from large microarray datasets. Swarm intelligence approaches are computational procedures that model the collective behaviour of decentralized and self-organized systems found in nature (ant colonies, bee swarms, bird flocking, etc.) to solve an optimization or search problem. Using this approach, it is possible to reach optimal or quasi-optimal solutions to a given problem. Our goal is to minimise classification error whilst using the smallest possible set of genes to explain the results provided by a microarray. Swarm intelligence can result in savings of both time and resources in comparison to exhaustive and traditional search techniques. Essentially, our model consists of a particle swarm optimization (PSO) algorithm, in which a feature selection mechanism facilitates identification of small samples of informative genes among thousands of genes.

Figure 1: Complete process for optimal gene selection. Once the expression data are generated, our swarm intelligence algorithm obtains optimal subsets of representative genes that are offered to the human specialist as the genes responsible for the illness.

As shown in Figure 1, solutions reported by PSO (codifying gene subsets) are evaluated by means of their classification accuracy using SVM and cross-validation. The contribution of our swarm intelligence approach is notable, since it offers an improvement on existing state-of-the-art algorithms in terms of computational effort and classification accuracy (see the links section). Furthermore, the gene ensembles found by this technique can be successfully interpreted in the light of independent existing results from biology, ie they are biologically (not just computationally) meaningful.

Link:
NEO Research Group: http://neo.lcc.uma.es

Please contact:
José García-Nieto
University of Málaga/SpaRCIM, Spain
Tel: +34 952133303
E-mail: [email protected]

Enrique Alba
University of Málaga/SpaRCIM, Spain
Tel: +34 952132803
E-mail: [email protected]
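The selection loop described in the article can be conveyed with a compact binary PSO. The sketch below is ours, not the NEO group's code: it uses a synthetic expression matrix, a leave-one-out 1-NN classifier as a stand-in for the SVM-with-cross-validation fitness, and a small size penalty so that fitness favours small, accurate gene subsets. All parameter values and data are illustrative.

```python
import math
import random

random.seed(1)

# Toy expression data: 40 samples x 30 genes; only genes 0-2 separate the classes.
N_GENES, N_INFORMATIVE = 30, 3
y = [0] * 20 + [1] * 20
X = [[random.gauss(2.5 * label if g < N_INFORMATIVE else 0.0, 1.0)
      for g in range(N_GENES)] for label in y]

def loo_accuracy(subset):
    """Leave-one-out 1-NN accuracy using only the genes in `subset`."""
    if not subset:
        return 0.0
    hits = 0
    for i in range(len(X)):
        j = min((j for j in range(len(X)) if j != i),
                key=lambda j: sum((X[i][g] - X[j][g]) ** 2 for g in subset))
        hits += y[j] == y[i]
    return hits / len(X)

def fitness(bits):
    subset = [g for g in range(N_GENES) if bits[g]]
    # Maximise accuracy while penalising large gene subsets.
    return loo_accuracy(subset) - 0.005 * len(subset)

# Binary PSO: positions are bit vectors, velocities squashed through a sigmoid.
W, C1, C2 = 0.7, 1.4, 1.4
swarm = 10
pos = [[random.randint(0, 1) for _ in range(N_GENES)] for _ in range(swarm)]
vel = [[0.0] * N_GENES for _ in range(swarm)]
pbest = [p[:] for p in pos]
pbest_f = [fitness(p) for p in pos]
g_i = max(range(swarm), key=lambda i: pbest_f[i])
gbest, gbest_f = pbest[g_i][:], pbest_f[g_i]

for _ in range(20):
    for i in range(swarm):
        for g in range(N_GENES):
            vel[i][g] = (W * vel[i][g]
                         + C1 * random.random() * (pbest[i][g] - pos[i][g])
                         + C2 * random.random() * (gbest[g] - pos[i][g]))
            pos[i][g] = 1 if random.random() < 1 / (1 + math.exp(-vel[i][g])) else 0
        f = fitness(pos[i])
        if f > pbest_f[i]:
            pbest[i], pbest_f[i] = pos[i][:], f
            if f > gbest_f:
                gbest, gbest_f = pos[i][:], f

selected = [g for g in range(N_GENES) if gbest[g]]
print("selected genes:", selected)
print("fitness:", round(gbest_f, 3))
```

On a real microarray the classifier, cross-validation scheme and penalty weight all matter; the point here is only the shape of the algorithm: particles encode gene subsets, and the swarm converges toward subsets that classify well.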

From Global Expression Patterns to Gene Co-regulation in Brain Pathologies

by Michal Dabrowski, Jakub Mieczkowski, Jakub Lenart and Bozena Kaminska

Understanding how multiple genes change expression in a highly ordered and specific manner in healthy and diseased brains may lead to new insights into brain dysfunction and identification of promising therapeutic targets, for example key signalling molecules and transcriptional regulators. To this end, we have been performing integrated analyses of high-throughput datasets, genomic sequence-derived data and functional annotations.

The computational work of our group (Laboratory of Transcription Regulation, Nencki Institute) focuses on developing tools that permit a systemic view of global transcriptional/epigenetic changes, and on applying these tools to data from experimental models studied in our laboratory, in particular animal models of stroke, brain tumours and in vitro experimental models of brain tumour-host interactions. In this way computational results can guide experimental effort, which in turn verifies the computational predictions.

Computational/bioinformatics activities of our group focus on the following areas:
• building a comprehensive database of predicted or experimentally identified cis-regulatory regions, transcription factor binding sites (TFBS) and chromatin modifications in vertebrates (human, mouse, rat)
• functional interpretation of global gene expression or other high-throughput data in the context of databases of functional annotations (Gene Ontology) or biological pathways (KEGG)
• identification of transcriptionally co-regulated genes from global gene expression, genomic sequence-derived and chromatin modification data.

We have previously established a local Transcription regulatory regions and motifs (TRAM) database of putative transcription regulatory regions (AVID-VISTA conserved non-coding regions) and TFBS motif instances (Genomatix) for all pairs of corresponding (orthologous) genes in human and rat. This database is now being updated, extended, and prepared for future automated actualization and public release, as part of the Nencki Regulatory Genomics Portal (NRGP) project, started in early 2010 and scheduled for four years. The data layer of the NRGP will incorporate additional vertebrate species and alternative sources of putative regulatory regions, both computationally predicted (BlastZNet, cisRED promoters) and established experimentally (Enhancer VISTA, Ensembl funcgen). These regulatory regions will be scored with each of the three major TFBS motif libraries, both commercial (Transfac, Genomatix) and public (JASPAR). We will implement an innovative way of mapping regulatory regions to genes, taking into account the positions of insulators. The data layer of NRGP is illustrated in Figure 1A.

The Nencki Regulatory Genomics Portal will be an internet service providing access to its data layer and tools for analysis of user-supplied gene expression data, aiming at identification of functions and/or cis-regulatory mechanisms common to many genes. The service will be available via a web browser for users who want to analyze their data with the provided algorithms, and via a database client for those who need direct access to the stored information. DAS technology will be used to make our data layer accessible from the Ensembl genome browser.

The architecture of NRGP, based on the MVC (Model-View-Controller) pattern, will consist of the following layers:
• Data layer – responsible for all data storage and actualization, corresponding to changes made in its sources such as Ensembl or the motif libraries
• Application layer – divided into a front-end responsible for implementation of the main algorithms associated with the refinement of gene expression, regulation and function, and a back-end providing services related to actualization, such as searching for TFBS motifs
• Presentation layer – responsible for the GUI (Graphical User Interface), adapted to run in a web browser.

In order to achieve a high level of quality of the supported analyses, NRGP will integrate widely known and used components such as Mathematica, the R statistical programming language, and commercial libraries. Moreover, the team developing the service consists of experts in areas including biology of gene expression, software engineering, and the mathematics of machine learning and data mining.

Figure 1: Functionality of the future Nencki Regulatory Genomics Portal. (Panels: A. Data layer; B. Function overview; C. Example pathway; D. Example motifs' effect on gene expression.)

NRGP will provide the following functionalities (see Figure 1B):
• Refinement of expression – starting from an expression dataset uploaded by the user, this module permits identification and visualization of common patterns of gene expression, with standard techniques for the analysis of microarray gene expression data (analysis of variance, several clustering algorithms, SVD/PCA). In a recent paper, linked below, we show that probe set filtering increased correlations between microarray and qRT-PCR results in two types of studies: detection of differential expression computed by p-values of a t-test, and estimation of fold change between analyzed groups.
• Refinement of function – permits identification of functional annotations statistically associated with a set of genes with a particular pattern of expression, with standard techniques (Fisher exact test, Wilcoxon signed rank test). We are currently working on a novel analysis of changed signal (ACS) algorithm for identification of signalling pathways affected by changes in gene expression, which takes into account the established topology of signalling pathways, is threshold-free, and does not require the assumption of independence between genes. Figure 1C shows visualization of changes in expression of the genes in a particular signalling pathway.
• Refinement of regulation – aims at identification of TFBS motifs and/or chromatin modifications associated with a particular pattern of gene expression, by finding motifs over-represented in putative regulatory regions (Fisher exact test), followed by linear or logistic regression to study the effect of motif multiplicity on the given pattern of expression. We previously reported analysis of cis-regulation in subspaces of conserved eigensystems (bi-orthogonal components, also called SVD modes). In a recent paper, linked below, we demonstrate how bi-orthogonality of gene expression data can emerge as a result of biological processes occurring in different cell types, with signal passing between them. In collaboration with Dr. Norbert Dojer (Institute of Informatics, University of Warsaw) we used our method, in the framework of Bayesian Network learning, to dissect transcriptional regulation in brain following stroke and seizures (Figure 1D). Moreover, we show that the effects of motif multiplicity on gene expression analyzed in subspace follow the predictions of the linear response model of gene regulation.

In our research we collaborate with the Computational Biology Group led by Prof. Jerzy Tiuryn (Institute of Bioinformatics, University of Warsaw) and with the Linnaeus Centre for Bioinformatics (Uppsala, Sweden) directed by Prof. Jan Komorowski.

Links:
http://www.nencki.gov.pl/pl/struktura/biologia_komorki/lab_02.html
http://www.biomedcentral.com/1471-2105/7/367
http://www.biomedcentral.com/1471-2105/11/104

Please contact:
Michal Dabrowski
The Nencki Institute, Poland
Tel: +48 22 58 92 232
E-mail: [email protected]
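Both the "Refinement of function" and "Refinement of regulation" modules lean on the Fisher exact test for over-representation. For a one-sided test this reduces to a hypergeometric tail probability, sketched below with purely hypothetical counts (this is a textbook illustration of the statistic, not the portal's implementation):

```python
from math import comb

def motif_enrichment_p(N, K, n, k):
    """One-sided Fisher exact test (hypergeometric upper tail):
    probability of observing >= k motif-bearing genes in a set of n genes,
    when K of the N background genes carry the motif."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

# Hypothetical counts: 10,000 background genes, 500 carry the motif;
# a cluster of 100 co-expressed genes contains 15 motif-bearing genes.
p = motif_enrichment_p(10_000, 500, 100, 15)
print(f"p = {p:.2e}")
```

With roughly 5 motif-bearing genes expected by chance in a cluster of 100, fifteen hits yields a small p-value, which is the signal the "Refinement of regulation" module would pass on to the regression stage.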


Testing, Diagnosing, Repairing, and Predicting from Regulatory Networks and Datasets

by Torsten Schaub and Anne Siegel

We use expressive and highly efficient tools from the area of Knowledge Representation for dealing with contradictions occurring when confronting observations in large-scale (omic) datasets with information carried by regulatory networks.

The availability of high-throughput methods in molecular biology has led to a tremendous increase in measurable data, along with resulting knowledge repositories, gathered on the web usually within biological networks. However, both measurements and biological networks are prone to considerable incompleteness, heterogeneity, and mutual inconsistency, making it difficult to draw biologically meaningful conclusions in an automated way.

Further, probabilistic and heuristic methods exploit disjunctive causal rules to derive regulatory networks from high-throughput (static) experimental data. For instance, disjunctive causal rules on influence graphs were originally introduced in random dynamical frameworks to study global properties of large-scale networks, using a probabilistic approach. These were demonstrated mainly on the transcriptional network of yeast. However, these methods are mostly data-driven, and they lack the ability to perform corrections in a fast and global way. In contrast, efficient model-driven approaches based on model checkers – such as multi-valued logical formalisms – are available to confront networks and measured data. These, however, make use of time-series observations and can only be applied to small-scale parameterized systems, since they need to consider the full dynamics of the system.

We have proposed an intermediate approach to perform diagnosis on large-scale static datasets. We use a Sign Consistency Model (SCM), imposing a collection of constraints on experimental measurements together with information on cellular regulations between network components.

The main advantage of SCM lies in its global approach to confronting networks and data, since the model allows the propagation of static information along the network and the localization of contradictions between distant nodes. In contrast to available probabilistic methods, this model is particularly well-suited to dealing with qualitative knowledge (for instance, reactions lacking kinetic details) as well as incomplete and noisy data. Indeed, SCM is based on influence (or signed interaction) graphs, a common representation for a wide range of dynamical systems, lacking or abstracted from detailed quantitative descriptions.

By combining SCM with efficient Boolean constraint solvers, we address the problem of detecting, explaining, and repairing contradictions (called inconsistencies) in large-scale biological networks and datasets by introducing a declarative and highly efficient approach based on Answer Set Programming [1]. Moreover, our approach enables the prediction of unobserved variations and has shown an accuracy of over 90% on the entire network of E. coli along with published experimental data. Notably, such genome-wide predictions can be computed in a few seconds.

Figure 1: A graphical representation of identified inconsistencies in an Escherichia coli network.

From the application perspective, the distinguishing novel features of our approach are as follows: (i) it is fully automated, (ii) it is highly efficient, (iii) it deals with large-scale systems in a global way, (iv) it detects existing inconsistencies between networks and datasets, (v) it diagnoses inconsistencies by isolating their source, (vi) it offers a flexible concept of repair to overcome inconsistencies in biological networks, and finally (vii) it enables prediction of unobserved variations (even in the presence of inconsistency).

The efficiency of our approach stems from advanced Boolean constraint technology, allowing us to deal with problems consisting of millions of variables. Although the basic tools [1] are implemented in C++, we have improved their accessibility by providing a Python library as well as a corresponding Web service [2].

Our project is a joint effort between the Knowledge Representation and Reasoning group [3] at the University of Potsdam and the SYMBIOSE Team [4] at IRISA and INRIA in Rennes. Our techniques have been developed in strong collaboration with the Max-Planck-Institute for Molecular Plant Physiology in Potsdam within the GoFORSYS Project [5], as well as with the Institut Cochin, Paris [6]. The members of the group include Sylvain Blachon, Martin Gebser, Carito Guziolowski, Jacques Nicolas, Max Ostrowski, Torsten Schaub, Anne Siegel, Sven Thiele, and Philippe Veber.

Links:
[1] http://en.wikipedia.org/wiki/Answer_set_programming
[2] http://potassco.sourceforge.net
[3] http://www.cs.uni-potsdam.de/bioasp/sign_consistency.html
[4] http://www.cs.uni-potsdam.de/wv
[5] http://www.irisa.fr/symbiose
[6] http://www.goforsys.org
[7] http://www.cochin.inserm.fr

Please contact:
Torsten Schaub
Universität Potsdam, Germany
E-mail: [email protected]

Anne Siegel
CNRS/IRISA, Rennes, France
E-mail: [email protected]
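The core idea of sign consistency can be conveyed in a few lines: every observed variation must be explainable by at least one incoming influence whose source variation, multiplied by the edge sign, matches it. The toy checker below is our own illustration of that single constraint (a tiny fragment of what SCM covers; the authors' ASP-based tools additionally diagnose and repair at genome scale):

```python
# Hypothetical signed influence graph: (source, target, sign).
edges = [
    ("a", "b", +1),   # a activates b
    ("a", "c", -1),   # a represses c
    ("b", "d", +1),
    ("c", "d", -1),
]

# Hypothetical observed variations (+1 up-regulated, -1 down-regulated).
observed = {"a": +1, "b": +1, "c": +1, "d": +1}

def inconsistencies(edges, observed):
    """Return observed nodes whose variation no incoming influence explains."""
    preds = {}
    for s, t, sign in edges:
        preds.setdefault(t, []).append((s, sign))
    bad = []
    for node, var in observed.items():
        if node not in preds:          # network inputs are unconstrained
            continue
        if not any(observed[s] * sign == var for s, sign in preds[node]):
            bad.append(node)
    return bad

print(inconsistencies(edges, observed))   # → ['c']
```

Here "c" is flagged because its only regulator "a" is up while repressing it, yet "c" is observed up; a repair could flip the observation on "c" or revise the edge sign, which is exactly the kind of choice the declarative encoding enumerates.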

MCMC Network: Graphical Interface for Bayesian Analysis of Metabolic Networks
by Eszter Friedman, István Miklós and Jotun Hein

The Data Mining and Web Search Group at SZTAKI, in collaboration with the Genome Analysis and Bioinformatics Group at the Department of Statistics, University of Oxford, has developed a Bayesian Markov chain Monte Carlo tool for analysing the evolution of metabolic networks.

“Nothing in biology makes sense except in the light of evolution”. The famous quote by Theodosius Dobzhansky (1900-1975) has been the central thesis of comparative bioinformatics. In this field, biological function, structure or rules are inferred by comparing entities (DNA sequences, protein sequences, metabolic networks, etc.) from different species. The observed differences between the entities can be used to predict the underlying function, structure or rule that would be too expensive and laborious to infer directly in the lab. These comparative methods have been very successful in silico approaches, for example in protein structure prediction. The idea can be used for inferring metabolic networks, too.

Metabolic networks are under continuous evolution. Most organisms share a common set of reactions as a part of their metabolic networks that relate to essential processes. A large proportion of the reactions present in different organisms, however, are specific to the needs of individual organisms or tissues. The regions of metabolic networks corresponding to these non-essential reactions are under continuous evolution. By comparing metabolic networks from different species, we can find out which parts of the metabolic network are essential (ie those that are common to all networks) and which are non-essential (ie those that are missing in at least one of the networks). Sometimes there is more than one possible metabolic network that can synthesise or degrade a specific chemical. These alternative solutions can be transformed into each other, and the ensemble of all possible reactions forms a complicated network (see Figure 1).

The central question is: what are the possible evolutionary pathways through which one metabolic network might evolve into another? This question is especially important to understand in the fight against drug-resistant bacteria. Drugs that are designed to protect us against illness-causing bacteria block an enzyme that catalyzes one of the reactions of the metabolic network of the bacteria. The bacteria, however, can avoid the effects of the drug by developing an alternative metabolic pathway. If we understand how the alternative pathways evolve, we may be able to design a combination of drugs from which the bacteria cannot escape through the development of alternative pathways.

Analysis of past events is always coupled with some uncertainty about the nature and order of the events that unfolded. It is therefore crucial to infer the evolution of metabolic networks using statistical methods that properly handle the uncertainty that inevitably occurs during analysis. Bayesian methods collect the a priori knowledge into an ensemble of distributions of random variables, set up a random model describing the changes, and calculate the posterior probabilities of what could have happened. The relationship between the prior and posterior probabilities is described by the Bayes theorem, as shown in Figure 2. Since the integral in the denominator is typically hard to calculate, the Bayes theorem is often written in the form shown in Figure 3.

The Bayes theorem in this form can be used in Monte Carlo methods to sample from the posterior distribution. The Markov chain Monte Carlo (MCMC) method sets up a Markov chain that converges to the desired distribution. After convergence, samples from the Markov chain follow the prescribed distribution.

Figure 1: The pentose phosphate pathway of Yersinia pseudotuberculosis strain YPIII. From the KEGG database, http://www.kegg.com

Figure 2: The relationship between the prior and posterior probabilities. D is the observed data (the metabolic networks), Y is the hidden data (the metabolic networks in the ancestral species that we cannot observe) and Θ is the set of parameters describing the evolution of metabolic networks (the rates with which the reactions of the network are deleted or inserted). P(Y, Θ | D) is the probability that the observed networks evolved from the set of networks Y in a mode described with parameters Θ. P(Y, D | Θ) (called the likelihood) is the probability that the random model with parameters Θ generates the observed and unobserved data, and P(Θ) is the prior distribution of the parameters.

Figure 3: Bayes theorem in the form used in Bayesian statistics. Here ∝ stands for ‘proportional to’. The theorem is also used in the narrative form posterior ∝ likelihood × prior.

MCMC Network implements the above-described Bayesian MCMC framework for inferring metabolic networks. We model the evolution of networks with a time-continuous Markov model. In this model, chemical reactions can be deleted from or inserted into the reaction network. Constraints such that the reaction network must remain functioning after inserting or deleting a reaction are added to the model. The Θ parameter set contains the insertion and deletion rates of the chemical reactions. We know a priori that the important parts of the reaction networks are usually compact parts of the network, and that the insertion and deletion rates on these important parts are smaller. We can express this knowledge with a Markov Random Field (MRF) prior distribution. In the MRF, the prior probability for a small insertion and deletion rate of a reaction is higher if the neighbouring reactions also have small insertion and deletion rates. Our approach is the first implementation of network evolution in which both the variation of insertion and deletion rates and the correlation between the rates of neighbouring chemical reactions are considered.

MCMC Network is not only the first implementation of a sophisticated model for chemical network evolution; it also has a user-friendly graphical interface. XML files downloaded from the Kyoto Encyclopedia of Genes and Genomes (http://www.kegg.com/) can be directly used as input files for the program. On the graphical interface, users can monitor the progress of the Markov chain. The results of the analysis can be visualized on the graphical interface, and can also be saved as a text file for further analysis.

In the last decade, Bayesian methods have been extremely successful for analysing sequence evolution. The evolution of metabolic networks can now also be analysed within the Bayesian framework using our software. There is a continuing debate about the major governing rules in the evolution of networks. For example, it is still unclear which rules result in networks being scale-free. Our software is a useful tool for answering such questions.

Links:
MCMC Network: http://www.ilab.sztaki.hu/~feszter/mcmcNetwork/
KEGG database: http://www.kegg.com/
Data Mining and Web Search Group: http://datamining.sztaki.hu
Genome Analysis & Bioinformatics Group: http://www.stats.ox.ac.uk/research/genome

Please contact:
Eszter Friedman
SZTAKI, Hungary
Tel: +36 1 279 6172
E-mail: [email protected]
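The formulas of Figures 2 and 3 do not survive in the text. From the definitions in the captions (D the observed networks, Y the hidden ancestral networks, Θ the insertion and deletion rates), they read, in standard notation; this is our reconstruction consistent with the captions, not a facsimile of the figures:

```latex
% Figure 2: Bayes theorem with hidden data Y and parameters \Theta
P(Y,\Theta \mid D)
  = \frac{P(Y,D \mid \Theta)\,P(\Theta)}
         {\int \sum_{Y} P(Y,D \mid \Theta)\,P(\Theta)\,\mathrm{d}\Theta}

% Figure 3: the denominator does not depend on (Y,\Theta), hence
P(Y,\Theta \mid D) \;\propto\; P(Y,D \mid \Theta)\,P(\Theta)
\qquad \text{posterior} \propto \text{likelihood} \times \text{prior}
```

The proportional form is all that MCMC needs, since the Metropolis-Hastings acceptance ratio of two states only involves the ratio of their unnormalized posteriors, so the intractable denominator cancels.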

Drug Dissolution Modelling
by Niall M. McMahon, Martin Crane and Heather J. Ruskin

Recent and continuing work in Dublin City University aims to demonstrate the utility of mathematical and numerical methods for pharmaceutics.

Controlled drug delivery is important for many illnesses. It is often important to ensure that a fixed concentration of drug is available throughout administration of the drug. One type of drug delivery system is the typical tablet that delivers drug as its surface dissolves. Dissolution testing of drug delivery systems is critically important in pharmaceutics. There are three reasons for carrying out dissolution tests: (i) to ensure manufacturing consistency, (ii) to understand factors that affect how the drug enters the blood stream and (iii) to model drug performance in the body. One industry aim is to reduce uncertainty during the development of a new drug. Mathematical and computational simulation will help to achieve this.

Work in Trinity College Dublin, and elsewhere, identified three examples of dissolution physics that are not captured by the well-known Higuchi model. These are: (1) pH changes close to the surface of the dissolving tablet, (2) the effect of particle sizes and (3) the complex hydrodynamics found in typical test devices. The effect of particle size is interesting; large particles of fast-dissolving non-drug additives increase the drug dissolution rate. Understanding how dissolution properties affect drug delivery rates during dissolution seemed a good place for theoreticians to start.

The 1998 Parallel Simulation of Drug Dissolution (PSUDO) project in the Centre for High-Performance Computing at Trinity College Dublin was formed to make improved dissolution models. The team modelled simple cylindrical compacts dissolving in a typical test apparatus. The tablets contained equally-spaced, alternating layers of drug and inert material. This configuration was used because it was a simple starting point, and techniques used to model this system might be applied to more complex configurations. Finite element code also existed to model dissolution from a layered surface.

Figure 1: 1-, 3- and 5-layer cylindrical tablets. Dark coloured layers consist of drug. Light coloured layers consist of inert materials.

The useful results from the PSUDO project prompted: (i) continued experimental work in Trinity College Dublin, (ii) efforts to understand the hydrodynamics of typical test devices, again mostly in Trinity College Dublin, and (iii) improvements to the analytical and numerical models of mass transfer; this work was led by our team in Dublin City University. In addition to the authors of this article, the team included Dr. Ana Barat, who investigated the use of probabilistic, Monte Carlo based simulation.

Current work
Recent work by the Centre for Scientific Computing & Complex Systems Modelling (Sci-Sym) at Dublin City University has sought to improve on the PSUDO models of dissolution from multi-layer tablets in pharmaceutical test devices; this work applies to the initial and final phases of a dissolving tablet's life. Results from numerical finite difference models were used to assess how changing the surface boundary conditions, in particular, affects the estimate for drug release in the initial stages of dissolution. Experimental data from our colleagues in the School of Pharmacy and Pharmaceutical Sciences in Trinity College Dublin was used to benchmark our results. How one particular surface boundary condition is implemented can lead to relative errors in the calculated drug dissolution rate of about 10% and is consequently important. However, our work suggests that much of the physics is captured by the current models.

The second part of our work looked at dissolution from small spherical particles of drug moving about in the test apparatus; this relates to one possible end-state of tablet dissolution, ie fragmentation. This work involved inserting particles into a flow field, again provided by our colleagues in Trinity, and estimating the dissolution rate, due mostly to convection. One finding was that for particles of diameter 100 microns and smaller, radial diffusion dominates as a mass transfer mechanism. This may help simplify future models.

Most of our work was implemented using Python scripts; we used Crank-Nicolson finite difference schemes, second order accurate in space and first order in time. The velocity field within the test apparatus was produced by D'Arcy et al using the commercial CFD code Fluent. LPA and Verlet integration schemes were used to calculate the motion of a particle through the USP velocity field. Mass transfer rates were calculated using empirical correlations.

Figure 2: Drug concentration close to the surface of a 1-layer tablet calculated using finite differences. The surface of the tablet is at the bottom of the image and the solvent is flowing from left to right. The concentration scale runs from 0 (no drug present) to 5 (saturated concentration of drug).

These two studies, ie (1) drug dissolution from a multi-layered tablet in a dissolution test apparatus and (2) the mass transfer from small particles moving in the test apparatus, each form a part of a future complete framework for simulating drug dissolution in a test apparatus. Work is being continued in DCU by Ms. Marija Bezbradica, a PhD candidate funded by an IRCSET Enterprise Partnership Scholarship.

Future work
In the near future, computing capability, industry needs and a growing acceptance of the utility of cross-disciplinary interaction by professions with strong traditions of experiment, eg pharmacists, will see the rapid expansion of the use of simulation in pharmaceutical research and development. At present half of the top 40 pharmas use computer simulation. By 2015 it is expected that computer simulation will be a core pharmaceutical research tool.

Acknowledgements
The National Institute for Cellular Biotechnology (NICB) supported much of this research. In addition, the authors acknowledge the additional support of Dublin City University and the Institute for Numerical Computation and Analysis (INCA). Professor Lawrence Crane of INCA was deeply involved in this work as an advisor. Dr. Anne-Marie Healy and Dr. Deirdre D'Arcy et al. of the School of Pharmacy and Pharmaceutical Sciences at Trinity College, Dublin, worked with us and allowed us to access and use their computed flow-fields.

Links:
Centre for Scientific Computing & Complex Systems Modelling, Dublin City University: http://www.sci-sym.dcu.ie/
School of Pharmacy and Pharmaceutical Sciences, Trinity College Dublin: http://www.pharmacy.tcd.ie/
Institute for Numerical Computation and Analysis (INCA): http://www.incaireland.org/

Please contact:
Niall McMahon, Dublin City University and INCA - Institute for Numerical Computation and Analysis, Ireland. E-mail: [email protected]
Martin Crane, Dublin City University, Ireland. E-mail: [email protected]
Heather Ruskin, Dublin City University. E-mail: [email protected]
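The Crank-Nicolson finite difference approach mentioned in this article can be illustrated in a few lines of Python. The following is a minimal 1-D sketch with invented parameter values (diffusion coefficient, domain size, time step), not the Sci-Sym production code: drug diffuses away from a dissolving surface held at saturated concentration, and the release rate is estimated from the surface flux.

```python
import numpy as np

# Illustrative 1-D Crank-Nicolson solver for drug diffusing away from a
# dissolving tablet surface. All parameter values are invented for the sketch.
D = 1e-9           # diffusion coefficient (m^2/s), assumed
cs = 1.0           # saturated (normalised) drug concentration at the surface
L, nx = 1e-3, 101  # domain length (m) and number of grid points
dx = L / (nx - 1)
dt = 0.05          # time step (s)
r = D * dt / dx**2

# Tridiagonal Crank-Nicolson matrices: (I - r/2 L) c_new = (I + r/2 L) c_old.
off = np.full(nx - 1, -r / 2)
A = np.diag(np.full(nx, 1 + r)) + np.diag(off, 1) + np.diag(off, -1)
B = np.diag(np.full(nx, 1 - r)) + np.diag(-off, 1) + np.diag(-off, -1)
# Dirichlet boundary conditions: c = cs at the surface, c = 0 far away.
for M in (A, B):
    M[0, :] = 0.0
    M[-1, :] = 0.0
    M[0, 0] = M[-1, -1] = 1.0

c = np.zeros(nx)
c[0] = cs
for _ in range(2000):              # march 100 s forward in time
    c = np.linalg.solve(A, B @ c)

# Dissolution (release) rate estimated from the surface flux, -D dc/dx at x = 0.
flux = -D * (c[1] - c[0]) / dx
```

A different surface boundary condition (for instance a flux-limited one) would be imposed by editing the first rows of A and B; this is exactly the kind of change whose roughly 10% effect on the computed release rate is discussed above.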

Computational Modelling and the Pro-Drug Approach to Treating Antibiotic-Resistant Bacteria

by James T. Murphy, Ray Walshe and Marc Devocelle

As antibiotic resistance continues to be a major problem in the health-care sector, research has begun to focus on different methods of treating resistant bacterial infections. One such method is called the β-lactamase-dependent pro-drug delivery system. This involves delivering inactive precursor drug molecules that are activated by the same system that normally confers resistance on bacterial cells. In theory this approach seems very promising, as it would exploit one of the bacteria's main resistance strategies. However, the complex system dynamics involved are difficult to understand through straightforward experimental observation. Therefore, new computational models and tools are needed to analyse the dynamics of this approach to treating antibiotic-resistant bacteria such as MRSA. We give an overview here of our work in this area.

A computational model, called Micro-Gen, has been developed previously, which simulates the growth of bacterial cells in culture and their interactions with anti-microbial drug molecules. The program uses an agent-based modelling approach whereby the individual bacterial cells are represented by unique software agents that are capable of flexible, autonomous action within a simulated environment. The agent-based approach means that the system as a whole can exhibit a complex behaviour that is more than the "sum of its parts". This approach allows the unique dynamics within a bacterial colony to be simulated by taking into account subtle differences within a population or across an environment.

Micro-Gen has been used in previous studies to examine the resistance mechanisms involved in the response of bacterial populations to antibiotic treatment. Studies showed that it could accurately predict the minimum inhibitory concentrations (MIC, a simple laboratory measure of antibiotic efficacy) for various common antibiotics, including penicillin G and cephalothin, against methicillin-resistant Staphylococcus aureus (MRSA) bacteria. However, an important strength of the model is that it can be used to examine new approaches for treating antibiotic-resistant bacteria and give insight into potential novel drug treatment strategies. This can aid in rational drug design by allowing a greater understanding of the underlying mechanistic principles determining response to treatment.

Recent research has focussed on adapting the existing model to explore a novel approach to treating antibiotic-resistant bacteria called the β-lactamase-dependent pro-drug delivery system. This approach involves administering a substrate-like pro-drug molecule that contains a β-lactam ring structure (a structure contained in many common antibiotics including penicillin and its derivatives). The presence of a β-lactam ring structure means that the pro-drug effectively mimics a normal antibiotic, such as penicillin, and is susceptible to cleavage by β-lactamase enzymes released from antibiotic-resistant bacterial cells. However, this cleavage does not result in the destruction of the drug (as is the case with the antibiotics) but instead triggers the selective release of a molecule with anti-microbial properties that kills or inhibits the growth of the bacterial cells.

This is considered a promising therapeutic approach because many bacterial species have evolved to produce β-lactamase enzymes in response to prolonged clinical exposure to β-lactam antibiotics. The bacterial cells produce β-lactamase enzymes as a defence mechanism because the enzymes cleave the β-lactam ring structure present in the antibiotic molecules, rendering them inactive. In the USA, it is estimated that greater than 95% of all S. aureus bacterial isolates possess resistance to penicillin, due to the expression of β-lactamase. Therefore, designing pro-drugs that specifically target these bacteria would introduce an evolutionary selective pressure contrary to that of existing β-lactam antibiotics.

The dynamics between the negative evolutionary selective pressure from pro-drugs and the positive selective pressure from β-lactam antibiotics would be an important factor to consider when assessing the possible evolution of drug resistance in bacteria in response to these two different therapeutic strategies. However, the complex interplay of biophysical, pharmacokinetic, pharmacological and epidemiological factors that would contribute to this is difficult to predict. Therefore it is important to develop novel modelling approaches that can be used to analyse the complex system dynamics involved and to develop theories that can then be tested in the lab. That is the purpose of our research: to develop new models that may be suitable to approach this problem.

Figure 1: Simple schematic representation of activation of a pro-drug molecule as a result of cleavage by a β-lactamase enzyme (blue) released by the bacterial cell. Activation results in the release of the active anti-microbial component (red), which enters the cell and destroys it.

The key interaction of the model that determines the response to pro-drug treatment is the reaction between the β-lactamase enzyme and the pro-drug (see Figure 1). This reaction is the activation step that triggers the release of an active drug compound. The success of the pro-drug approach requires the rapid and specific release of the active drug compound in the vicinity of the bacterial cells. The model contains a quantitative representation of this reaction based on Michaelis-Menten kinetic theory.

A detailed description of the Micro-Gen model is beyond the scope of this article, but for further information the reader is directed towards the publication list on our website (see link), documenting the overall program structure along with an analysis of the mechanistic basis for its output. Micro-Gen is an agent-based model, which means that it represents the system from the bottom up, looking at the individual components of a population and how their interactions contribute to the overall "emergent" dynamics of a population. In this case, the individual bacterial cells are represented by software agents that store physical traits such as their energy state or amount of antibiotic damage. The agents also have behavioural rules associated with them that dictate their actions during the simulation. The environment of the simulations is represented by a discrete, two-dimensional grid containing diffusible elements such as nutrients, β-lactamase enzymes and the anti-microbial (pro-)drug compounds.

It is early days in the research of pro-drug compounds and only limited laboratory data is available. However, the power of the computational approach for elucidating the mechanisms of action of novel drug compounds can be an important asset to have. In conjunction with laboratory testing, important insights can be made into the complex interplay of the different components in the pro-drug delivery system using an agent-based modelling approach. Initial testing with our own model has involved carrying out simulations based on data from real-life pro-drug candidates in order to compare the model output to experimental results (in press). This has demonstrated its usefulness in providing an integrated understanding of the unique dynamics of the pro-drug delivery system, in order to assess its effectiveness as a viable alternative treatment strategy for microbial infectious diseases.

Links:
http://www.computing.dcu.ie/~jamurphy/
http://sci-sym.computing.dcu.ie/

Please contact:
James T. Murphy, Dublin City University, Ireland.
Tel: +353 1 700 6741
E-mail: [email protected]
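The Michaelis-Menten activation step at the heart of the model can be sketched numerically. The rate constants below are invented for illustration and are not the fitted Micro-Gen parameters; the sketch only shows β-lactamase converting pro-drug S into active drug P at rate Vmax·S/(Km+S).

```python
# Sketch of the Michaelis-Menten activation step: beta-lactamase converts
# pro-drug S into active drug P. Rate constants are invented for illustration.
Vmax = 0.5            # maximal conversion rate (uM/min), assumed
Km = 10.0             # Michaelis constant (uM), assumed
S, P = 100.0, 0.0     # initial pro-drug and active-drug concentrations (uM)
dt = 0.01             # time step (min)

for _ in range(60000):            # 600 minutes of forward-Euler integration
    v = Vmax * S / (Km + S)       # Michaelis-Menten reaction rate
    S -= v * dt
    P += v * dt
```

Mass is conserved (S + P stays at 100 uM) and the pro-drug is almost fully converted after 600 minutes; in the full model this reaction is coupled to the diffusing enzyme and drug concentrations on the two-dimensional grid.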


Computational Systems Biology in BIOCHAM

by François Fages, Grégory Batt, Elisabetta De Maria, Dragana Jovanovska, Aurélien Rizk and Sylvain Soliman

The application of programming concepts and tools to the analysis of living processes at the cellular level is a grand challenge, but one that is worth pursuing, as it offers enormous potential to help us understand the complexity of biological systems. To this end, the Biochemical Abstract Machine (Biocham) combines biological principles with formal methods inspired by programming, to act as a modelling platform for Systems Biology. It is being developed by the Contraintes research team at INRIA.

Biologists use diagrams to represent complex systems of interaction between molecular species. These graphical notations encompass two types of information: biochemical reactions (eg protein complexation, modification, binding to a gene, etc.) and regulations (of a reaction or transcription). Based on these structures, mathematical models can be developed by equipping such molecular interaction networks with kinetic expressions, leading to quantitative models. There exist two general categories of quantitative model: ordinary differential equations (ODEs), which offer a continuous interpretation of the kinetics; and continuous-time Markov chains (CTMCs), which provide a stochastic interpretation of the kinetics.

The Systems Biology Markup Language (SBML) uses a syntax of reaction rules with kinetic expressions to define such reaction models in a precise way. Nowadays, an increasing collection of models of various biological processes is available in this format in model repositories, such as www.biomodels.net, and an increasing collection of ODE simulation or analysis software platforms is now compatible with SBML.

Since 2002, we have been investigating the transposition of programming concepts and tools to the analysis of living processes at the cellular level. Our approach relies on a logical paradigm for systems biology which is based on the following definitions:
• biological model = (quantitative) state transition system
• biological properties = temporal logic formulae
• biological validation = model-checking
• model inference = constraint solving.

Our modelling software platform Biocham (http://contraintes.inria.fr/biocham) is based on this paradigm. An SBML model can be interpreted in Biocham at three abstraction levels:
• the continuous semantics (ODEs on molecular concentrations)
• the stochastic semantics (a CTMC on numbers of molecules)
• the Boolean semantics (asynchronous Boolean state transitions on the presence/absence of molecules).

Of the three levels, the Boolean semantics is the most abstract. It can be used to analyse large interaction networks without known kinetics. These formal semantics have been related within the framework of abstract interpretation, showing, for instance, that the Boolean semantics is an abstraction of the stochastic semantics, ie that all possible stochastic behaviours can be checked in the Boolean semantics, and that if a Boolean behaviour is not possible, it cannot be achieved in the quantitative semantics for any kinetics.

The use of model-checking techniques, developed over the last three decades for the analysis of circuits and programs, is the most original feature of Biocham. The temporal logics used to formalize the properties of the behaviour of the system are, respectively, the Computation Tree Logic, CTL, for the Boolean semantics, and a quantifier-free Linear Time Logic with linear constraints over the reals, LTL(R), for the quantitative semantics.

Figure 1: Screenshot of BIOCHAM.

Biocham has been used to query large Boolean models of the cell cycle (eg Kohn's map of 500 species and 800 reaction rules) by symbolic model-checking, formalize phenotypes in temporal logic in a much more flexible way than by curve fitting, search parameter values from temporal specifications, measure the robustness of a system with respect to temporal properties, and develop in this way quantitative models of cell signalling and the cell cycle for cancer therapies.

For some time, an important limitation of this approach was due to the logical nature of temporal logic specifications and their Boolean interpretation by true or false. By generalizing model-checking techniques to temporal logic constraint solving, a continuous degree of satisfaction has been defined for temporal logic formulae, opening up the field of model-checking to optimization in high dimension.

Figure 2: Continuous satisfaction degree of a temporal logic formula for oscillations with a constraint of amplitude, in a landscape obtained by varying two parameters.

Biocham is freely distributed under the GPL license. In our group, it is currently used in collaboration with biologists at INRA for developing models of mammalian cell signalling, at INSERM and in the European consortium EraSysBio C5Sys for developing models of the cell cycle for optimizing cancer chronotherapies, and, since 2007, in MIT's iGem competition for designing synthetic biology circuits.

Link:
http://contraintes.inria.fr/Biocham

Please contact:
François Fages
INRIA Paris-Rocquencourt, France
E-mail: [email protected]
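The asynchronous Boolean semantics described in this article can be illustrated with a toy reachability check. This is plain Python, not Biocham syntax: a state is the set of molecules present; a reaction whose reactants are all present may fire, adding its products, while each reactant may or may not be entirely consumed; the CTL-style property EF(d) then asks whether molecule d is producible.

```python
from itertools import chain, combinations

# Toy Boolean abstraction of a reaction system (not Biocham syntax):
# states are sets of molecules present; a reaction with all reactants
# present may fire, and each reactant may or may not be fully consumed.
reactions = [({"a", "b"}, {"c"}),   # a + b -> c
             ({"c"}, {"d"})]        # c -> d

def successors(state):
    for reactants, products in reactions:
        if reactants <= state:
            # Every subset of the reactants may disappear when the rule fires.
            subsets = chain.from_iterable(
                combinations(sorted(reactants), k)
                for k in range(len(reactants) + 1))
            for gone in subsets:
                yield frozenset((state - set(gone)) | products)

def reachable(init):
    seen, todo = {init}, [init]
    while todo:                      # exhaustive state-space exploration
        for nxt in successors(todo.pop()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

states = reachable(frozenset({"a", "b"}))
ef_d = any("d" in s for s in states)   # EF(d): d is producible from {a, b}
```

A symbolic model checker does the same exploration over sets of states rather than one state at a time, which is what makes Kohn's-map-sized networks tractable.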

Prediction of the Evolution of Thyroidal Lung Nodules Using a Mathematical Model

by Thierry Colin, Angelo Iollo, Damiano Lombardi and Olivier Saut

Refractory thyroid carcinomas are a therapeutic challenge: some are fast-evolving, and consequently good candidates for trials with molecular targeted therapies, whilst others evolve slowly. This variation makes it difficult to decide when to treat. In collaboration with Jean Palussière and Françoise Bonichon at the Institut Bergonié (regional centre for the fight against cancer), we have developed a diagnostic tool to help physicians predict the evolution of thyroidal lung nodules.

The evolution of thyroidal lung nodules may be difficult to evaluate. Furthermore, when unwell patients are concerned, physicians try to minimize the use of invasive techniques, restricting treatment (by radiofrequency ablation) to nodules that may become malignant. Thus, an accurate prognosis for each nodule is critical. We propose a numerical method of predicting the actual tumour growth for a specific patient.

Classically, accurate mathematical models describing tumoral growth involve a large number of parameters that cannot always be recovered from experimental data. The model proposed here is tuned for each patient thanks to two medical images following the evolution of a nodule. From this analysis, it is possible to obtain an estimate of the evolution of a targeted nodule using only non-invasive techniques.

Our model describes not only the volume of the tumour, but also its localization and shape. It takes into account nutrient concentration, cell-cycle regulation and the evolution of populations of cells, as well as mechanical effects. Our prediction relies on parameter estimation using temporal series of MRI or CT scans. The approach uses optimization techniques and Proper Orthogonal Decomposition (POD) to estimate the parameters of the chosen mathematical model (adapted to the type of cancer studied) that best fit the real evolution of the tumour shown on the images.

Our model is a simplified Darcy-type model describing the evolution of various cellular densities (proliferative and quiescent cancer cells, healthy tissue…) as well as nutrient distribution and mechanical variables using Partial Differential Equations (PDEs). This


Figure 1: Evolution of an untreated thyroidal nodule in the lung.

Figure 2: Results of a prediction of the evolution of the nodule based on the mathematical model.

Figure 3: In addition to computing the volume of the tumour, the model can be used to predict the localization of the tumour (plotted in red).

parametric model is sufficiently accurate to take into account the main physical features of tumour growth, but simple enough to have its parameters recovered.

Essentially, the technique may be summarized as follows. Initially, the nodule under investigation is marked by the physicians on successive CT scans. From these images we recover the geometry of the lung and the shape of the nodule at different times. From the initial shape of the nodule, we run numerous numerical simulations of our mathematical model using a large set of parameter values. A basis is extracted from this large collection of numerical results using a POD approach. This time-consuming process can be performed efficiently on a high-performance parallel architecture, since the direct simulations can be run concurrently and the POD extraction uses a parallel algorithm to be as fast as possible. The last step of the procedure consists in solving an inverse problem, based on a Newton method, to recover the parameters that best fit the available medical images of the nodule. Once these parameters are determined, a prediction is simply obtained by running the numerical code.

We tested our technique on several test cases, one of which is presented in Figure 1. For a patient with an untreated thyroidal nodule in the lung, four computerised tomography (CT) scans were available (we show three of them, illustrating the evolution of the nodule, in Figure 1).

We used the first two scans to perform the data assimilation and recover the parameters of the mathematical model adapted to the patient and to this nodule. Once these parameters are determined, the model is used to obtain predictions of the evolution of this nodule. The results of this prediction are plotted in Figure 2. The measured volume of the nodule from the CT scans is plotted using circles; the continuous line represents the volume computed using our tuned mathematical model.

Only the first two scans were used for data assimilation; the others are shown for the purpose of comparison. In addition to computing the volume of the tumour, our model enables us to predict its localization, as shown in Figure 3, where the computed tumour is plotted in red.

For oncologists, the development of such tools is of interest in therapy planning (and in the evaluation of an anti-tumoral treatment). For example, a prediction of a slowly evolving tumour could reinforce the decision to wait without specific treatment. In the opposite case, the simulation can support the decision to start a radiofrequency thermal ablation (for example) or a molecular targeted therapy.

Links:
INRIA Research Team MC2: http://www.math.u-bordeaux1.fr/MAB/mc2/
Institut Bergonié: http://www.bergonie.org/

Please contact:
Olivier Saut, INRIA, France
Tel: +33 5 40002115
E-mail: [email protected]
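The POD step of the procedure described above can be sketched with an SVD. The snapshot matrix below is random stand-in data of low rank, not real tumour simulations; the point is only how a reduced basis capturing a chosen fraction of the energy is extracted and then used to project a state.

```python
import numpy as np

# Sketch of POD basis extraction via SVD. The snapshot matrix is a random,
# low-rank stand-in for a collection of simulated tumour-density fields.
rng = np.random.default_rng(0)
n_cells, n_snapshots = 500, 40
snapshots = rng.standard_normal((n_cells, 5)) @ rng.standard_normal((5, n_snapshots))

U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
energy = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(energy, 0.99)) + 1   # smallest basis with 99% energy
basis = U[:, :k]                             # POD basis vectors

# A snapshot projected onto the reduced basis and back again.
x = snapshots[:, 0]
x_pod = basis @ (basis.T @ x)
err = np.linalg.norm(x - x_pod) / np.linalg.norm(x)
```

The inverse problem is then solved for the handful of coefficients `basis.T @ x` instead of the full grid values, which is the kind of reduction that makes the Newton iterations affordable.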

Dynamical and Electronic Simulation of Genetic Networks: Modelling and Synchronization

by Alexandre Wagemakers and Miguel A. F. Sanjuán

Cells can be considered to be dynamical systems that possess a high level of complexity due to the quantity and range of interactions that occur between their components, particularly between proteins and genes. It is important to acquire an understanding of these interactions, since they are responsible for regulating fundamental cellular processes.

The climax of the genome project, whose most notable success has been the complete sequencing of the human genome, was the identification of all the genes that comprise the genetic material of an organism. This achievement led to a new phase of the project: the postgenomic era. Research is now focusing on understanding the organization of, and interactions between, proteins, the products of gene expression. Each protein is in charge of a function which can induce changes in other molecules in the cell, such as enzymes or even hormones. These molecules can be viewed as the nodes of a network where the interactions are the links. Thus we can view the system as a complex network of regulation interactions which is responsible for the functioning of the cell.

Recently, the design and construction of artificial networks has been proposed as a means of studying biological processes, such as oscillations of the metabolism. These networks, simpler than the natural ones, can contribute to the understanding of the molecular basis of a specific function. Simple mathematical models can be constructed in order to perform qualitative and numerical analyses, and synthetic genetic networks can even be synthesized in a laboratory. These works, among others, gave birth to the so-called synthetic biology, which integrates several scientific fields such as nonlinear dynamics, the physics of complex systems and molecular bioengineering. This is a newly emerging field with a strong interdisciplinary component in which future advances seem very promising.

The paradigmatic examples of synthetic genetic networks are the genetic toggle switch and the repressilator. The genetic toggle switch is the combination of two mutually repressing genes forming a bistable system whose state can be changed with an external signal. One can say that this genetic switch has memory, since it remains in its current state until an external inducer acts again.
The second paradigmatic system is the repressilator, which is, in fact, a genetic oscillator. In this system three repressor genes are placed in a ring, with each repressor inhibiting the production of the following protein with a given delay. This configuration leads to oscillations in the expression of the three proteins.
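Both paradigmatic circuits can be written as small ODE systems. A minimal sketch of the toggle switch, with assumed Hill-type repression and illustrative parameters (not fitted to any laboratory construct), shows the bistability and memory just described:

```python
# Minimal ODE sketch of the genetic toggle switch: two genes, u and v, repress
# each other through assumed Hill kinetics. Parameters are illustrative only.
def toggle(u, v, alpha=4.0, n=2, dt=0.01, steps=5000):
    for _ in range(steps):          # forward-Euler integration to t = 50
        du = alpha / (1 + v**n) - u
        dv = alpha / (1 + u**n) - v
        u, v = u + du * dt, v + dv * dt
    return u, v

# Two different initial inductions settle into two different stable states.
state_a = toggle(3.0, 0.0)   # gene u was induced: high u, low v
state_b = toggle(0.0, 3.0)   # gene v was induced: high v, low u
```

Each state persists until an external inducer acts again: that is the memory. The repressilator is obtained analogously by chaining three such repression terms in a ring.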

Our work in this field, in the Nonlinear Dynamics, Chaos and Complex Systems Group at the Universidad Rey Juan Carlos (URJC), consists mainly of the application of nonlinear dynamics techniques to the modelling and simula- tion of genetic networks. The evolution of the protein concentration of a partic- ular gene can be represented mathemat- ically with a set of ordinary differential equations. Once these equations are defined, we can apply methods from nonlinear dynamics such as phase space analysis, bifurcation diagrams and sta- bility analysis to understand and predict the behavior of any synthetic genetic network. Furthermore these tools are useful for the design and study of labo- ratory experiments.

Figure 1: Genetic networks in living cells can first be identified with molecular genetics techniques. Once the network is identified, a mathematical model is developed and analysed using nonlinear dynamics methods and electronic modelling.

We have also proposed an alternative way to design and analyse genetic networks: with analog electronic circuits. The dynamics of a regulatory genetic network can be simulated with very simple nonlinear circuits based on MOSFET transistors. Our circuits allow a one-to-one correspondence between the structure of the genetic and electronic networks, and their analog character extends this correspondence to the full dynamical behavior. An obvious benefit of this approach is that the electronic circuits are easier to implement experimentally than genetic circuits. We have applied this technique successfully in studies of the dynamics and synchronisation of populations of genetic networks, such as the repressilator and the toggle switch. Synchronisation of a population of such units has been thoroughly studied, with the aim of comparing the role of global coupling with that of global forcing on the population. We have also analysed a method for the prediction of the synchronisation of a network of electronic repressilators, based on the Kuramoto model. Our research indicates that nonlinear circuits of this type can be helpful in the design and understanding of synthetic genetic networks.

Link:
http://www.fisica.escet.es

Please contact:
Alexandre Wagemakers
Universidad Rey Juan Carlos, Madrid, Spain
Tel: +34 91 4888242
E-mail: [email protected]
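The Kuramoto-based synchronisation analysis can be sketched as follows. Each electronic repressilator is reduced to a single phase oscillator with global coupling K; all values here are illustrative, not measurements of our circuits. The order parameter r stays near 0 for an incoherent population and approaches 1 when the population synchronises.

```python
import math
import random

# Kuramoto sketch: N phase oscillators (stand-ins for coupled electronic
# repressilators) with global coupling strength K. Values are illustrative.
def simulate(K, N=100, dt=0.01, steps=4000, seed=2):
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, 0.5) for _ in range(N)]       # natural frequencies
    theta = [rng.uniform(0.0, 2 * math.pi) for _ in range(N)]
    for _ in range(steps):
        # Mean-field order parameter r * exp(i * psi).
        re = sum(math.cos(t) for t in theta) / N
        im = sum(math.sin(t) for t in theta) / N
        r, psi = math.hypot(re, im), math.atan2(im, re)
        theta = [t + (w + K * r * math.sin(psi - t)) * dt
                 for t, w in zip(theta, omega)]
    re = sum(math.cos(t) for t in theta) / N
    im = sum(math.sin(t) for t in theta) / N
    return math.hypot(re, im)     # final order parameter r, in [0, 1]

r_weak = simulate(K=0.1)      # weak coupling: population stays incoherent
r_strong = simulate(K=4.0)    # strong coupling: population synchronises
```

Sweeping K and watching where r jumps locates the synchronisation threshold, which is the quantity such an analysis predicts for a network of electronic repressilators.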

Formal Synthetic Immunology

by Marco Aldinucci, Andrea Bracciali and Pietro Lio'

The human immune system fights pathogens using an articulated set of strategies whose function is to keep the organism in health. A large effort to formally model such a complex system using a computational approach is currently underway, with the goal of developing a discipline for engineering "synthetic" immune responses. This requires the integration of a range of analysis techniques developed for formally reasoning about the behaviour of complex dynamical systems. Furthermore, a novel class of software tools has to be developed, capable of efficiently analysing these systems on widely accessible computing platforms, such as commodity multi-core architectures.

Computational approaches to future. The first trend draws from formal tual labs. A good opportunity to achieve immunology represent an important area methods, ie a set of description tech- this is provided by the recent design of systems biology where both multi- niques that allows qualitative and quan- shift towards multi-core architectures scale and time and spatial dynamics play titative aspects of system behaviour to that make high computational power important roles. In order to explore how be described and formally analysed. diffusely available. To exploit this different modelling and computational Generally, models consist of populations power, however, software tools need to techniques can be better integrated to of agents/entities, or aggregate abstrac- be redesigned to match the new archi- support in silico experiments, a collabo- tions of them, eg viruses or lympho- tectures, which may suffer from ineffi- ration amongst researchers of Italian and cytes. These play a part in the economy ciencies of the shared memory sub- British institutions has been established. of the whole system by means of the system due to poor memory access hot- In the long term we anticipate that this possible behaviour they exhibit. The spots and caching behaviour. Within the will lead to the engineering of synthetic overall behaviour of the system emerges FastFlow programming framework [1], immune responses. from the interaction between agents and which provides a cache-aware, lock- can be observed by means of simula- free approach to multi-threading, we In silico experiments involve the repro- tions that account for either probabilistic have developed StochKit-FF (see appli- duction of the dynamics occurring or averaged evolutions from given ini- cation page at [1]), an efficient parallel between the immune system and tial conditions. Also, the system can be version of StochKit, a reference sto- pathogens. 
These models can be used to observed in its transient or equilibrium chastic simulation tool-kit. simulate the mechanisms and the dynamics, and its properties, when pre- emerging behaviour of the system, to cisely expressed in a formal language, In our reference scenario we use aspects test new hypotheses and to predict their can be verified (model checked) against of the HIV dynamics within a patient, as effects, hence, they will need to be able the system behaviour. In particular the a model infectious disease. Many dif- to measure, analyse and formally reason formal analysis can be tightly coupled ferent viral strains contribute to infec- about the system behaviour. By means with stochastic simulations in order to tion progression; they use different cell of such a virtual lab one can try to design improve the information obtained from receptors (CCR5 and CXCR4) and con- novel "synthetic" responses and drug the simulation results. These sequently several anti-HIV therapies treatments that may then be imple- approaches, particularly the stochastic target these receptors. Progression of mented in the organic world. ones under certain hypotheses, are the disease (AIDS) is dependent on extremely computationally expensive. interactions between viruses and cells, We are exploring the feasibility of com- the high mutation rate of viruses, the bining two research trends, which will The second trend aims to enhance the immune response of individuals and the certainly be further developed in the near precision and effectiveness of these vir- interaction between drugs and infection

40 ERCIM NEWS 82 July 2010

Figure 1: A single stochastic simulation over the [1-1000] day interval. Levels of cells from the immune system during HIV infection are shown on the y-axis. Importantly, the decay of T below about 200 represents the passage to AIDS. This emerges at about day 500.

Figure 2: The same scenario as that represented in Figure 1, with the addition of a treatment period starting at about day 200. The effect of the treatment can be "visually" appreciated in terms of a higher amount of T (and Z5) during the treatment period, with respect to the same time interval in Figure 1.

dynamics. These phenomena are inherently stochastic, subject both to intrinsic and extrinsic noise. Some simulation results are shown in Figure 1 and Figure 2. Furthermore, we are quantitatively assessing different scenarios regarding the immune system response. Probabilistic model checking, for instance, can provide a greater understanding of infection models, by shifting the attention from an informative, but empirical, analysis of the graphs produced by simulations towards more precise quantitative interpretations. For instance, this technique can allow us to determine the probability of immune system failure with or without a support therapy over a given time interval. Beyond the visual evidence, it is possible to determine that such a probability for the cases in Figure 1 and Figure 2 is 0.377 vs. 0.855 (results obtained by the PRISM model checker [2]; see [3] for details).

StochKit-FF allows us to run multiple concurrent stochastic simulations efficiently. Results based on a large number of simulations are much more informative than a few probabilistic outcomes. Since the cost of saving comprehensive information about each simulation can become prohibitive, StochKit-FF allows the on-line parallel reduction of solution trajectories from different simulation runs by way of user-defined statistical estimators, such as average and variance. The relative speed of StochKit-FF with respect to the original StochKit for the HIV model is reported in Figure 3.

Currently, we are investigating more structured ways of collecting information from multiple runs and the possibility of exploiting such multiple runs as approximated model knowledge for model checking purposes. Efficient simulations will make possible an effective investigation of relevant characteristics of the virus attack and the immune defence, while the formal analysis will provide the key tool to identify the best anti-viral drug therapy.

This project is a British-Italian collaboration (Universities of Cambridge, Torino and Stirling, ISTI-CNR), partially supported by HPC-Europe, EMBO and the CNR project RSTL-XXL.
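The run-many-and-reduce scheme described above can be sketched in ordinary Python. This is only an illustration: StochKit-FF itself is a C++/FastFlow tool, and the toy birth-death model, its rates, the time grid and the thread pool below are our own assumptions, not the actual HIV model or API.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def ssa_birth_death(seed, x0=100, birth=0.09, death=0.1, t_end=50.0, n_points=51):
    """One Gillespie (SSA) trajectory of a toy birth-death species,
    sampled on a fixed time grid so that runs can be reduced point-wise."""
    rng = random.Random(seed)
    grid = [t_end * i / (n_points - 1) for i in range(n_points)]
    traj, x, t, k = [], x0, 0.0, 0
    while k < n_points:
        rate = (birth + death) * x
        t_next = t + rng.expovariate(rate) if rate > 0 else float("inf")
        while k < n_points and grid[k] <= t_next:  # record state up to next event
            traj.append(x)
            k += 1
        t = t_next
        if rate > 0:  # fire one reaction: birth or death
            x += 1 if rng.random() < birth / (birth + death) else -1
    return traj

def reduce_runs(runs):
    """On-line (Welford) point-wise reduction of many runs to mean/variance,
    so complete trajectories never need to be stored together."""
    n, mean, m2 = 0, None, None
    for traj in runs:
        n += 1
        if mean is None:
            mean, m2 = [float(x) for x in traj], [0.0] * len(traj)
        else:
            for i, x in enumerate(traj):
                d = x - mean[i]
                mean[i] += d / n
                m2[i] += d * (x - mean[i])
    return mean, [v / (n - 1) if n > 1 else 0.0 for v in m2]

# 32 independent, individually seeded runs executed concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    mean, var = reduce_runs(pool.map(ssa_birth_death, range(32)))
```

Because each run carries its own seeded generator, the ensemble is reproducible regardless of scheduling, which mirrors how per-run statistics can be merged deterministically after a parallel sweep.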

Links:
[1] http://mc-fastflow.sourceforge.net
[2] http://www.prismmodelchecker.org/
[3] http://www.biomedcentral.com/1471-2105/11/S1/S67

Please contact:
Andrea Bracciali
ISTI-CNR, Italy and Computing Science and Mathematics, University of Stirling, UK
E-mail: [email protected], [email protected]

Figure 3: Speedup of StochKit-FF against the original StochKit for 32, 64, and 128 runs of the HIV stochastic simulation. StochKit-FF runs are concurrently executed; their outputs are reduced by way of the average and variance functions. StochKit-FF exhibits a super-linear speedup, ie it is always more than n-fold faster than StochKit when running on an n-core platform.

Pietro Lio'
Computer Laboratory, University of Cambridge
E-mail: [email protected]

Special Theme: Computational Biology

An Ontological Quantum Mechanics Model of Influenza

by Wah-Sui Almberg and Magnus Boman

Seasonal flu has prevailed in the temperate zones for 400 years without adequate scientific explanation. We suggest a novel approach to modelling influenza building on the ideas and theories of David Bohm.

Our work originates from a project that has been underway since 2002. The project takes a multi-disciplinary approach to modelling the spread of infectious disease, employing research from medicine, computer science, statistics, mathematics, and sociology. This approach led not only to a large programming effort, hosted by the Swedish Institute for Infectious Disease Control, but also to the formation of the Stockholm Group for Epidemic Modeling (S-GEM). The micro-model built by the project group holds nine million individuals, corresponding to the entire population of Sweden. Variations of this model have been used for smallpox and influenza, and results have been published within academia. The model will go open source in 2010.

Most mathematical work in the project, as in epidemiology in general, focuses on traditional macro-models in the Susceptible-Infectious-Recovered (SIR) family. Additionally, the project considers micro- and meso-level effects such as individual movement (between work and home, for example) and family structures (for instance, where a person lives, and with whom). Whilst being tentative and somewhat speculative, the work described in this short overview is potentially extremely fruitful. It targets the enigma of seasonal influenza, and how it relates to the prospect of future lethal pandemics. Resting on the foundation of our experience of micro-meso-macro modelling in the burgeoning field of computational epidemiology, we propose a novel way of modelling influenza. This work is carried out in cooperation between Stockholm University, the Royal Institute of Technology and the Swedish Institute of Computer Science (SICS). While our international network is extensive, we would welcome any input from the ERCIM partners on our work, to be presented as a PhD thesis by Wah-Sui Almberg in 2010.

Infectious disease is the cause of more human deaths than any other single factor, and influenza kills a larger number of people than any other infectious disease. Even in less severe years, more than a million deaths worldwide can be attributed to influenza. Adding to the burden of the direct suffering caused by contraction of the disease, influenza may have negative social consequences, including a drastic fall in productivity and possible societal conflicts; effects that may cause further spread of the disease and make it more difficult to care for the sick.

Currently, little is known about the transmission process of influenza. In the case of the thoroughly investigated H5N1 virus, for instance, the World Health Organisation acknowledged that no scientific explanation for its unusual pattern of spread could be identified. Furthermore, we do not know which subtype—there are 16 hemagglutinin (H) and 9 neuraminidase (N) subtypes—the next pandemic will have. Hence, it is very hard to develop a vaccine and determine its efficacy beforehand.

In the temperate zones, the epidemic spread of influenza is strongly governed by forces that coincide with season, with epidemic activity usually peaking in winter. In these regions, summer outbreaks are uncommon and do not develop into epidemics. There are various explanations for this phenomenon, which consider the role of climate, school semesters, and complex network transmission effects. We hypothesise that a unification of micro-, meso- and macro-modelling within a single model would give new insights into the social dynamics involved in the spread of influenza. Since micro- and macro-modellers traditionally represent opposite and competing sides, however, and since their synthesis is a long-unresolved problem, we opt for a radically different kind of unification. David Bohm, encouraged by Einstein, outlined an approach that could resolve the micro-macro dichotomy. Originally, his aim was to bring quantum mechanics (a micro perspective) and the theory of relativity (a macro perspective) under one theoretical umbrella, but the philosophy and the mathematics that emerged suggest a more general applicability.

Figure 1: Depiction of the Influenza virus protein (neuraminidase) structure, using the NASA Ribbons program. Image courtesy of NASA.

The ontological theory developed by Bohm is a kind of algebra that was originally designed to describe thought and consciousness, but it has also proven valuable for understanding the social dynamics of financial systems. Essentially, Bohm's scheme is one of participation rather than of interaction. In a set of individuals constituting a population, every individual participates in every other individual's existence and actions; that is, the totality is present implicitly in every individual, and every individual makes a mark on the totality. Bohm refers to this holistic concept as the implicate order. The concept of wholeness makes it impossible for primary laws to be summarized in a simple set of statements, since every aspect of reality enfolds all other aspects of it in the implicate order. In contrast, the explicate order, which the current laws of the natural sciences are based upon, refers to the apparent reality of things. Starting with the explicate order, Bohm extracted the operations displacement (D) and rotation (R) in Euclidean space. Founding a description of reality on D and R, instead of on static elements such as straight lines, circles, etc, Bohm created a system in which process and movement are primary, and spatial coordinates of a secondary nature. As these operators are multiplied and added, orders and measures can be defined. Bohm then added transformations to all these basic operations, as is customary, but then he introduced an operator called metamorphosis (M). Using M, sets of transformations (E) can be translated to other Es. This is mathematically stated as E' = MEM⁻¹. Adding a unit and a zero operation, Bohm had an algebra for the implicate order. The implicate order is, to date, an incomplete theory, but important steps in that direction were taken by Bohm, together with Hiley, when they developed their complete theory of ontological quantum mechanics. This is the theory that we are now attempting to use for modelling influenza.

Links:
S-GEM: http://s-gem.se/
Synthetic Populations: http://dsv.su.se/syntpop

Please contact:
Magnus Boman
SICS and KTH/ICT/SCS
E-mail: [email protected], [email protected]

Wah-Sui Almberg
Stockholm University
E-mail: [email protected]
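Bohm's metamorphosis rule E' = MEM⁻¹ is an ordinary similarity transformation, which can be illustrated numerically by writing the D and R operators as homogeneous 2D matrices. This is only a toy sketch of the matrix mechanics, not Bohm's full implicate-order algebra, and the quarter-turn example is our own invention:

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices (homogeneous 2D coordinates)."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def displacement(dx, dy):  # Bohm's D operator
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]

def rotation(theta):       # Bohm's R operator
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def metamorphosis(m, m_inv, e):
    """E' = M E M^-1: translate one transformation E into another via M."""
    return matmul(matmul(m, e), m_inv)

# Conjugating a unit displacement along x by a quarter-turn rotation
# yields the same displacement re-expressed in the rotated frame: along y.
E = displacement(1.0, 0.0)
E_prime = metamorphosis(rotation(math.pi / 2), rotation(-math.pi / 2), E)
```

The example shows the sense in which M "translates" a set of transformations: the displacement itself is unchanged as a process, but its coordinate expression is carried into a new frame.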

Epidemic Marketplace: An e-Science Platform for Epidemic Modelling and Analysis

by Fabrício A. B. da Silva, Mário J. Silva and Francisco M. Couto

The Epidemic Marketplace is a distributed data management platform where epidemiological data can be stored, managed and made available to the scientific community. The Epidemic Marketplace is part of a computational framework for organising and distributing data for epidemic modelling and forecasting, dubbed Epiwork. The platform will assist epidemiologists and public health scientists in sharing and exchanging data.

The Epidemic Marketplace is an e-Science platform for collecting, storing, managing and providing semantically annotated epidemic data collections. In recent years, the availability of a huge volume of quantitative social, demographic and behavioural data has spurred an interest in the potential of innovative technologies to improve disease surveillance systems, by providing faster and better geo-referenced outbreak detection capabilities. These capabilities depend on the availability of finely-tuned models, which require accurate and comprehensive data. However, the increasing amount of data introduces the problem of data integration and management. New solutions are needed to ensure that data are correctly stored, managed and made available to the scientific community.

The Epidemic Marketplace is part of a European research effort, the Epiwork project, a four-year project started in 2009. Epiwork supports multidisciplinary research aimed at developing the appropriate framework of tools and knowledge needed for the design of epidemic forecast infrastructures, to be used by epidemiologists and public health scientists. The project is a truly interdisciplinary effort, anchored to the research questions and needs of epidemiology research by the participation of epidemiologists, public health specialists, mathematical biologists and computer scientists. The Epidemic Marketplace is the Epiwork data integration platform, where epidemiological data can be stored, managed and made available to investigators, thus fostering collaboration. The objectives of Epiwork in which the Epidemic Marketplace will play a direct role are: (1) the development of large-scale, data-driven computational models endowed with a high level of realism and aimed at epidemic scenario forecast; (2) the design and implementation of original data-collection schemes motivated by identified modelling needs, such as the collection of real-time disease incidence, achieved with the use of innovative Internet and ICT applications; and (3) the setup of a computational platform for epidemic research and data sharing that will generate important synergies between research communities and states.

The architectural requirements of the Epidemic Marketplace are directly related to the objectives of the Epiwork project and have been defined according to feedback from its partners. The main functional requirements of the Epidemic Marketplace are:
• Support the sharing and management of epidemiological data sets. Registered users should be able to upload annotated data sets, and a data set quality assessment mechanism should be available.
• Support the seamless integration of multiple heterogeneous data sources. Users should be able to have a unified view of related data sources. Data should be available from streaming, static and dynamic sources.
• Support the creation of a virtual community for epidemic research. The platform will serve as a forum for discussion that will facilitate the sharing of data between providers and modellers.
• Distributed architecture. The Epidemic Marketplace should implement a geographically distributed architecture deployed in several sites for improved data access performance, availability and fault-tolerance.
• Support secure access to data. Access to data should be controlled. The marketplace should provide single sign-on, distributed federated authorization and multiple access policies, customizable by users.
• Support data analysis and simulation in grid environments. The Epidemic Marketplace will provide data analysis and simulation services in a grid environment.


• Workflow. The platform should provide workflow support for data processing and external service interaction.

The main non-functional requirements that have been identified for the Epidemic Marketplace are:
• Interoperability: the Epidemic Marketplace must interoperate with other software. Its design must take into account the future possibility that systems developed by other researchers worldwide may need to query the Epidemic Marketplace catalogue for access to its datasets.
• Open source: all software packages and new modules required for the implementation and deployment of the Epidemic Marketplace should be open source.
• Standards-based: to guarantee software interoperability and the seamless integration of all geographically dispersed sites of the Epidemic Marketplace, the system will be built according to standards defining web services, authentication and metadata.

The Epidemic Marketplace can be defined as a distributed virtual repository, a platform supporting transparent, seamless access to distributed, heterogeneous and redundant resources. It is a virtual repository because data can be stored in systems that are external to the Epidemic Marketplace, and it provides transparent access because several heterogeneities are hidden from its users. The Epidemic Marketplace is composed of a set of geographically distributed, interconnected data management nodes sharing common canonical data models, authorization infrastructure and access interfaces.

As shown in Figure 1, each Epidemic Marketplace node has the following modules:
• Repository: stores epidemic data sets and an epidemic ontology to characterize the semantic information of the data sets.
• Mediator: a collection of web services that will provide access to internal data and external sources, based on a catalogue describing existing epidemic databases through their metadata, using state-of-the-art semantic-web/grid technologies.
• MEDCollector: retrieves information relating to real-time disease incidences from publicly available data sources, such as social networks. After retrieval, the collector groups the incidences by subject and creates data sets to store in the repository.
• Forum: allows users to post comments on integrated data from other modules, fostering collaboration among modellers.

A first prototype version of the Epidemic Marketplace is already in use internally. This prototype implements several of the main features of the outlined architecture, such as data management and sharing support and secure access to data, and is currently being populated with epidemic data collections. Several open-source tools and open standards are being used in the Epidemic Marketplace implementation and deployment process, such as Fedora Commons for the implementation of the main features of the repository. Access control in the platform uses the XACML, LDAP and Shibboleth standards. The front-end of this first prototype is based on Muradora, but the next version will also include a front-end based on the Drupal content management system.

The Epiwork project is funded by the European Commission under the Seventh Framework Programme. The authors want to thank the Epiwork project partners, the students who have been working on the initial implementation of the Epidemic Marketplace (Luis Filipe Lopes, Patricia Sousa, Hugo Ferreira and João Zamite), the CMU-Portugal partnership and FCT (the Portuguese research funding agency) for their support.

Links:
Epiwork Project: http://www.epiwork.eu
Epidemic Marketplace: http://epiwork.di.fc.ul.pt/
Fedora Commons: http://www.fedora-commons.org/
Shibboleth: http://shibboleth.internet2.edu/
Muradora: http://www.fedora-commons.org/confluence/display/MURADORA/Muradora
Drupal: http://drupal.org
Gripenet: http://www.gripenet.pt

Figure 1: An envisioned deployment of the Epidemic Marketplace distributed among several locations. Currently, only the Lisbon node has been deployed. Each Epidemic Marketplace node is composed of four modules: repository, MEDCollector, forum and mediator. The mediator will be the contact point for other applications, such as Internet Monitoring System nodes (eg Gripenet), and for clients that show data in a graphical and interactive way using geographical maps and trend graphs.

Please contact:
Mario J. Silva
Universidade de Lisboa, Portugal
E-mail: [email protected]

Cross-Project Uptake of Biomedical Text Mining Results for Candidate Gene Searches

by Christoph M. Friedrich, Christian Ebeling and David Manset

From intracranial aneurysms to paediatric diseases - A biomedical text mining service developed in the European IP-project @neurIST has been integrated into a 3D Knowledge Browser developed in the European IP-project Health-e-Child and can be used for candidate gene searches in different diseases.

Most biomedical knowledge can be found in unstructured form in publications. Every day approximately 2000 new citations are added to the PubMed database, a repository of more than 20 million biomedical citations. Even within specific disease areas it is impossible for scientists to stay up to date. Text mining is seen as a solution to this problem. In the European Integrated Project @neurIST (FP6, 1/2006-4/2010), which was concerned with integrated biomedical informatics for the management of intracranial aneurysms, a biomedical text mining system called SCAIView has been developed among other data mining solutions. This Knowledge Discovery System, depicted in Figure 1, can answer questions like: "Which genes or gene variations are concerned with intracranial aneurysms?", "Which co-morbidities are mentioned together with Alzheimer's disease?" or "Which pathways are involved in diabetes?".

The key technologies that enable SCAIView are called semantic search and ontology search. The core of the service is a precompiled index that holds a copy of the PubMed database for fulltext searches and additional information on named entities, which

Figure 1: The search interface of SCAIView.

Figure 2: Highlighting and Enrichment through Named Entity Recognition.


Figure 3: SCAIView searches integrated into the 3D Knowledge Browser.

occur within the text. A named entity is a semantic entity, such as a drug name or gene name, which is found by Named Entity Recognition (NER). In the NER process all synonyms of a semantic entity are used for an approximative search and, where possible, the found entity is mapped to a unique database identifier. This process is necessary as some genes or diseases have up to 250 different name variants that occur in text. The information is enriched with data from biomedical databases and ontologies (see Figure 2), so that finally a query can be submitted for all genes which are on a certain pathway and are co-mentioned with a disease.

Most of the named entities in SCAIView are searched with dictionaries, but some entities cannot be enumerated beforehand. For these cases rule- and machine-learning based entity recognisers have been implemented. The recognition process involves machine learning from examples or rules provided by humans. One example might be gene variations like "... polymorphisms in TIMP-3 (249T>C, 261C>T)". In this example the result of the search is an identifier of a polymorphism in a database or the identifier of a probe on a microarray for genome-wide association studies.

Evaluation of Retrieval Performance
Every search engine is only as good as its results are valid, novel, relevant and useful for the user. The evaluation of SCAIView was based on the question for candidate genes of a disease. The gold standard for "Intracranial Aneurysms" was an expert review on the topic and a Cochrane Report that had been produced in the course of the @neurIST project. We found all candidate genes with our search, and they were distributed among the top-ranking hits. Additionally we found other novel candidate genes which had not been mentioned in the reviews. Other successful evaluations have been conducted for Alzheimer's disease, schizophrenia and Parkinson's disease.

Integration into the 3D Knowledge Browser
The integrated European project Health-e-Child (HeC) (FP6, 2/2006-4/2010) has developed a 3D Knowledge Browser, which is specially suited for multi-scale and multi-level searches in biomedical knowledge sources. It lacked a search engine for discovering links between HeC data and gene-based information published in external sources. Therefore we conducted a joint research project and integrated the SCAIView results via web services into the 3D Knowledge Browser. Now entity searches, for instance for candidate genes, can be displayed next to database searches such as a retrieval from SwissProt. In Figure 3 a screenshot of the resulting interface is given. In the Health-e-Child project, this has been used to search for candidate genes in paediatric heart diseases, inflammatory diseases, and brain tumours.

Links:
http://www.aneurist.org
http://www.health-e-child.org

Please contact:
Christoph M. Friedrich
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI), Germany
Tel: +49 2241 142502
E-mail: [email protected]
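The dictionary-based NER step described in this article — every synonym of an entity mapped to one unique database identifier — can be sketched as follows. The synonym table is a tiny illustrative stand-in for SCAIView's dictionaries, not its actual contents; the identifiers follow Entrez Gene numbering:

```python
# Illustrative synonym dictionary: each surface form maps to one
# unique database identifier, as in SCAIView's NER step.
SYNONYMS = {
    "timp3": "EntrezGene:7078",
    "timp-3": "EntrezGene:7078",
    "tissue inhibitor of metalloproteinases 3": "EntrezGene:7078",
    "apoe": "EntrezGene:348",
}

def normalise(token):
    return token.lower().strip(".,;()")

def tag_entities(text):
    """Greedy longest-match dictionary NER over a whitespace tokenisation."""
    words = text.split()
    found, i = [], 0
    while i < len(words):
        for span in range(min(5, len(words) - i), 0, -1):  # longest match first
            phrase = normalise(" ".join(words[i:i + span]))
            if phrase in SYNONYMS:
                found.append((phrase, SYNONYMS[phrase]))
                i += span
                break
        else:
            i += 1
    return found

entities = tag_entities("Polymorphisms in TIMP-3 and APOE were reported.")
```

Trying the longest phrase first keeps multi-word names from being broken into their parts before the full synonym can match.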

Ontologies and Vector-Borne Diseases: New Tools for Old Illnesses

by Pantelis Topalis, Emmanuel Dialynas and Christos Louis

Vector-borne diseases are illnesses that are characterized by the transmission of infectious agents between humans via the bites of arthropod vectors, most prominently mosquitoes. It is hoped that recent scientific advances, especially in the areas of high throughput biological research and bioinformatics will assist in alleviating the global burden caused by these diseases.

Vector-borne diseases have been central in shaping human history, playing a crucial role in events such as the siege of Syracuse and the deaths of prominent leaders such as Alexander the Great. More than one million deaths annually, resulting from hundreds of millions of cases, can be attributed to malaria alone, which, in spite of considerable efforts, still represents one of the menaces faced by the poorest populations in the tropics. The fight against this disease is hampered by increasing resistance exhibited both by Plasmodium, the malaria parasite, against drugs, and by vector anopheline mosquitoes against insecticides used for public health purposes. Moreover, the lack of vaccines and the less-than-perfect infrastructure in the countries most affected have not helped improve the situation. Nevertheless, the advent and expansion of molecular biology and genomics over the last decade has provided new knowledge on all three "partners" of the vector-borne diseases (human, pathogen and vector). Combined with the recent research boom in bioinformatics, this knowledge could potentially be put to use for the development of specific tools that might help control the diseases. The potential eradication of malaria, a notion that was abandoned 40 years ago, is now being discussed again as a possibility.

In the frame of this enhanced effort to fight vector-borne diseases, our group has joined several international networks that share the same goals, most notably BioMalPar and its successor EVIMalaR, both Networks of Excellence in the frame of the FP6 and FP7 programmes of the European Union, and the VectorBase (VB; PI: Frank Collins, University of Notre Dame) project funded by the National Institute of Allergy and Infectious Diseases of the USA. While the European networks concentrate on "bench" biological research, the latter is a dedicated bioinformatics project that is responsible for the development of a comprehensive database of genetic, genomic and other bio-data that are pertinent to disease vectors. The key contents of VB include the genomic sequence information on several vectors, first of which is Anopheles gambiae, the most important malaria vector in Africa. Additional information hosted by VB includes, among others, data on gene expression, specific bioinformatics tools and a section on insecticide resistance, IRBase (see below), which was constructed by the Cretan team. In addition to these, VB is also the home of

Mosquitoes from an Anopheles gambiae laboratory colony, "resting" in the authors' laboratory.


a set of ontologies, all in OBO format, that describe different aspects of vector-borne diseases; these are the main focus of interest of the IMBB group in the frame of VB. They can be used to promote interoperability between databases and facilitate complex queries in different but related datasets.

The first two ontologies to be built were those relating to the anatomy of mosquitoes (TGMA) and of ticks (TADS), both species groups that include several major vectors. These ontologies were primarily constructed to help the annotation of biological experiments dealing with these species. The next ontology developed, MIRO, covers the domain of insecticide resistance, a feature that is of extreme importance for the control of both agricultural pests and disease vectors. In the frame of VB, MIRO drives IRBase, the corresponding database, which collects data supplied by both individuals and organizations such as the World Health Organization, with the aim of making them freely available to the worldwide community of field entomologists and public health workers. Finally, a large malaria ontology (IDOMAL) is also part of the VB effort. This ontology, developed in the frame of the international consortium IDO that works on the construction of a top-level infectious disease ontology, is the first step in an ambitious plan to construct ontologies for all major vector-borne diseases, such as African and American trypanosomiasis, lymphatic filariasis, yellow fever and more.

The common thread between all the ontologies that are being constructed de novo is that all follow the rules set by the OBO Foundry, a loose collaboration between teams developing open biological ontologies. Moreover, we build our ontologies on the format laid down by BFO, the basic formal ontology. Both of these conditions guarantee the maximum of interoperability and orthogonality, thus enhancing their usability.

In addition to the "classical" potential usage of our existing and new/forthcoming ontologies in driving databases, their availability under the criteria briefly described makes it possible to use them in more practical ways, for example in driving epidemiological decision support systems. Given the state of public health in most third-world countries, we believe that a functional improvement based on bioinformatics tools, such as the specialized ontologies on which we work, is a practical way to help the poverty-stricken populations of the tropics.

Links:
http://www.vectorbase.org
http://anobase.vectorbase.org/ontologies

Please contact:
Christos Louis
IMBB-FORTH, Greece
Tel: +30 281 039 1119
E-mail: [email protected]
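The OBO flat-file format mentioned above is simple enough that a term stanza can be parsed in a few lines. The stanza below is invented for illustration (the ID and names are not real TGMA entries), but it has the shape used by these ontologies:

```python
# A hypothetical OBO-format [Term] stanza of the kind the VectorBase
# ontologies use (the ID and names below are invented for illustration).
STANZA = """\
[Term]
id: TGMA:0000123
name: adult antenna
synonym: "antennae" EXACT []
is_a: TGMA:0000001 ! anatomical entity
"""

def parse_term(stanza):
    """Parse one [Term] stanza into tag -> values; repeated tags accumulate
    and trailing '!' comments are stripped, as in the OBO flat-file format."""
    term = {}
    for line in stanza.splitlines():
        if ": " not in line:
            continue  # skip the [Term] header and blank lines
        tag, _, value = line.partition(": ")
        term.setdefault(tag, []).append(value.split(" ! ")[0])
    return term

term = parse_term(STANZA)
```

Because every tag maps to a list, synonyms and parent (`is_a`) links accumulate naturally, which is what makes such files usable for the cross-database queries described above.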

Unraveling Hypertrophic Cardiomyopathy Variability

by Catia M. Machado, Francisco Couto, Alexandra R. Fernandes, Susana Santos, Nuno Cardim and Ana T. Freitas

Hypertrophic cardiomyopathy is a disease characterized by a high genetic heterogeneity with variable clinical presentation, thus rendering the possibility of personalized treatments highly desirable. This can be achieved through the integration of genomic and clinical data with Semantic Web technologies, combined with the identification of correlations between the data elements using data mining techniques.

Hypertrophic cardiomyopathy (HCM) is an autosomal dominant genetic disease that may afflict as many as one in 500 individuals, being the most frequent cause of sudden death among apparently healthy young people and athletes. It is an important risk factor for heart failure disability at any age and is characterized by a hypertrophied, non-dilated left ventricle, myocyte disarray and interstitial fibrosis.

Since the disease is characterized by a variable clinical presentation and onset, its clinical diagnosis is difficult prior to the development of severe or even fatal symptoms. Therefore, its early diagnosis is extremely important.

In terms of genetic manifestation, more than 640 mutations in more than 20 genes have been associated with the HCM phenotype. The detection of these mutations in the genome of the patient can greatly improve the disease diagnosis. However, genetic diagnosis using dideoxy-sequencing techniques is hampered by these high numbers of genes/mutations and by the fact that some patients present the clinical manifestations of the disease but none of the mutations is found under the current genetic testing (which indicates that even more mutations and/or genes might be involved). Moreover, HCM's severity may not be the same for two individuals, even if direct relatives, since the presence of a given mutation can have a benign pattern in one individual and result in sudden cardiac death in another.

These complicating factors indicate that a joint strategy among clinicians, biologists and bioinformaticians may be advantageous. Biologists need to identify and characterize new mutations with high throughput techniques, thus enabling clinicians to count on this information, when articulated with the clinical findings, to provide the best possible treatment for each individual patient. Bioinformaticians have the tools needed to expeditiously integrate and analyse the data, thus providing the necessary articulation between the different types of data and the identification of patterns that might shed light on the disease variability.

In more concrete terms, the data elements that need to be integrated corre-

48 ERCIM NEWS 82 July 2010 Clinical data Clinical Transcript collection history variants

SNPs Genotyping Biomarker Biomarkers identification Physical Exams results exams Mutations results collection

Blood sample RNA Amplification fragments Biological Nucleic acid Amplification sample extraction Muscle tissue DNA collection Primers sample

Figure 1: Model of a human heart showing the thickening of the left ventricular wall, and the HCM characterization workflow, comprising all modelling activities. spond to the presence/absence of each traits. Although a large number of studies Currently in the integration stage, this mutation in the genome of the patients have been conducted on the linkage work is being conducted in the LASIGE (genotype data) and to the clinical ele- between specific mutations and the risk group at the Departamento de ments upon which the clinicians rely to of specific illnesses (eg cystic fibrosis), Informática of the Universidade de provide a diagnose (phenotype data). models for the more general case of Lisboa and in the KDBIO group at the The latter normally include the results genotype to phenotype association in the Instituto de Engenharia de Sistemas e from physical examinations (eg electro- presence of high disease complexity, Computadores of the Instituto Superior cardiogram, echocardiogram), as well both genetic and clinical, remains largely Técnico (both in Lisbon, Portugal). The as the clinical history of the individual unexplored. Supervised machine data used are provided by the six other (eg age at diagnosis, sudden deaths in learning techniques, such as decision Portuguese institutions listed below: the family). trees and support vector machines, offer • Phenotype data – Hospitais da Uni- the potential to identify more complex versidade de Coimbra (Coimbra), The data integration procedure in this relationships than those identified using Centro de Cardiologia da Universi- context is a very demanding task consid- simple correlation analysis, the standard dade de Lisboa, Hospital da Luz and ering that genotype and phenotype data practice in genotype-phenotype associa- Hospital de Sta. Cruz (all in Lisbon) correspond to transversal domains, nor- tion models. 
Standard statistical analysis • Genotype data – Centro de Química mally stored under heterogeneous for- may identify correlations between one or Estrutural of the Instituto Superior mats and on different physical locations. a small set of specific mutations, but in Técnico of the Universidade Técnica The approach proposed here is based on more complex cases these correlations de Lisboa and the Faculdade de Far- Semantic Web technologies, previously will not be significant enough to lead to mácia of the Universidade de Lisboa identified as suitable for this type of task concrete diagnosis methods. The models (both in Lisbon). since they make it possible to integrate, obtained using data mining techniques share and reuse data in an application- are expected to be of great interest both and domain-independent manner. in terms of their predictive ability and Link: their practical usability for doctors. http://kdbio.inesc-id.pt Upon completion of the data integration phase, the data will be analysed with data The ultimate goal of this work is to Please contact: mining techniques in order to infer geno- develop a clinical characterization Ana Teresa Freitas type-phenotype correlations, or, more system that, upon introduction of a new KDBIO research group, specifically, to develop models for the patient’s data, will provide an indication INESC-ID/IST Lisbon, Portugal association between the presence of cer- of whether the patient is suffering from Tel: +351 213100394 tain mutations and the resulting physical HCM based on the existing model. E-mail: [email protected]
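To make the supervised-learning step concrete, the following is a minimal, purely illustrative sketch (not the project's actual pipeline) of how a decision tree can be induced over binary mutation-presence features to separate HCM patients from controls. The gene names and toy patient records are hypothetical examples, chosen so that the tree must discover a combined rule (a mutation pair) that a single-mutation correlation would miss.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels, features):
    # pick the feature with the highest information gain
    base, best = entropy(labels), None
    for f in features:
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue
        gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        if best is None or gain > best[0]:
            best = (gain, f)
    return best

def build_tree(rows, labels, features, depth=3):
    if len(set(labels)) == 1 or depth == 0:
        return Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f = split[1]
    yes = [(r, l) for r, l in zip(rows, labels) if r[f]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[f]]
    return (f,
            build_tree([r for r, _ in yes], [l for _, l in yes], features, depth - 1),
            build_tree([r for r, _ in no], [l for _, l in no], features, depth - 1))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, on_yes, on_no = tree
        tree = on_yes if row[f] else on_no
    return tree

# Illustrative genotype records (mutation present = 1) and phenotype labels.
GENES = ["MYH7", "MYBPC3", "TNNT2"]
patients = [
    ({"MYH7": 1, "MYBPC3": 0, "TNNT2": 0}, "HCM"),
    ({"MYH7": 1, "MYBPC3": 1, "TNNT2": 0}, "HCM"),
    ({"MYH7": 0, "MYBPC3": 1, "TNNT2": 1}, "HCM"),
    ({"MYH7": 0, "MYBPC3": 1, "TNNT2": 0}, "control"),
    ({"MYH7": 0, "MYBPC3": 0, "TNNT2": 1}, "control"),
    ({"MYH7": 0, "MYBPC3": 0, "TNNT2": 0}, "control"),
]
tree = build_tree([p for p, _ in patients], [l for _, l in patients], GENES)
```

On this toy data the induced tree classifies the two-mutation carrier as HCM even though neither of those two mutations is predictive on its own, which is exactly the kind of relationship the article argues simple correlation analysis fails to capture.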

R&D and Technology Transfer

Continuous Evolutionary Automated Testing for the Future Internet

by Tanja E. J. Vos

The Future Internet will be a complex interconnection of services, applications, content and media, on which our society will become increasingly dependent for critical activities such as social services, learning, finance, business, as well as entertainment. This presents challenging problems during testing; challenges that simply cannot be avoided since testing is a vital part of the quality assurance process. The European-funded project FITTEST (Future Internet Testing) aims to attack the problems of testing the Future Internet with Evolutionary Search Based Testing.

The Future Internet (FI) will be a complex interconnection of services, applications, contents and media, possibly augmented with semantic information, and based on technologies that offer a rich user experience, extending and improving current hyperlink-based navigation. Our society will be increasingly dependent on services built on top of the FI, for critical activities such as public utilities, social services, government, learning, finance, business, as well as entertainment. As a consequence, the applications running on top of the FI will have to satisfy strict and demanding quality and dependability standards.

FI applications will be characterized by an extreme level of dynamism. Most decisions made at design or deploy time are deferred to execution time, when the application takes advantage of monitoring (self-observation, as well as data collection from the environment and logging of the interactions) to adapt itself to a changed usage context. The realization of this vision involves a number of technologies, including:
• Observational reflection and monitoring, to gather data necessary for run-time decisions.
• Dynamic discovery/composition of services, hot component loading and update.
• Structural reflection, for self-adaptation.
• High configurability and context awareness.
• Composability into large systems of systems.

While offering a major improvement over the currently available Web experience, the complexity of the technologies involved in FI applications makes testing them extremely challenging. Some of the challenges are described in Table 1. The European project FITTEST (ICT-257574, September 2010-2013) aims at addressing the testing challenges in Table 1 by developing and evaluating an integrated environment for continuous evolutionary automated testing, which can monitor the FI application under test and adapt to the dynamic changes observed. Today's traditional testing usually ends with the release of a planned product version, with testing being discontinued after application delivery. All testware (test cases as well as the underlying behavioural model) is constructed and executed before delivering the software. These testwares are fixed, because the System Under Test (SUT) has a fixed set of features and functionalities, since its behaviour is determined before its release. If the released product needs to be updated because of changed user requirements or severe bugs that need to be repaired, a new development cycle will start and regression testing needs to be done in order to ensure that the previous functionalities still work with the new changes. The fixed testwares need to be adapted manually in order to cope with the changed requirements, functionalities, user interfaces and/or work-flows of the system (see Figure 1 for a graphical representation of traditional testing).

Figure 1: Traditional testing versus FITTEST testing.

1. Dynamism, self-modification and autonomic behaviour: FI applications are highly autonomous; their correct behaviour for testing cannot be specified and modelled precisely at design-time.

2. Low observability: FI applications are composed of an increasing number of third-party components and services, accessed as a black box, which are hard to test.

3. Asynchronous interactions: FI applications are highly asynchronous and hence harder to test. Each client submits multiple requests asynchronously; multiple clients run in parallel; server-side computations are distributed over the network and concurrent.

4. Time and load dependent behaviour: For FI applications, timing and load conditions make it hard to reproduce errors during debugging.

5. Huge feature configuration space: FI applications are highly customizable and self-configuring, having a huge number of configurable features, such as context-, user- and environment-dependent configurable parameters that need to be tested.

6. Ultra-large scale: FI applications are often systems of systems; traditional testing adequacy criteria, like coverage, cannot be applied.

Table 1: Future Internet Challenges.

FI testing will require continuous post-release testing, since the application under test does not remain fixed after its release. Services and components could be dynamically added by customers and the intended use could change significantly. Therefore, testing has to be performed continuously after deployment to the customer.

The FITTEST testing environment will integrate, adapt and automate various techniques for continuous FI testing (eg dynamic model inference, model-based test case derivation, log-based diagnosis, oracle learning, classification trees and combinatorial testing, concurrent testing, regression testing, etc). To make it possible for the above mentioned techniques to deal with the huge search space associated with FI testing, evolutionary search based testing will be used. Search Based Software Testing (SBST) is the FITTEST basis for test case generation, since it can be adopted even in the presence of the dynamism and partial observability that characterize FI applications. The key ingredients required by search algorithms are:
• an objective function that measures the degree to which the current solution (eg a test case) achieves the testing goal (eg coverage, concurrency problems, etc)
• operators that modify the current solution, producing new candidate solutions to be evaluated in the next iterations of the algorithm.

Such ingredients can usually be obtained even in the presence of high dynamism and low observability. Hence, it is expected that most techniques developed to address the project's objectives will take advantage of a search based approach. For example, model-inference for model-based testing requires execution scenarios to limit under-approximation. We can generate them through SBST, with the objective of maximizing the portion of equivalent states explored. Oracle learning needs training data that are used to infer candidate specifications for the system under test. We can select the appropriate executions through SBST, using a different objective function (eg maximizing the level of confidence and support for each inferred property). We will generate classification-trees from models and use SBST to generate test cases for classification-trees. Input data generation and feasibility checking are clearly good candidates for search based algorithms, with an objective function that quantifies how close a test case gets to satisfying a path of interest. In concurrency testing, test cases that produce critical execution conditions are given a higher objective value defined for this purpose. Coverage targets can be used to define the objective function for coverage and regression testing of ultra-large scale systems.

SBST represents the unifying conceptual framework for the various testing techniques that will be adapted for FI testing in the project. Implementations of such techniques will be integrated into a unified environment for continuous, automated testing of FI applications. Quantification of the actually achieved project objectives will be obtained by executing a number of case studies, designed in accordance with the best practices of empirical software engineering.

The FITTEST project is composed of partners from Spain (Universidad Politecnica de Valencia), UK (University College London), Germany (Berner & Mattner), Israel (IBM), Italy (Fondazione Bruno Kessler), The Netherlands (Utrecht University), France (Softeam) and Finland (Sulake).

Link:
http://www.pros.upv.es/fittest/

Please contact:
Tanja E.J. Vos
Universidad Politécnica de Valencia / SpaRCIM
Tel: +34 690 917 971
E-mail: [email protected]
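The two SBST ingredients the article names, an objective function and mutation operators, can be sketched with a toy example. The system under test, its hard-to-reach branch and the branch-distance-style objective below are hypothetical illustrations, not FITTEST code: a simple local search mutates a candidate test input until the distance to the target branch reaches zero.

```python
import random

def system_under_test(x, y):
    # hypothetical SUT: the nested branch is the coverage target
    if x * 2 == y:
        if y > 100:
            return "target reached"
    return "other path"

def objective(x, y):
    # branch-distance-style objective function: 0 means the target branch
    # is hit; smaller values mean the candidate test case is "closer"
    d = abs(x * 2 - y)
    if y <= 100:
        d += (100 - y) + 1
    return d

def mutate(x, y):
    # operators: small perturbations of the current candidate test case
    return [(x + dx, y + dy) for dx in (-2, -1, 0, 1, 2)
                             for dy in (-2, -1, 0, 1, 2)]

def search(seed=1, budget=20000):
    rng = random.Random(seed)
    x, y = rng.randint(-500, 500), rng.randint(-500, 500)
    for _ in range(budget):
        if objective(x, y) == 0:
            return x, y          # a test input covering the target branch
        cand = min(mutate(x, y), key=lambda p: objective(*p))
        if objective(*cand) < objective(x, y):
            x, y = cand          # strictly improving neighbour: accept
        else:
            # local optimum: restart from a fresh random test case
            x, y = rng.randint(-500, 500), rng.randint(-500, 500)
    return None

found = search()
```

Random test generation would hit this branch only by chance; the objective function turns the search space into a gradient the algorithm can descend, which is the core idea behind using SBST for the FI testing techniques listed above.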


Robust Image Analysis in the Evaluation of Gene Expression Studies

Figure 1: Beadchip for genome-wide expression analysis containing twelve strips.

by Jan Kalina

Gene expression data are typically analysed by standard automated procedures that tend to be vulnerable to outlying values. In a project carried out by the Centre of Biomedical Informatics of the Ministry of Education, Youth and Sports of the Czech Republic, we use alternative approaches, based on robust statistical methods, to measure differential gene expression in cardiovascular patients. Our results are applicable to personalized medicine.

Recent research in the area of molecular bioinformatics and genetics, conducted by the Centre of Biomedical Informatics, aims to find the optimal set of genes for diagnostics and prognosis of cardiovascular diseases. Since 2006 we have been using whole-genome beadchip microarray technology to measure gene expression. A sample of peripheral blood is taken from each patient, the ribonucleic acid (RNA) is isolated and applied on the microarray. This technology allows gene expression to be measured as the gene activity leading to synthesis of proteins and consequent biological processes. The Municipal Hospital in Čáslav takes blood samples from two groups of patients: one group with acute myocardial infarct (AMI) or cerebrovascular accident (CVA) (as examples of ischemic diseases); and a control group (patients hospitalized with a different cause, without a manifested ischemic disease). This whole-genome analysis examines the entire set of human genes with different microbeads corresponding to different genes randomly distributed on the surface of the microarray, which contains 12 separate physical strips (Figure 1) for samples from different patients.

The standard approach in the preprocessing of the data tends to be vulnerable to outlying observations. The raw data are scanned images with a high fluorescence intensity corresponding to highly expressed genes. To compute the bead-level data for particular microbeads a cascade of transformations is computed, including the local estimation of background, image sharpening and smoothing by averaging, estimating foreground, background correction, data normalization and outlier deletion. The initial steps are strongly influenced by local noise in the neighbourhood of particular microbeads and the resulting biased values are passed on to the next steps of the analysis. The outlier deletion is computed only at the end of the procedure. Therefore the differential expression analysis is sensitive to random or systematic errors in the original data.

As an alternative to the processing of the scanned images with gene expression measurements we propose a more robust approach which involves searching for systematic artifacts in the data. The methods are based on robust statistics applied to image analysis, which enables outliers to be deleted at each step of the procedure. The method is also robust to specific properties of the neighbourhood of particular pixels. A careful normalization of bead-level data is computed only after deleting the outlying values. At the same time these methods allow fast computing and are computationally feasible. We propose that standard software for analysing gene expression data be modified to incorporate this new approach.

The outcome of this unique project is the ability to demonstrate which genes are more strongly expressed in patients with acute myocardial infarct or cerebrovascular accident compared to controls. The significance of differential expression of particular genes is acquired by means of statistical hypothesis testing. Clinical and biochemical data recorded for each patient contribute to our understanding of the genetic predisposition to cardiovascular disease. The study of gene expression profiling has allowed the Centre of Biomedical Informatics to patent an oligonucleotide microarray as the main result of the whole study, directly applicable to disease diagnosis, prognosis, prediction and treatment. This technology, containing an optimal set of genes, provides an invaluable contribution towards the development of personalized and predictive medical care, in keeping with the new paradigm of data-driven evidence-based medicine.

We believe that the development and use of robust analysis methods is becoming increasingly important in the area of bioinformatics. Next generation (Next-Gen) sequencing, the new low-noise approach to genetic analysis, is currently undergoing rapid development, producing huge data sets. Robust statistical methods are therefore becoming ever more crucial for fast and reliable data analysis. It is vital, when designing real-time image analysis systems for Next-Gen technologies, that such methods are adaptive and tailor-made for the particular task, allowing the user to tune parameters. The future of molecular bioinformatics will therefore require more precise and well-considered robust image analysis.

Links:
http://www.euromise.org/cbi/cbi.html
http://www.euromise.org/homepage/department.html

Please contact:
Jan Kalina, Centre of Biomedical Informatics, Institute of Computer Science, Academy of Sciences of the Czech Republic / CRCIM
Tel: +420 266053099
E-mail: [email protected]
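The idea of deleting outliers at each step rather than only at the end can be illustrated with a minimal robust-screening sketch. The filter below, based on the median and the median absolute deviation (MAD), is a generic robust-statistics example assumed for illustration, not the Centre's actual algorithm, and the bead intensities are toy numbers.

```python
import statistics

def robust_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score (median/MAD based) exceeds the cutoff.

    Unlike a mean/standard-deviation rule, the median and the MAD are
    themselves barely affected by the outliers they are meant to detect."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:  # degenerate case: more than half of the values identical
        return [False] * len(values)
    # 0.6745 rescales the MAD to be comparable to a standard deviation
    return [abs(0.6745 * (v - med) / mad) > cutoff for v in values]

# Toy bead-level intensities for one probe; 250 is a scanning artifact.
intensities = [102, 98, 100, 101, 99, 103, 97, 250]
flags = robust_outliers(intensities)
cleaned = [v for v, bad in zip(intensities, flags) if not bad]
```

On this toy probe the naive mean is 118.75, dragged upward by the single artifact, while the mean after robust screening is 100.0; applying such a screen before, rather than after, normalization is the essence of the approach described above.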

Autonomous Production of Sport Video Summaries

by Christophe De Vleeschouwer

Video production cost reduction is an ongoing challenge. The FP7 'APIDIS' project (Autonomous Production of Images based on Distributed and Intelligent Sensing) uses computer vision and artificial intelligence to propose a solution to this problem. Distributed analysis and interpretation of a sporting event are used to determine what to show or not to show from a multi-camera stream. This process involves automatic scene analysis, camera viewpoint selection, and generation of summaries through automatic organization of stories. APIDIS provides practical solutions to a wide range of applications, such as personalized access to local sport events on the web or automatic logging of annotations.

Today, individuals and organizations want to access dedicated contents through a personalized service that is able to provide what they are interested in, at a time convenient to them, and through the distribution channel of their choice. To address such demands, cost-effective and autonomous generation of sports team video contents from multi-sensored data becomes essential, ie to generate on-demand football or basketball match summaries.

APIDIS is a research consortium developing the automatic extraction of intelligent contents from a network of cameras and microphones distributed around a sports ground. Here, intelligence refers to the identification of salient segments within the audiovisual content, using distributed scene analysis algorithms. In a second step, that knowledge is exploited to automate the production and personalize the summary of video contents.

Specifically, salient segments in the raw video content are identified based on player movement analysis and scoreboard monitoring. Player detection and tracking methods rely on the fusion of the foreground likelihood information computed in each camera view. This overcomes the traditional hurdles associated with single view analysis, such as occlusions, shadows and changing illumination. Scoreboard monitoring provides valuable additional input which assists in identifying the main focal points of the game.

In order to produce semantically meaningful and perceptually comfortable video summaries, based on the extraction of sub-images from the raw content, the APIDIS framework introduces three fundamental concepts: "completeness", "smoothness" and "fineness". These concepts are defined below. Scene analysis algorithms then select temporal segments and corresponding viewpoints in the edited summary as two independent optimization problems, according to individual user preferences (eg in terms of preferred player or video access resolution). To illustrate these techniques, we consider a basketball game case study, which incorporates some of the latest research outputs of the FP7 APIDIS research project.

Multi-view player detection, recognition, and tracking
The problem of tracking multiple people in cluttered scenes has been extensively studied, mainly because it is common to numerous applications, ranging from sport event reporting to surveillance in public spaces. A typical problem is that all players in a sports team have a very similar appearance. For this reason, we integrate the information provided by multiple views, and focus on a particular subset of methods that do not use color models or shape cues of individual people, but instead rely on the distinction of foreground from background in each individual camera view to infer the ground plane locations occupied by people.

Figure 1 summarizes our proposed method. Once players and referee have been localized, the system has to decide who's who. To achieve this, histogram analysis is performed on the expected body area of each detected person. Histogram peak extraction enables assignment of a team label to each detected player (see bounding boxes around the red and blue teams). Further segmentation and analysis of the

Figure 1: On the left, the foreground likelihoods are extracted from each camera. They are projected to define a ground occupancy map (bottom right, in blue) used for player detection and tracking, which in turn supports camera selection.

Figure 2: Camera selection and field of view selection.
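The fusion of per-camera foreground likelihoods into a ground occupancy map can be sketched very simply. The grids, likelihood values and the product-style fusion below are illustrative assumptions (the projection via camera homographies is omitted), not the APIDIS implementation: positions supported by all views keep a high score, while a shadow seen by only one camera is suppressed.

```python
def fuse_ground_maps(maps):
    """Multiply per-camera foreground likelihoods that have already been
    projected onto a common ground-plane grid (homographies omitted)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    fused = [[1.0] * cols for _ in range(rows)]
    for m in maps:
        for r in range(rows):
            for c in range(cols):
                fused[r][c] *= m[r][c]
    return fused

def detect_players(fused, threshold=0.5):
    # ground cells whose fused likelihood survives the threshold
    return [(r, c) for r, row in enumerate(fused)
            for c, v in enumerate(row) if v >= threshold]

# Two illustrative 3x3 ground grids: camera A sees a shadow at (0, 0)
# that camera B does not confirm; both agree on a player at (1, 1).
cam_a = [[0.9, 0.1, 0.1],
         [0.1, 0.9, 0.1],
         [0.1, 0.1, 0.1]]
cam_b = [[0.1, 0.1, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.1]]
fused = fuse_ground_maps([cam_a, cam_b])
```

The single-view false positive at (0, 0) collapses to 0.09 after fusion, while the jointly supported cell at (1, 1) stays at 0.72, which is the intuition behind multi-view fusion overcoming occlusions, shadows and illumination changes.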


regions composing the expected body area permits detection and recognition of the digits printed on the players' shirts when they face the camera.

Event recognition
The main events occurring during a basketball game include field goals, violations, fouls, balls out-of-bounds and free-throws. All these events correspond to 'clock-events', ie they cause a stop, start or re-initialization of the 24'' clock, and they can occur in periods during which the clock is stopped.

An event tree is built on the basis of the clock and scoreboard information. When needed, this information is completed by visual hints, typically provided as outcomes of the player (and ball) tracking algorithms. For instance, an analysis of the trajectories of the players can assist in decision-making after a start of the 24'' clock following a 'rebound after free-throw' or a 'throw-in' event.

Autonomous production of personalized video summaries
To produce condensed video reports of a sporting event, the system selects the temporal segments corresponding to actions that are worth being included in the summary based on three factors:
• Completeness stands for both the integrity of view rendering in camera/viewpoint selection, and that of story-telling in the summary.
• Smoothness refers to the graceful displacement of the virtual camera viewpoint, and to the continuous story-telling resulting from the selection of contiguous temporal segments. Preserving smoothness is important to avoid distracting the viewer from the story with abrupt changes of viewpoint.
• Fineness refers to the amount of detail provided about the rendered action. Spatially, it favours close views. Temporally, it implies redundant story-telling, including replays. Increasing the fineness of a video does not only improve the viewing experience, but is also essential in guiding the emotional involvement of viewers through the use of close-up shots.

The ability to personalize the viewing experience through the application of different parameters for each end user was appreciated during the first subjective tests. The tests also revealed that viewers generally prefer the viewpoints selected by the automatic system to those selected by a human producer. This is, no doubt, partly explained by the severe load imposed on the human operator by an increasing number of cameras.

The author thanks the European Commission and the Walloon Region for funding part of this work through the FP7 APIDIS and WIST2 WALCOMO projects, respectively.

Link:
http://apidis.org/

Please contact:
Christophe De Vleeschouwer
Université Catholique de Louvain (UCL), Belgium
Tel: +32 1047 2543
E-mail: [email protected]

IOLanes: Advancing the Scalability and Performance of I/O Subsystems in Multicore Platforms

by Angelos Bilas

Data storage technology today faces many challenges, including performance inefficiencies, inadequate dependability and integrity guarantees, limited scalability, loss of confidentiality, poor resource sharing, and increased ownership and management costs. Given the importance of both direct-attached and networked storage systems for modern applications, it becomes imperative to address these issues. Multicore CPUs offer the promise of dealing with many of the underlying limitations of today's I/O architectures. However, this requires careful consideration of architectural and systems issues and complex interactions in the I/O stack, all the way from the application to the disk. "IOLanes" (Advancing the Scalability and Performance of I/O Subsystems in Multicore Platforms) is a new EC-funded project led by FORTH-ICS that addresses these issues.

IOLanes targets three major challenges: (i) dealing with performance and scalability issues of the I/O stack on multicore architectures, (ii) addressing I/O performance and dynamic resource management issues in virtualised, single-host environments, and (iii) examining on-loading and off-loading tradeoffs for advanced functions that are becoming essential in modern storage systems, eg compression, protection, encryption, error correction.

Concept
Our lives are becoming increasingly dependent on electronic information that is processed and stored in computer systems. Individuals, businesses, and organizations cannot survive in today's competitive economy without the use of digital information stored in electronic data storage systems. Data storage is perhaps the most critical and valuable component of today's computing infrastructures. However, critical and valuable as they may be, existing data storage systems today fall short of applications' and users' needs in four main respects:
1. Performance: The storage system continues to be the performance bottleneck in most computer systems, due to the processor-disk performance gap. The recent advent of multicore processors and the steadily increasing number of cores per chip has resulted in a decrease in the effective storage bandwidth available per core.
2. Dependability and Data Integrity: For the last three decades there has been a dramatic increase of storage density in magnetic and optical media, which has led to significantly lower cost per capacity unit (€/GB). Furthermore, organizations and individuals have taken advantage of this trend by creating and storing ever-increasing amounts of digital data. Storing and handling these unprecedented amounts of data has led to increased failures; and failures in the storage subsystem can be unnerving.
3. Data Confidentiality: The sensitivity of digital data for individuals and organisations requires careful protection of all digital assets. Data should always be stored in, eg, an encrypted form to prevent leakage to anyone who can obtain access to the operating systems or physical access to the storage device(s). However, such solutions remain, to date, exotic, are not used in most systems, and even when they can be used, they are extremely costly in terms of processing cycles and power.
4. Resource and content consolidation: In order to minimise both cost and complexity in storage systems there is an increasing need to consolidate both (a) storage resources and (b) data and content. The first results in better use of available resources, mainly via sharing among multiple applications and amortizing peak needs. The second simplifies, as much as possible, management procedures and mechanisms that incur important costs to modern storage infrastructures.

A basic enabler for building the future storage systems is to take advantage in the I/O stack of the performance potential of multicore CPUs and at the same time deal with their shortcomings.

Challenges
Data storage processing is typically structured as an "I/O path" that takes data from the application to the storage device itself when we need to store it or retrieve it. The I/O path in existing virtualized systems traverses several layers of the system: the application and middleware layer, the virtual machine (VM) layer, the host operating system (OS) layer and the embedded storage controller firmware layer. In non-virtualized systems, there are no virtual machines and applications are executed on the host OS. This generic layered structure is shown in Figure 1.

Figure 1: Layers in the I/O path of existing systems.

IOLanes aims to analyse and address challenges throughout the I/O path. It will analyse and address inefficiencies associated with these layers on multicore CPUs, by designing an I/O stack that minimizes unnecessary overheads and scales with the number of cores. Since storage systems are perhaps the most critical component of modern computing infrastructures, the proposed work will benefit many I/O-intensive applications that support activities of businesses, organizations, and individuals alike. Our work will result in systems that are able to perform multi-GBytes/s I/O in virtualized environments, supporting advanced functionality, and allowing scalability with the number of cores, as multicore architectures evolve over time.

The project is carried out by ICS-FORTH; Univ. Politécnica Madrid and Barcelona Supercomputing Center, Spain; IBM Israel - Science and Technology LTD, Israel; Intel Performance Learning Solutions Ltd., Ireland; and Neurocom S.A., Greece. The IOLanes project is part of the portfolio of the Embedded Systems Unit – G3, Directorate General Information Society (http://cordis.europa.eu/ist/embedded).

Link: http://www.iolanes.eu

Please contact:
Angelos Bilas
FORTH-ICS, Greece
E-mail: [email protected]

InfoGuard: A Process-Centric Rule-Based Approach for Managing Information Quality

by Thanh Thoa Pham Thi, Juan Yao, Fakir Hossain and Markus Helfert

High quality data helps the data owner to save costs, to make better decisions and to improve customer service and thus, information quality impacts directly on businesses. Many tools for data parsing and standardization, data cleansing, or data auditing have been developed and commercialized. However, identifying and eliminating root causes of poor IQ along the information life cycle and within information systems is still challenging. Addressing this problem, we developed an innovative approach and tool which helps identify root causes of poor information quality.

In a recent survey by Gartner, information quality (IQ) related costs are estimated at millions of dollars (SearchDataManagement.com, "Poor data quality costing companies millions of dollars annually", 2009). Meanwhile, a Thomson Reuters and Lepus survey in 2010 (http://thomsonreuters.com/content/press_room/tf/tf_gen_business/2010_03_02_lepus_survey) revealed that "77% of participants intend to increase spending on projects that address data quality and consistency issues", which are "key to risk management and transparency in the financial crisis and the market rebuild".

In order to improve and maintain IQ, effective IQ management is necessary, which ensures that the raw material (or data) an organisation creates and collects is as accurate, complete and consistent as possible. Furthermore, the information product which is manufactured from raw data by transforming and assembling processes must also be accurate, complete and consistent. Thereby, IQ management usually involves processes, tools and procedures with defined roles and responsibilities.

Figure 1: InfoGuard architecture.

Total IQ Management is a cycle with four phases. The defining phase aims to define the characteristics of information products and the raw data involved, and to define IQ requirements for that information. The measuring phase focuses on IQ assessment and measurement against IQ requirements, based on metrics such as data accuracy, completeness, consistency and timeliness. The result of this phase is analyzed in the analyzing phase, which provides organizations with the causes of errors and plans or suitable actions for improving IQ. The improvement processes take place in the improving phase.

Many tools and techniques have been developed in order to measure the consistency of data, mostly involving some form of business rules. Current approaches and tools for IQ auditing and monitoring often validate data against a set of business rules. The principles of our approach are to analyze rules based on the consistency between business processes, data and the responsibilities of organizational roles. Therefore, apart from business rules set out by the organizations, we can help users to identify IQ rules which come from correlations between business processes and data, particularly dynamic rules, and authority rules on data and the performing of business processes. In other words, this approach makes clear the imprecision of the data model with regard to the business process model and vice versa, which are root causes of errors at the system specification level.

One of the most powerful features of our approach/tool is its support for rule identification by graphically relating business processes and data models. Business process elements and data elements concerning a rule are highlighted while the rule is being designed or displayed. In this way, the context and rationale of a rule are presented, which facilitates rule management.
These rules constrain the data entry or are Once IQ rules are obtained and stored in a rule repository, the used to report violations of data entries. However, the major analyzing service can be called to detect incorrect data. The challenge of these approaches is the identification of busi- tool then produces error reports after analyzing process. The ness rules, which can be complex and time consuming. report includes incorrect data, the involved rules, the Although simple rules can be relatively easy identifies, involved process and the involved organizational roles. dynamic rules along business processes are difficult to dis- Figure 1 illustrates the architecture of our approach. cover. In the future we will apply our approach to a broader scale of Funded by Enterprise Ireland under the Commercialization applications such as rules across applications, across data- programme, our research at the Business Informatics Group bases. We will need data profiling across databases for total at Dublin City University focuses on addressing those chal- data quality analyzing. lenges. A recent project -InfoGuard- aims to indentify and design business rules by combining business processes and Link: data models. This project complements our already estab- http://www.computing.dcu.ie/big lished rule-based data analyzing module for discovering root causes of data errors by indicating involved processes and Please contact: involved organizational roles. Markus Helfert School of Computing Our approach is based on business process documentation Dublin City University and data models extracted from legacy systems. The princi- E-mail: [email protected]
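To make the rule-based analysis concrete, here is a minimal sketch of how a repository of IQ rules can be checked against records to produce an error report that names the violated rule, the business process and the organizational role involved. All rule names, processes, roles and records below are invented for illustration; this is the general idea, not the InfoGuard implementation.

```python
# Illustrative sketch of rule-based IQ analysis: each rule carries the
# business process and organizational role it relates to, so a violation
# report can point back to the likely root cause, not just the bad value.
from dataclasses import dataclass
from typing import Callable

@dataclass
class IQRule:
    name: str
    process: str                      # business process the rule belongs to
    role: str                         # organizational role responsible
    check: Callable[[dict], bool]     # True if the record satisfies the rule

RULES = [
    IQRule("order_date_before_ship_date", "Order fulfilment", "Sales clerk",
           lambda r: r["order_date"] <= r["ship_date"]),
    IQRule("customer_id_present", "Customer registration", "Account manager",
           lambda r: bool(r.get("customer_id"))),
]

def analyze(records):
    """Return an error report: one entry per violated rule per record."""
    report = []
    for i, rec in enumerate(records):
        for rule in RULES:
            if not rule.check(rec):
                report.append({"record": i, "rule": rule.name,
                               "process": rule.process, "role": rule.role})
    return report

records = [
    {"customer_id": "C1", "order_date": "2010-03-01", "ship_date": "2010-03-05"},
    {"customer_id": "",   "order_date": "2010-04-02", "ship_date": "2010-04-01"},
]
print(analyze(records))   # two violations, both in record 1
```

Attaching process and role metadata to each rule is what turns a plain validation pass into the kind of root-cause report the article describes.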

Integration of an Electrical Vehicle Fleet into the Power Grid

by Carl Binding, Olle Sundström, Dieter Gantenbein and Bernhard Jansen

The sudden peak of oil prices in 2008, together with the depletion of oil reserves and the ecological impact of today's combustion engines, has sparked renewed interest in fully electrical vehicles (EV) and plug-in hybrid EVs (PHEV). Based on typical driving patterns, it is believed that EVs could substantially alter the energy mix used for individual transportation, current limitations of battery technology notwithstanding. In addition, in electrical power grids with a high percentage of fluctuating, renewable energy sources (wind, solar), a larger fleet of EVs can absorb peaks of power production and attenuate power troughs by acting as a distributed energy resource (DER), feeding power from its accumulator into the grid. In the context of the Danish EDISON project, we are investigating the potential impact of an EV fleet on electrical distribution grids and how to optimally integrate these resources into the power grid.

Figure 1: The EDISON Concept (Picture by courtesy of muff-.ch).

In the recent past, several motor vehicle vendors have announced PHEVs or EVs for the US, Japanese and European markets. The expected electrical range of these vehicles varies between 50 and 100 km, which is considered enough to cover a large percentage of daily driving patterns, in particular commuting to and from work.

The basic concept is to integrate into the vehicle an electrical accumulator which is recharged during stop-overs, preferably during periods in which electrical energy from environmentally friendly sources is abundant and exceeds the non-EV load in the power grid. Conversely, one can also imagine EVs feeding excess energy, when not needed by the car's operator, back into the grid during times of peak demand when insufficient power is supplied by wind or solar. Hence, the vehicle's accumulator acts as a buffer for electrical energy. Since electrical power grids require continuous balancing of generated and consumed power, this energy buffering is of technical and economic interest as a supply of balancing power to the power grid. In particular, in grids with intermittent power supplies and limited balancing power supplies (hydro, gas turbines), additional balancing power capabilities are appealing.

The goals of the Danish EDISON project are to build a pilot system to aggregate a fleet of EVs and to have this aggregator, or virtual power plant (EV-VPP), act as an entity capable of delivering and absorbing balancing power [2]. Clearly, given the power volumes of typical EV charging (which varies from 3 kW upwards) versus the power delivered by a typical wind turbine (1.5-3 MW), such aggregation is needed in order to achieve some balancing power impact.

Figure 1 illustrates the overall concept. CO2-neutral wind energy is fed into the EV fleet and can be consumed by the traditional domestic and industrial loads when the power supply from intermittent, renewable sources is reduced.

There are several challenges which need to be addressed by such an undertaking:
• Resource management: The EVs, as managed resources, need to maintain appropriate levels of state-of-energy in their accumulators to satisfy their users' requirements. Sufficient electrical power must be available when the vehicles are parked in order to recharge the batteries. When feeding power from the EV into the grid, enough energy for the next trip must remain stored in the accumulator. Ideally these requirements should be economically beneficial for the end user, ie he or she will want to charge the batteries in times of cheap power and possibly re-sell such power at a higher price during power troughs within the grid. A grid utility might have different objectives regarding pricing: a global cost optimization for the provision of energy to the EV fleet is its likely objective.
• Grid integration: Smart charging or discharging must be aligned with the electrical grid's power generation ability and the non-EV, conventional loads. Based on forecasts of available renewable power, EVs can optimize their charging behavior under the assumption that energy costs are low in times of abundant renewable power and vice versa. The energy requirements of each EV can either be estimated or obtained from the EV users in order to perform planning for individual EVs in relation to the overall power supply and non-EV load in the grid.
• Communications: Reliable, secure, timely and responsive communications are needed to exchange information between distributed energy producers, mobile energy consumers, and the EV-VPP. The grid's status, as well as its static parameters, is required for a comprehensive optimization, and this information needs to be relayed to the VPP.
• Billing and customer relationship management: Last but certainly not least, end-users will have to be billed for power consumed, and power fed from the EV's battery into the grid must be accounted for. This is based on metering information which can be transmitted from the EV's charging spot into the grid and utility IT infrastructure. End-users will need to be provided with the ability to register, to deregister, and to manage their profiles with the EV-VPP. To enable a larger reach of EVs, roaming between VPPs also needs to be provided; similar to cellular telephone systems, this requires appropriate and interconnected back-end infrastructure.

Our approach to solving the above integration problem is centered on two key aspects. First, we postulate the architecture of an EV-VPP. As shown in Figure 2, data flows into the VPP, originating from various entities in the power grid: the loads (ie, the EVs), the transport and distribution grids, and the power generation. The VPP's core task is to derive the optimal charging plans for all vehicles under its control and to disseminate these to the EVs.

Figure 2: EV-VPP Architecture - Message Flow.

The second key postulate is that we perform a centralized, global optimization of the charging behavior. Using information gathered from energy consumers and generators as well as grid operators, we set up a linear programming optimization which computes a charging plan for each EV that minimizes the global energy costs for total EV recharging. We also address potential distribution grid constraints by including power flow bottlenecks in the planning of the charging. The computed charging plan is eventually pushed back to the EVs and its execution is then controlled during daily grid operations; possible overriding control messages are generated in case of non-anticipated deviations from the plans.

We are currently simulating the above optimizations using a synthesized grid patterned after the Bornholm (DK) power grid [1]. We include approximately 11,000 low-voltage access points and simulate the motion of EVs over that grid. For each EV, we maintain historical trip data which we use in order to estimate future energy needs. The generation estimates are based on wind data and conventional generator resources. We have been able to achieve reasonable performance using a fleet of over 1,000 vehicles simulated on conventional laptop computers.

Our current work focuses on integrating more of the available real-world trip and grid data into our simulation. We are also investigating the convergence and complexity issues associated with the incorporation of grid constraints into the planning of EV charging. Based on our work so far, we believe that our approach delivers optimal grid and resource utilization. Our investigations continue.

Links:
[1] Wind power system of the Danish island of Bornholm: Model set-up and determination of operation regimes, http://www.elektro.dtu.dk/English/about_us/staff/staff_lists/staff.aspx?lg=showcommon&id=240821
[2] The EDISON Project, http://www.edison-net.dk/
[3] IBM ILOG CPLEX V12.1: User's Manual for CPLEX, ftp://public.dhe.ibm.com/software/websphere/ilog/docs/optimization/cplex/ps_usrmancplex.pdf

Please contact:
Carl Binding
IBM Zurich Research Laboratory, Switzerland
E-mail: [email protected]
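The planner described above is a full linear program over a whole fleet plus grid constraints (the authors mention CPLEX [3]). As an illustration of the core idea (charge when power is cheap), here is a toy single-EV planner: with only a per-slot power cap and a total-energy target, the cost-minimizing LP solution is simply to fill the cheapest time slots first. All prices and limits below are invented.

```python
# Toy charging-plan optimization for one EV. Minimizing
# sum(price[t] * energy[t]) subject to 0 <= energy[t] <= slot_limit and
# sum(energy) == energy_needed reduces to a fractional knapsack, which a
# greedy cheapest-slot-first sweep solves exactly.
def charging_plan(prices, slot_limit_kwh, energy_needed_kwh):
    """Return the energy (kWh) to draw in each time slot."""
    plan = [0.0] * len(prices)
    remaining = energy_needed_kwh
    for t in sorted(range(len(prices)), key=lambda t: prices[t]):
        take = min(slot_limit_kwh, remaining)
        plan[t] = take
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("not enough charging capacity before departure")
    return plan

prices = [0.30, 0.10, 0.05, 0.20]   # hypothetical EUR/kWh per hourly slot
plan = charging_plan(prices, slot_limit_kwh=3.0, energy_needed_kwh=7.0)
print(plan)   # → [0.0, 3.0, 3.0, 1.0]: the cheapest slots are filled first
```

The fleet-level problem no longer decomposes per vehicle once shared distribution-grid bottlenecks couple the EVs, which is why the article resorts to a general LP solver rather than a greedy rule.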

Graph Transformation for Consolidation of Creativity Sessions Results

by Peter Dolog

The graph transformation approach for consolidating the results of creativity sessions is part of the FP7 EU/IST project idSpace: Tooling of and training for collaborative, distributed product innovation. The goal of the graph transformation approach is to provide a tool for merging the results of various sessions (such as ideation sessions), which are represented as graphs, when the session participants are physically distributed.

Creativity is a mental and social process involving the discovery of new ideas or concepts, or new associations of the creative mind between existing ideas or concepts. It has been studied from various perspectives including behavioural psychology, social psychology, psychometrics, cognitive science, artificial intelligence and philosophy. Elicitation of such ideas is usually supported by one or more so-called "creativity techniques", which are usually performed in a group.

We are dealing with creativity in the context of the discovery of ideas (and associations between them) applied to new product development. From this perspective, each meeting or session which focuses on new product development results in some kind of model of ideas with associations between them. One example of such a model is an idea graph (a graph where nodes represent ideas and edges between them define their relationships).

New product development usually does not end with one session. It typically consists of several sessions performed with several participating teams in different geographical locations. This is especially relevant in situations in which a new product needs to be assembled from components of physically distributed suppliers within a supply chain. As there are usually multiple models resulting from several sessions, there is a natural requirement to consolidate the results with electronic support.

The idSpace project started in April 2008 and finished in March 2010. The idSpace project aims to provide a web-based platform which supports distributed creative sessions and ideation. The results of ideation sessions are represented as graphs. The follow-up processing requires consolidation of the results into a unified model. As the graph structure is adopted, the consolidation can be done with the help of a series of graph transformations. Graph transformations are used to represent merge, subtract and replace operations.

The web-based tools (see Figure 1) are based on lightweight portlet technology, built on Liferay Portal Technology. This method integrates a sketch editor for free drawing in ideation sessions, an idea editor for preserving ideation results based on MxGraph software, a statement portlet for association label suggestions according to the chosen creativity technique, portlets for graph transformations (the merge, subtract and replace buttons in Figure 1), and other portlets realized by different members of the idSpace consortium. The idSpace consortium consists of Open University of the Netherlands, Aalborg University, University of Cyprus, Landesinitiative Neue Kommunikationswege Mecklenburg-Vorpommern e.V., University of Piraeus Research Center, University of Hildesheim, Morpheys Software, Space Applications, and Extreme Media Solutions.

Figure 1: Portlets for graph transformation.

Let us exemplify the idea graph using the 5W1H creativity technique. The 5W1H technique, one of the creativity techniques explored in the idSpace project, stands for the six question words (what, why, where, when, who and how). A typical scenario in which the 5W1H creativity technique may be applied is shown in Figure 1, using the results of a simple "holiday house" ideation session. The aim of this session was to find new ideas for better, improved materials for such a house. One way in which this technique may be applied is to use the answers arising from the 5W1H questions as input for later questions.

The idea behind creativity techniques is to encourage lateral thinking, because it is common human behavior to stick to one line of thought that may be preferred at first glance. This line of thought is steered by the statements, guidelines and questions from the creativity techniques. Each generated idea (as for example in Figure 1) can be connected to another by an association with a label taken from the set of 5W1H questions (for example, the labels on the edges in Figure 1). This can be modeled as a graph. Further, each idea, represented as a node, can be supported by a sketch drawn in the sketch editor. Once a group is finished with the idea graph, a consolidation session might be called upon, in which graph transformations of existing graphs, including the one generated in the latest session, are employed to achieve a final, agreed-upon suggestion for a solution.

The benefit of such an approach is that teams get user-friendly support for the consolidation and preservation of creativity session results. The results are far more than just snapshots of white boards; they can be used for later processing, such as automatic discovery of hidden implicit relations between ideas, recommendation of related existing ideas, guiding users through an information space of ideas, and so on. As a consequence more people are able to contribute ideas, to manage creativity and its results more transparently and efficiently, and to benefit from previous knowledge constructed in other creative sessions.

Link:
http://www.idspace-project.org

Please contact:
Peter Dolog
Aalborg University, Denmark
E-mail: [email protected]
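The three consolidation operations (merge, subtract, replace) can be sketched on a minimal idea-graph encoding: a set of idea nodes plus a set of labelled edge triples, with labels drawn from the 5W1H question words. The encoding and all idea names below are invented for illustration; the idSpace tools operate on richer graph models.

```python
# Sketch of merge / subtract / replace graph transformations on idea
# graphs. A graph is {"nodes": set_of_ideas, "edges": set_of_(a,label,b)}.
def merge(g1, g2):
    """Union of two idea graphs produced in different sessions."""
    return {"nodes": g1["nodes"] | g2["nodes"],
            "edges": g1["edges"] | g2["edges"]}

def subtract(g1, g2):
    """Remove g2's ideas (and any associations touching them) from g1."""
    nodes = g1["nodes"] - g2["nodes"]
    edges = {(a, lbl, b) for (a, lbl, b) in g1["edges"]
             if a in nodes and b in nodes}
    return {"nodes": nodes, "edges": edges}

def replace(g, old, new):
    """Rename an idea, rewriting every association that mentions it."""
    nodes = (g["nodes"] - {old}) | {new}
    edges = {(new if a == old else a, lbl, new if b == old else b)
             for (a, lbl, b) in g["edges"]}
    return {"nodes": nodes, "edges": edges}

# Two hypothetical ideation sessions about the "holiday house" example:
session1 = {"nodes": {"holiday house", "wood"},
            "edges": {("holiday house", "how", "wood")}}
session2 = {"nodes": {"holiday house", "recycled plastic"},
            "edges": {("holiday house", "what", "recycled plastic")}}

unified = merge(session1, session2)
unified = replace(unified, "wood", "treated wood")
```

Because the operations are closed over this representation (each one returns another idea graph), consolidation sessions can chain them in any order, which is exactly what makes a series of graph transformations a natural fit for merging distributed session results.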

Extracting Information from Free-text Mammography Reports

by Andrea Esuli, Diego Marcheggiani and Fabrizio Sebastiani

Researchers from ISTI-CNR, Pisa, aim to effectively and efficiently extract information from free-text mammography reports, as a step towards the automatic transformation of unstructured medical documentation into structured data.

Information extraction is the discipline concerned with the extraction of natural language expressions from free text, where these expressions instantiate concepts of interest in a given domain. For instance, given a corpus of job announcements, one may want to extract from each announcement the natural language expressions that describe the nature of the job, the annual salary, the job location, etc. Put another way, information extraction may be seen as the activity of populating a structured information repository (such as a relational database, where "job", "annual salary" and "job location" count as attributes) from an unstructured information source such as a corpus of free text. Another example of information extraction is searching free text for named entities, ie names of persons, locations, geopolitical organizations, and the like.

An application of great interest is extracting information from free-text medical reports, such as radiology reports. These reports are unstructured in nature, since they are written in free text by medical personnel. However, applying information extraction would be beneficial, since extracting key data and converting them into a structured (eg tabular) format would greatly contribute towards endowing patients with electronic medical records that, aside from improving interoperability among different medical information systems, could be used in a variety of applications, from patient care to epidemiology and clinical studies.

Figure 1: A mammographic report automatically annotated according to the nine concepts of interest.

In recent months we have worked on a system for automatically extracting information from mammography reports written in Italian. There are two main approaches to designing an information extraction system. One is the rule-based approach, which consists of writing a set of rules that relate natural language patterns to the concepts to be extracted from the text. This approach, while potentially effective, is too costly, since it requires a great deal of human effort to write the rules, which are jointly written by a domain expert (say, an expert radiologist) and a natural language engineer. We have followed an alternative approach, which is based on machine learning. In this approach, general-purpose learning software learns to relate natural language patterns to the concepts to be instantiated from a set of manually annotated free texts, ie texts in which the instances of the concepts of interest have been marked by a domain expert. The advantage of this approach is that the human effort required to annotate the texts needed to train the system is much smaller than that needed to manually write the extraction rules.

The system we have built uses "conditional random fields" (CRFs) as the learning technique. CRFs were explicitly devised for managing data of a sequential nature, such as text, and have given good results on text-related tasks such as named entity extraction and part-of-speech tagging.

We have tested our system on a set of 405 mammography reports written (in Italian) by medical personnel of the Radiology Institute of Policlinico Umberto I of Rome, and manually annotated by two expert radiologists of the same institution according to nine concepts of interest (eg "Followup Therapies"). 336 reports were annotated by one radiologist only, while the other 59 were independently annotated by both radiologists. The presence of reports annotated by both radiologists has allowed us to directly compare the accuracy of our system with human accuracy, by comparing the agreement between the system's and the radiologists' annotations with the agreement between the annotations of the two radiologists. Our experiments, run by 10-fold cross-validation, have shown that our system obtains near-human performance: the agreement between system and human, measured by the standard "macroaveraged F1" measure, turned out to be 0.776, while the agreement between the two experts was 0.794 (higher values are better, since 0 and 1 indicate perfect disagreement and perfect agreement, respectively). These results are especially encouraging because no specialized lexical resource was used in the experiments, since no such resource exists for the radiological/mammographic sector in Italian.

Please contact:
Fabrizio Sebastiani
ISTI-CNR, Italy
E-mail: [email protected]

Parental Control for Mobile Devices

by Gabriele Costa, Francesco la Torre, Fabio Martinelli and Paolo Mori

As a result of the rapid increase in mobile device capabilities, we believe that many users will soon migrate most of their tasks from the computer to the smart phone. We thus feel that new, more effective security mechanisms are needed to protect users, especially minors. For this reason, we are currently engineering a software suite designed to monitor all security-relevant activities and, where necessary, block unsuitable material.

In western countries mobile phones now outnumber people. The majority of these phones have embedded cameras and good connectivity support, connecting to the Internet via their mobile operator network or in wireless mode, and communicating directly with other devices through a Bluetooth interface. This means that the gap in functionality between personal computers and smart phones is continuously being reduced. Mobile devices such as smart phones or Personal Digital Assistants (PDAs) are now powerful enough to run applications that are comparable to those running on normal PCs.

The mobile phone is thus now typically employed for many tasks, such as browsing the Internet, reading and writing e-mails, exchanging data (photos, contacts, files, etc) with other devices, and so on. Clearly, these operations bring with them new vulnerabilities and raise new threats to the user's privacy. A particular source of concern is the widespread use of mobile devices by young people and minors, who can easily access content that is inappropriate for their age. Filters provided by the mobile telco operators are not sufficient to block such material because they only inspect data sent via the telco network and have no control over direct connections (eg Bluetooth). As a consequence, in recent years much research has focused on the provision of customised security support for mobile devices. However, what is still missing is a general security framework that controls applications as well as phone calls and message content.

In order to meet this need, calling upon the experience gained in the EU project S3MS (Security of Software and Services for Mobile Systems), we have developed a new modular security monitor for mobile devices which uses a plugin-based, extensible architecture to meet security requirements that change from device to device and from user to user.

Our system aims to monitor all security-relevant activities involving a mobile device. The basic idea is to develop a minimal security core that evaluates the platform security state, accepts customised security policies, and deals with heterogeneous system events.

The system can be easily extended with new modules. When, following authentication, a module M is plugged into the system, it injects the list of its security-relevant events and


Figure 1: The system core consists of three components: an events interface (Interface), a Policy Decision Point (PDP) and an Attribute Manager (AM). The Interface receives the security events, authenticating and dispatching them to the PDP. The PDP holds the policy defining the currently enforced rules and the security state. Whenever it receives an event from the Interface, it updates the internal state and computes the reaction to be returned. The PDP is able to access some system information through the AM.

the new attributes it needs to handle, and then disseminates its own Policy Enforcement Points (PEPs) within the device software.

Each PEP checks the behaviour of a critical system component and enforces certain operations (depending on what the PEP is set to monitor). When a PEP captures a sensitive action, it generates a corresponding event and sends it to the system interface. It will then receive the action to be enforced on its target.

Security policies are defined using an editor accessible only by the device owner. The policy editor provides the user with a human-readable graphic interface, so that the system policy can be defined or modified without the need for technical skills. Security policies generated by the editor are passed to the PDP to be loaded and enforced.

A very general class of security rules that can involve multiple modules is thus obtained. The example shown in Figure 1 uses three modules: M1 (red) for message control, M2 (blue) for position control and M3 (green) for Java application control. Each module places its own PEP, declares the necessary attributes and specifies its security actions and reactions (labels above and below the connections, respectively). For instance, the PEP watching the transmission of messages sends an action signal whenever a message is sent or received. The action contains information about the message content (msg) and the sender/recipient (from/to). The possible reactions to this event are: allow (if the policy permits the operation), deny (if the policy prohibits it) or delete an attachment (if the message contains dangerous material). Assessment of the danger of an attachment (eg a picture or a video-clip) is done by automatically classifying its content, either remotely on a server or locally on the device itself. The system maps one of its attributes to the classifier invocation and uses it to determine whether a picture is offensive (eg pornographic) or not.

Another PEP, attached to the system GPS, gives the geographical position of the device whenever it is available. This information can be exploited to define location-based security policies, eg alerting a parent if the device moves outside a certain area.

The last PEP monitors security-relevant operations (eg open/close connections or read/write files) performed by Java MIDlets. This PEP provides valuable feedback on which Java program is executed (method MIDlet.start()).

The large variety of information available to our PDPs makes it possible to apply very expressive constraints. In this way parents can tailor a specific security policy for their children's devices. For instance, with a configuration similar to that shown in Figure 1, rules like "Do not execute Java games at school" or "Alert me with an SMS if the device receives pornographic content (and block this content)" can be defined. Such rules can be composed into highly expressive and effective security policies.

Please contact:
Fabio Martinelli
IIT-CNR, Italy
E-mail: [email protected]
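The decision flow described above (PEPs report security-relevant events, the PDP evaluates its policy rules against the current attribute state and returns a reaction) can be sketched as follows. This is an assumed, simplified model rather than the S3MS-based implementation; the rule names, event fields and reactions are invented to mirror the examples in the article.

```python
# Minimal sketch of a Policy Decision Point: each rule inspects an event
# plus the shared attribute state and either returns a reaction or None.
def no_java_at_school(event, state):
    # Hypothetical location-based rule combining modules M2 (GPS) and M3 (Java).
    if event["type"] == "midlet_start" and state.get("location") == "school":
        return "deny"
    return None

def block_offensive_attachment(event, state):
    # Hypothetical rule for module M1: the attachment classifier's verdict
    # is assumed to arrive as a boolean flag on the message event.
    if event["type"] == "message" and event.get("offensive_attachment"):
        return "delete_attachment"
    return None

class PDP:
    def __init__(self, rules):
        self.rules = rules
        self.state = {}   # attributes kept up to date via the Attribute Manager

    def decide(self, event):
        """First matching rule wins; the default reaction is 'allow'."""
        for rule in self.rules:
            reaction = rule(event, self.state)
            if reaction is not None:
                return reaction
        return "allow"

pdp = PDP([no_java_at_school, block_offensive_attachment])
pdp.state["location"] = "school"   # fed by the GPS PEP (module M2)
print(pdp.decide({"type": "midlet_start", "name": "game"}))           # deny
print(pdp.decide({"type": "message", "offensive_attachment": True}))  # delete_attachment
```

Keeping rules as small event-plus-state predicates is what lets independently installed modules contribute to one composite policy: each module only declares the events and attributes it understands, and the PDP composes the reactions.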

Preschool Information When and Where You Need it

by Stina Nylander

Having children in preschool can be a logistic challenge. Parents need to keep track of dates for events, what to bring, changes in the opening hours of the preschool, and many other things. PreschoolOnline is an attempt to support the information exchange between parents and preschool teachers, and provides an opportunity to study how ICT can be used in preschool.

As part of a joint project carried out by SICS, Squace AB and Nockeby Preschools, we have developed PreschoolOnline, a web service that is also available on mobile phones. The service is organized in communities corresponding to the groups of children in preschool, and allows parents to access information about their child's group and the preschool, as well as contact information for the other parents in that group. Teachers can add and update information about the preschool, parents can add and update information about themselves and their children, and everyone can post on the notice board.

During the project, PreschoolOnline was deployed in six preschools and used by more than 500 parents in a five-month field trial. This gave the researchers an excellent opportunity to study real-life use through surveys, interviews, and logs. In general, both parents and teachers held positive attitudes toward the service. PreschoolOnline offered access to information when and where it was needed. Moreover, by providing a means of communicating routine and logistical information, the service enabled teachers and parents to spend more time talking about the more individual aspects of the children's schooling and development when they met at preschool.

PreschoolOnline showed that even a simple service can improve the information exchange between parents and teachers. We are now adding functionality to the service to create not only a tool for disseminating information but also one for gathering information, such as vacation dates and whether or not parents plan to attend a meeting.

The six original preschools have chosen to continue to use PreschoolOnline after the project has ended, and more than 40 other preschools have tried out the service on their own initiative. Furthermore, the City of Stockholm is interested in using PreschoolOnline for all preschools in the city.

Link:
http://www.forskolanimobilen.se

Please contact:
Stina Nylander
SICS, Sweden
Tel: +46 8 633 15 69
E-mail: [email protected]
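The division of rights described in the article (teachers update preschool information, parents update their own family details, everyone in a group can post notices) could be modelled along the following lines. This is a hypothetical sketch of such a community data model, not PreschoolOnline's actual code; all names and fields are invented.

```python
# Sketch of a community-scoped data model with simple role checks.
class Group:
    def __init__(self, name):
        self.name = name
        self.preschool_info = {}   # eg opening hours, event dates
        self.family_info = {}      # parent name -> details about the family
        self.notice_board = []     # (author, message) pairs, open to all

    def update_preschool_info(self, user, key, value):
        if user["role"] != "teacher":
            raise PermissionError("only teachers update preschool info")
        self.preschool_info[key] = value

    def update_family_info(self, user, details):
        if user["role"] != "parent":
            raise PermissionError("only parents update family info")
        self.family_info[user["name"]] = details

    def post_notice(self, user, message):
        # Both roles may post to the shared notice board.
        self.notice_board.append((user["name"], message))

group = Group("Blue Room")
teacher = {"name": "Anna", "role": "teacher"}
parent = {"name": "Stina", "role": "parent"}
group.update_preschool_info(teacher, "opening_hours", "07:30-17:00")
group.update_family_info(parent, {"child": "Erik", "allergies": "none"})
group.post_notice(teacher, "Bring rain clothes on Friday")
```

Scoping every store to one group object mirrors the article's community-per-child-group organization, and makes it straightforward to add the planned information-gathering features (eg vacation dates) as further per-group fields.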

Events

Call for Papers

HCI International 2011 - 14th International Conference on Human-Computer Interaction

Hilton Orlando Bonnet Creek, Orlando, Florida, USA, 9-14 July 2011

jointly with:
• Symposium on Human Interface (Japan) 2011
• 9th International Conference on Engineering Psychology and Cognitive Ergonomics
• 6th International Conference on Universal Access in Human-Computer Interaction
• 4th International Conference on Virtual and Mixed Reality
• 4th International Conference on Internationalization, Design and Global Development
• 4th International Conference on Online Communities and Social Computing
• 6th International Conference on Augmented Cognition
• 3rd International Conference on Digital Human Modeling
• 2nd International Conference on Human Centered Design
• 1st International Conference on Design, User Experience, and Usability

Overview and Areas of Interest
HCI International 2011, jointly with the affiliated conferences, which are held under one management and one registration, invites you to Orlando, Florida, USA, to participate in and contribute to the international forum for the dissemination and exchange of up-to-date scientific information on theoretical, generic and applied areas of HCI, through the following modes of communication: Plenary/Keynote Presentation(s), Parallel Sessions, Poster Sessions, Tutorials and Exhibition.

The Conference focuses on the following major thematic areas:
• Ergonomics and Health Aspects of Work with Computers
• Human Interface and the Management of Information
• Human-Computer Interaction
• Engineering Psychology and Cognitive Ergonomics
• Universal Access in Human-Computer Interaction
• Virtual and Mixed Reality
• Internationalization, Design and Global Development
• Online Communities and Social Computing
• Augmented Cognition
• Digital Human Modeling
• Human Centered Design
• Design, User Experience, and Usability

Deadlines
• Papers
Abstract: 15 October 2010
Notification: 3 December 2010
Camera-ready: 4 February 2011
• Posters
Abstract: 11 February 2011
Notification: 4 March 2011
Camera-ready: 1 April 2011
• Tutorials
Abstract: 15 October 2010
Notification: 3 December 2010
Camera-ready: 6 May 2011

Individuals can appear as co-authors in several papers in HCI International 2011 and the affiliated conferences, but can present only one paper.

Proceedings
The HCI International 2011 Conference Proceedings will be published by Springer in a multivolume set in the Lecture Notes in Computer Science (LNCS) and Lecture Notes in Artificial Intelligence (LNAI) series.

Awards
A total of thirteen awards will be conferred during HCI International 2011. A plaque and a certificate will be given to the best paper of each of the eleven Affiliated Conferences / Thematic Areas, amongst which one will be selected to receive the golden award as the Best HCI International 2011 Conference paper. Moreover, the Best Poster extended abstract will also receive a plaque and a certificate.

Keynote address
Professor Ben Shneiderman is the keynote speaker for HCI International 2011. Ben Shneiderman is a Professor in the Department of Computer Science, founding director of the Human-Computer Interaction Lab and Member of the Institute for Advanced Computer Studies at the University of Maryland. The pioneering work of Ben Shneiderman has played a key role in the development of HCI as a new academic discipline, by bringing scientific methods to the study of human use of computers. For more than 30 years, Ben Shneiderman has promoted HCI through his prolific writing and lecturing. He is one of the most frequently referenced researchers in computer science. The title of his keynote address is "Technology-Mediated Social Participation: The Next 25 Years of HCI Challenges".

Exhibition
The HCI International Conference is an ideal opportunity to exhibit your products and services to an international audience of about 2,000 researchers, academics, professionals and users in the field of HCI. Attendees will be able to examine state-of-the-art HCI technology and interact with manufacturing representatives, vendors, publishers, and potential employers.

Student Volunteers
The HCI International 2011 Program for Student Volunteers gives university students from around the world the opportunity to attend and contribute to one of the most prestigious conferences in the field of Computing and Human-Computer Interaction. Being a Student Volunteer is a great opportunity to interact closely with researchers, academics and practitioners from various disciplines, meet other students from around the world, and promote personal and professional growth. The HCI International Conference's past, present and continued future successes are due in large part to the skills, talents and dedication of its Student Volunteers.

More information:
http://www.hcii2011.org
A Volunteer is a great opportunity to Human Centered Design plaque and a certificate will be given to interact closely with researchers, aca- • 1st International Conference on the best paper of each of the eleven demics and practitioners from various Design, User Experience, and Usability Affiliated Conferences / Thematic disciplines, meet other students from Areas, amongst which, one will be around the world, and promote personal Overview and Areas of Interest selected to receive the golden award as and profession growth. The HCI HCI International 2011 jointly with the the Best HCI International 2011 International Conference’s past, present affiliated conferences, which are held Conference paper. Moreover, the Best and continued future successes are due under one management and one registra- Poster extended abstract will also in a large part to the skills, talents and tion, invite you to Orlando, Florida, receive a plaque and a certificate. dedication of its Student Volunteers. USA to participate and contribute to the international forum for the dissemina- Keynote address More information: tion and exchange of up-to-date scien- Professor Ben Shneiderman is the http://www.hcii2011.org tific information on theoretical, generic keynote speaker for HCI International and applied areas of HCI through the 2011. Ben Shneiderman is a Professor in

Call for Participation

CEDI 2010 - Third Spanish Conference on Informatics

Valencia, Spain, 7-10 September 2010

Several national symposia and workshops have traditionally been organized on specific topics related to computer science and engineering. CEDI joins all of them into a single conference, and this is the third time that CEDI has been organized. The main goal is to gather the Spanish research community in computer science and engineering in order to proudly show our scientific advances, discuss specific problems and gain visibility in Spanish society, emphasizing Spain's role in the new age of the Information Society.

The conference is organized in a 'federated conference' format, joining several more specific symposia (22 and growing), including Artificial Intelligence, Software Engineering and Databases, Programming Languages, Parallelism and Computer Architecture, Computer Graphics, Concurrency, Soft Computing, Pattern Recognition, Natural Language Processing, Ubiquitous Computing and Human-Computer Interaction. In addition to the scientific activities of each symposium, the CEDI conference will feature two invited keynotes, two round tables, social events and a gala dinner at which several awards will be presented. Through these awards, the research community in informatics will recognize the efforts of colleagues in promoting informatics research in Spain.

The expected attendance is around 1,500 scientists. The event is funded by the Spanish Ministry of Education and sponsored by SpaRCIM. The conference will be organized by the Universidad Politécnica de Valencia. The conference chair will be Prof. Isidro Ramos, the Scientific Program Chair will be Prof. José Duato, and the Local Organization Chair will be Prof. Juan Miguel Martinez, all of whom are from the Universidad Politécnica de Valencia. This university is one of the leading technical universities in Spain, and has a strong focus on innovation and research.

More information:
http://www.congresocedi.es/2010/

Call for Submissions

Special Issue of the International Journal of RF Technologies: Research & Applications on RFID and the Internet of Things in Europe

The ERCIM-managed RACE network RFID (http://www.race-networkrfid.eu/) is designed to become a federating platform for the benefit of all European stakeholders in the development, adoption and usage of RFID. The network solicits submissions to the Special Issue of the International Journal of RF Technologies: Research & Applications on research in the fields of RFID and the Internet of Things in Europe.

The mission of the Special Issue is to provide an overview of the ongoing research activities concerning RFID and the Internet of Things. The overall objectives are:
• to provide a European perspective on research and development
• to discuss European stakeholder involvement, ethical issues and governance
• to evaluate the impact of RFID and the Internet of Things on people
• to present new business developments; and
• to build a bridge between research and practice in Europe.

Target audience
The Special Issue is intended to support a professional audience of researchers, top managers and governmental institutions. So even though we will require a professional scientific contribution, the style of writing should address a wider audience. Submissions need to contribute new and original research and review articles. Contributions are specifically welcome from ongoing European research projects.

Deadline for submission: 15 September 2010. All submitted chapters will be reviewed (double-blind review). The Special Issue is scheduled to be published in the second quarter of 2011.

More information:
http://www.iospress.nl/loadtop/load.php?isbn=17545730
http://www.race-networkrfid.eu/

Call for Participation

W3C Web on TV Workshop

Tokyo, Japan, 2-3 September 2010

The demand for access to applications, video, and other network services continues to grow. The Web platform itself continues its expansion to support mobile devices, television, home appliances, in-car systems, and more consumer electronics. To meet the growing demand, the Web platform of the future will require smarter integration of non-PC devices with Web technology, so that both hardware and software vendors can provide richer Web applications on various devices at lower cost. This is the first of a planned series of "Web on TV" workshops to bring various communities together to discuss this integration. This first workshop will be hosted by W3C/Keio with the support of the Japanese Ministry of Internal Affairs and Communications. The workshop will be conducted in Japanese, Korean, Chinese, and English via simultaneous translation. A meeting summary will be available in English. A workshop of this series is planned in Europe in early 2011.

More information:
W3C workshop calendar: http://www.w3.org/2003/08/Workshops/
W3C Web on TV Workshop: http://www.w3.org/2010/09/web-on-tv/

Call for Participation

ICT 2010

Brussels, 27-29 September 2010

ICT 2010, Europe's biggest event for research in information and communication technologies, will be organised by the European Commission DG Information Society and Media. ERCIM and ERCIM-managed projects will most certainly have representation at this key event through booths and speakers, as in Lyon, France, in 2008.

More information:
http://ec.europa.eu/information_society/events/ict/2010/

In Brief

Master of Science in Computational Biology and Biomedicine at Nice

While biological data exhibit a formidable diversity, the past two decades have seen the advent of massive data produced either by high-throughput experiments or by measurement devices of increasing accuracy, at very different scales ranging from nano to macro. Handling these massive and complex data within a virtuous cycle linking modeling and measurement is one of the major challenges in Computational Biology and Biomedicine.

Starting your career
The aim of this program is to provide excellent academic or industrial career opportunities by offering high-level coverage of the modeling and computing principles that will enable the challenges to be met and tomorrow's technological choices to be made in the biological and medical computing domains.

Scientific goal
The goal of this program is to focus on the human being from different perspectives (understanding and modeling functional aspects, or interpreting biomedical signals from various devices) and at different scales (from molecules to organs and the whole organism). Courses are organized into three categories: 1) bioinformatics, 2) biomedical signal and image analysis, and 3) modeling in neuroscience. All classes will be given in English by outstanding professors and researchers from research institutes on the campus: Nice Sophia Antipolis University/CNRS (I3S, IPMC, LJAD, INLN) and INRIA.

Scholarships
The scholarship program offers outstanding students the chance to receive a grant covering living expenses during the first period of the master. Early application increases the chances of receiving financial aid; it is therefore in applicants' best interest to submit their application in the first round of applications.

Admission criteria
Due to the multi-disciplinary nature of the program, it is designed for those having completed the first year of an MSc program at their home institution in either computer science, electrical engineering, applied mathematics, mathematical biology, bioinformatics or biophysics.

Please contact:
Frederic Cazals and Pierre Kornprobst
Programme coordinators
E-mail: [email protected]

More information:
http://www.computationalbiology.eu

Spanish Pilot Project for Training and Development of Technology Applied to Systems Biology

The 'KHAOS' research group at the University of Málaga has developed several applications in a pilot project for training and development of technology applied to Systems Biology.

With the appearance of analysis tools in systems biology, sources of information have been growing exponentially, creating a need to address the problem of integrating existing information. The applications developed in the pilot project solve specific problems facing researchers working in this area. These applications focus on data integration and Web Services by means of semantics (and the use of data available as Linked Data):
• BioBroker: integrates different biological databases using XML as a model of data interchange (Navas-Delgado et al., SPE 2006). This was the starting point for the activities of KHAOS in Computational Biology.
• SB-KOM: a mediator based on the KOMF framework, using optimized scheduling algorithms for the biological databases (Navas-Delgado et al., Bioinformatics 2008). This mediator is the natural evolution of BioBroker towards the use of semantics when integrating biological data.
• ASP 3D Model Finder (AMMO-Prot): an application that enables the user to view the three-dimensional structures of proteins in the metabolism of amines, generated computationally from existing information. It is the first tool that took advantage of SB-KOM.
• SBMM Assistant (http://sbmm.uma.es): allows the user to locate information about metabolic routes (including details about their components) and to edit these routes. This tool allows users studying metabolic pathways to retrieve information from many databases and curate it manually.
• Social Pathway Annotation (http://sbmm.uma.es/SPA): an extension of SBMM that introduces social networking tools to enable the collaborative curation of pathway models.
• BioSStore: addresses the drawbacks presented by bioinformatics services and tries to improve the current semantic model by introducing the W3C standard Semantic Annotations for WSDL and XML Schema (SAWSDL) and related proposals (WSMO-Lite).
• Scientific Mashups: the application of mashups can be useful within biology. Automatic processes are needed to develop biological components that end-users can then combine into specific biological mashups using a mashup platform, ie the EzWeb Platform.

All research conducted by KHAOS is funded by several research grants.

More information:
http://khaos.uma.es
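The wrapper/mediator pattern behind tools such as BioBroker and SB-KOM can be illustrated with a toy sketch: two sources that store the same kind of entity in different shapes are each mapped into one common record format and then merged. This is only an illustration of the general pattern; the function names and record layouts below are hypothetical and do not reflect the actual KOMF or SB-KOM interfaces.

```python
# Toy mediator-style integration: each source gets a "wrapper" that maps
# its native record shape into a shared schema, and the mediator merges
# the results. All names and layouts here are hypothetical examples.

def from_source_a(row):
    # Source A stores protein entries as (id, name) tuples.
    pid, name = row
    return {"id": pid, "name": name, "source": "A"}

def from_source_b(row):
    # Source B stores protein entries as dicts with different field names.
    return {"id": row["accession"], "name": row["label"], "source": "B"}

def mediate(source_a, source_b):
    """Merge both sources into one list of records in the common schema."""
    merged = [from_source_a(r) for r in source_a]
    merged += [from_source_b(r) for r in source_b]
    # Deduplicate by identifier, keeping the first occurrence.
    seen, result = set(), []
    for rec in merged:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            result.append(rec)
    return result

if __name__ == "__main__":
    a = [("P1", "Histamine N-methyltransferase")]
    b = [{"accession": "P2", "label": "Spermidine synthase"},
         {"accession": "P1", "label": "HNMT"}]  # same id in both sources
    for rec in mediate(a, b):
        print(rec)
```

In a real semantic mediator the hard-coded converters would be replaced by mappings to a shared ontology, and a scheduler would decide how to query each database, but the flow of records through wrappers into a common model is the same.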

Van Dantzig Prize for Peter Grünwald and Harry van Zanten

In April, Peter Grünwald (CWI and Leiden University) and Harry van Zanten (Eindhoven University of Technology and EURANDOM) received the Van Dantzig Prize, the highest Dutch award in statistics and operations research. The prize is awarded every five years by the Netherlands Society for Statistics and Operations Research (VVS-OR) to a young researcher who has contributed significantly to developments in these research areas.

Photo: Peter Grünwald (left) and Harry van Zanten.

More information:
http://www.vvs-or.nl

VISITO Tuscany

VISITO Tuscany was presented by ISTI-CNR at a press conference in the historical centre of Pisa. VISITO provides an interactive guide for tourists visiting cities of art via a smartphone application. To receive detailed information on a monument, the user simply takes a photo of it. The system will support tourists in all phases of their trip, from the initial planning to post-visit archiving and sharing over social networks. Leveraging techniques of image analysis, content recognition, and 3D scanning and browsing, the tourist can, on returning home, relive the highlights of the trip via virtual visits to the monuments photographed. The project, coordinated by Giuseppe Amato, ISTI-CNR, is supported by the Tuscan Region and by the European Regional Development Fund. Project members include IIT-CNR and three private companies: Alinari24Ore, Hyperborea, and 3Logic MK.

More information:
http://www.visitotuscany.it/index.php/en

MoTeCo - A New European Partnership for Development in Mobile Technology Competence

MoTeCo is a new project supported by the European Leonardo da Vinci Programme that will offer a training curriculum in mobile technology, in particular for the Symbian operating system (http://www.symbian.org/). At the moment, such training is accessible only in a few European countries located in the north-west of the continent.

The project will assure better access to training in this field by transferring the training methodology from already existing training centres in Europe to new training centres in South and Central Europe. This will be achieved by:
• the establishment of five new training centres for mobile software development
• five training curricula with additional training material (textbooks, workbooks, presentations, e-learning platforms) adapted to specific national and institutional needs
• professional preparation of a group of 21 trainers ready to prepare other trainers and programmers.

More generally, MoTeCo will thus transfer knowledge to the regions of the partners, offering a unique experience for the institutions involved in the project. As a main contribution of the project, a new method for transferring the training curriculum developed will be proposed, as well as good practices to be used in similar dissemination processes.

Partners involved in the project are institutions with experience in vocational education. They comprise the company COMARCH S.A., the University of Málaga, AGH University of Science and Technology, Universidad Politécnica de Cartagena and the Grenoble Institute of Technology. This project will help the partners to establish a stable platform to realise the project and a good basis for future educational projects. Co-operation of private training companies and universities offers the chance to develop the unique potential of IT and educational standards to realise common projects. This assures the creation of new, attractive vocational training adapted to the commercial use of the technology.

More information:
http://moteco.kt.agh.edu.pl/index.php

ERCIM – the European Research Consortium for Informatics and Mathematics is an organisation dedicated to the advancement of European research and development in information technology and applied mathematics. Its national member institutions aim to foster collaborative work within the European research community and to increase co-operation with European industry.

ERCIM is the European Host of the World Wide Web Consortium.

Austrian Association for Research in IT
c/o Österreichische Computer Gesellschaft
Wollzeile 1-3, A-1010 Wien, Austria
http://www.aarit.at/

Irish Universities Association
c/o School of Computing, Dublin City University
Glasnevin, Dublin 9, Ireland
http://ercim.computing.dcu.ie/

Consiglio Nazionale delle Ricerche, ISTI-CNR
Area della Ricerca CNR di Pisa
Via G. Moruzzi 1, 56124 Pisa, Italy
http://www.isti.cnr.it/

Norwegian University of Science and Technology
Faculty of Information Technology, Mathematics and Electrical Engineering
N-7491 Trondheim, Norway
http://www.ntnu.no/

Czech Research Consortium for Informatics and Mathematics
FI MU, Botanicka 68a, CZ-602 00 Brno, Czech Republic
http://www.utia.cas.cz/CRCIM/home.html

Portuguese ERCIM Grouping
c/o INESC Porto, Campus da FEUP
Rua Dr. Roberto Frias, nº 378, 4200-465 Porto, Portugal

Polish Research Consortium for Informatics and Mathematics
Wydział Matematyki, Informatyki i Mechaniki, Uniwersytetu Warszawskiego
ul. Banacha 2, 02-097 Warszawa, Poland
http://www.plercim.pl/

Centrum Wiskunde & Informatica
Science Park 123, NL-1098 XG Amsterdam, The Netherlands
http://www.cwi.nl/

Science and Technology Facilities Council, Rutherford Appleton Laboratory
Harwell Science and Innovation Campus
Chilton, Didcot, Oxfordshire OX11 0QX, United Kingdom
http://www.scitech.ac.uk/

Danish Research Association for Informatics and Mathematics
c/o Aalborg University, Selma Lagerlöfs Vej 300, 9220 Aalborg East, Denmark
http://www.danaim.dk/

Spanish Research Consortium for Informatics and Mathematics
D3301, Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo s/n, 28660 Boadilla del Monte, Madrid, Spain
http://www.sparcim.es/

Fonds National de la Recherche
6, rue Antoine de Saint-Exupéry, B.P. 1777
L-1017 Luxembourg-Kirchberg
http://www.fnr.lu/

Swedish Institute of Computer Science
Box 1263, SE-164 29 Kista, Sweden
http://www.sics.se/

FWO
Egmontstraat 5
B-1000 Brussels, Belgium
http://www.fwo.be/

FNRS
rue d'Egmont 5
B-1000 Brussels, Belgium
http://www.fnrs.be/

Swiss Association for Research in Information Technology
c/o Professor Daniel Thalmann, EPFL-VRlab, CH-1015 Lausanne, Switzerland
http://www.sarit.ch/

Foundation for Research and Technology – Hellas (FORTH)
Institute of Computer Science
P.O. Box 1385, GR-71110 Heraklion, Crete, Greece
http://www.ics.forth.gr/

Fraunhofer ICT Group
Friedrichstr. 60, 10117 Berlin, Germany
http://www.iuk.fraunhofer.de/

Magyar Tudományos Akadémia Számítástechnikai és Automatizálási Kutató Intézet
P.O. Box 63, H-1518 Budapest, Hungary
http://www.sztaki.hu/

Institut National de Recherche en Informatique et en Automatique
B.P. 105, F-78153 Le Chesnay, France
http://www.inria.fr/

Technical Research Centre of Finland
PO Box 1000
FIN-02044 VTT, Finland
http://www.vtt.fi/

Order Form

If you wish to subscribe to ERCIM News free of charge, or if you know of a colleague who would like to receive regular copies of ERCIM News, please fill in this form and we will add you/them to the mailing list.

I wish to subscribe to the: o printed edition  o online edition (email required)

Name:
Organisation/Company:
Address:
Postal Code:
City:
Country:
E-mail:

Send, fax or email this form to:
ERCIM NEWS
2004 route des Lucioles
BP 93
F-06902 Sophia Antipolis Cedex
Fax: +33 4 9238 5011
E-mail: [email protected]

Data from this form will be held on a computer database. By giving your email address, you allow ERCIM to send you email.

You can also subscribe to ERCIM News and order back copies by filling out the form at the ERCIM Web site at http://ercim-news.ercim.eu/