SECTION interface [...] Issue 1 | 2008 1

Unraveling oncogenic pathways

Interview: Carole Goble Developments in

Rolling out the Dutch Life Sciences Grid SECTION interface 2 Content Issue 1 | 2008 Content

EditoriaL Lively science 3

COVERSTORY Unraveling oncogenic pathways 4

INTERVIEW Carole Goble, Manchester 7 ≥  Dutch have the potential to progress in leaps and bounds PROGRESS Approaches to improve the quality of protein structures 10 solved in the past

PROGRESS Mapping dynamic data by graphs and colors 12

Hands on Grid accelerates life scientist’s research 14 ≥  Making use of all computing capacity in a flexible way Thesis Dynamic software infrastruc- tures for the life sciences 16

course Getting started with web services and work flows 18

Portrait Eight questions for Jayne Hehir-Kwa 19

SPIN OFF InteRNA Genomics offers customized tools for analyzing 20 microRNAs

In the picture NBIC news 21

Column Timo Breit: Collaborate or perish 24 SECTION interface Editorial Issue 1 | 2008 3 Lively Science

Over the last decade bioinformatics has been established in the Nether- lands. NBIC and its consortium members have played an essential role in this and many people have invested significant amounts of time and effort to shape and support NBIC. Together we have been able to build a vibrant and well-organized bioinformatics community found in few other countries. Today, the momentum and impact of this scientific discipline are rapidly increasing and NBIC is steadily intensifying efforts to build a firmer international position and visibility of Dutch bioinformatics and e-bioscience.

NBIC focuses on bioinformatics and e-biosciences projects in research, sup- port, education, dissemination and exploitation. Although these projects are clearly at different levels of maturity, the wide range of projects under a single umbrella does make NBIC a unique organization. The coordination and facilitation by a single organization has already demonstrated major advantages in establishing critical mass, coherence, and collaboration within and between projects, which in turn contribute to scientific quality and broader application of research results. NBIC pursues top bioinfor- matics research to secure top life sciences research. A strong connec- tion to Dutch life sciences is, therefore, a major ingredient of the NBIC strategy. Although the current focus is on NGI genomics centres we also Antoine foresee more collaboration with other consortia such as top institutes. In the main interview of this magazine, Professor Carole Goble explains van why e-bioscience is interesting and relevant for the life sciences field. E-bioscience plays a special and prominent role within our programmes Kampen and is implemented in close collaboration with VLe and BiG Grid, while a collaboration with OMII-UK and myGrid is being set up. The BioAssist pro- Scientific director NBIC gramme embraces the e-bioscience paradigm of setting up and delivering bioinformatics support for selected life sciences domains. This is done in a collaborative effort between (bio)informatics and genomics research groups. BioAssist initiated lively discussions about the pros and cons of the e-bioscience approach. However, it is my strong belief that e-biosci- ence is the appropriate model for the organization of necessary support and will prove to be the only model to advance systems level science such as systems biology. Both require access to integrated tools, facilities and expertise to address the ever more complex research questions.

You are one of many stakeholders of NBIC and many of you have asked us to keep you up-to-date. Therefore, we are pleased to present you with this first issue of ‘Interface’ which presents developments in bioinformatics. The purpose of this magazine is to be a true interface between develop- ers and users of bioinformatics applications. The magazine allows mem- bers from the (bio)informatics and life sciences communities to share developments, experiences and know-how. The “hands on” section, for example, introduces Grid-technology for life sciences. ‘Interface’ also refers to the many software and database interfaces produced in our field. Two young bioinformatics researchers present their latest bioin- formatics applications in the section ‘Progress’. Other sections present developments in valorization and the recently initiated Regional Student Group. NBIC hopes that this magazine will give you more and better insight into the developments in our community. We look forward to your future contributions. Enjoy! SECTION interface 4 Coverstory Issue 1 | 2008

BY MARIAN VAN OPSTAL Cellular interaction mechanisms recon- structed from available experimental data. STEP BY STEP: Unravelling oncogenic pathways

The quest for novel biomolecular knowledge is pursued through a symbiotic relationship between data engineers and biomolecular re­ searchers. This article shows how joint efforts are stepping forward on the way of understand­ ing tumour development. SECTION interface Coverstory Issue 1 | 2008 5

ered, offers a potent method for the identification of cancer genes,” Wessels explains. assive amounts of data are being, Biological samples are obtained by infecting mice with and still will be, produced contain- slow transforming retroviruses which cause them to ing important facts about living or- develop tumours because the virus inserts randomly in ganisms. These measurements open their genome and mutates cancer genes. The position newM avenues to discover novel genetic and mo- of the insertions can be determined by amplifying DNA lecular pathways within cells. using PCR amplification techniques and mapping the resulting sequences onto the genome. In this way inser- Traditional ways to study biological phenomena in a tional mutagenesis screens are obtained. The regions in gene-by-gene approach, so-called reductionism, are the genome that are mutated in multiple independent time-consuming and make it hard to see the complete tumours are likely to contain genes involved in tumori- picture. Instead, the cell should be studied as a net- genesis. Over the last few years an extensive amount of work of complex interactions. “Computational analysis insertional mutagenesis data has been compiled and to derive new biological knowledge about underlying published in the publicly available Retroviral Tagged molecular mechanisms plays an increasingly important Cancer Gene Database (RTCGD). role in biomolecular sciences,” says Professor Marcel Besides these insertional mutagenesis screens, an- Reinders, head of the bioinformatics research group at other type of biological data is used to extract onco- the Delft University of Technology. genes and oncogenic pathways, the so-called array One of the research themes within Reinders’ group is Comparative Genomic Hybridization (aCGH) data. Such pathway reconstruction. Pathways can, amongst other datasets record DNA copy number alteration (CNA) in things, be reconstructed based on data obtained from tumorigenesis. These CNAs are aberrations in the form directed perturbation of a biological system. In this of amplicons (repeating pieces of same DNA) or dele- case, both the perturbation and the effects thereof on tions (missing pieces of DNA). Actually CNAs are a result the system are recorded. In particularly, this approach of genomic instability leading to DNA copying-faults, is used to unravel molecular pathways in tumorigen- which has been described as an enabling mechanism of esis, the work which is discussed in this article. The tumorigenesis. research is performed in close cooperation with the Netherlands Cancer Institute (NKI). Data engineers at work After collecting the Dr. Lodewyk Wessels, group leader of the Bioinformat- genomic data sets on tumours, the data engineers start ics and Statistics group (division of Molecular Biology) to work on trying to extract oncogenes and oncogenic of the NKI summarizes the ultimate aim of pathway pathways from this biological data. To this end the bio- reconstruction in tumour cells: “We try to identify on- informatics group of Delft and the Bioinformatics and cogenes and tumour suppressor genes and then try to Statistics group in Amsterdam applied the so-called find targets for new cancer therapies.” The research is kernel convolution technique, a well known statistical based on the continuing cycle of 1) generating biologi- method often used for computational signal process- cal data and 2) computational data analysis leading to ing. “As far as we know we were the first to apply kernel new hypotheses that 3) should be verified by biologi- convolution algorithms to the analysis of biological data cal experiments. In this way the underlying molecular sets,” Reinders says. mechanisms in oncogenesis are gradually revealed, The first biological question that should be answered is: step-by-step. which regions have been hit by viral insertions in multi- ple independent tumours significantly more frequently Biological experiments Cancer genes, the than expected at random? These so-called Common genes that initiate cancer, can be subdivided into two Insertion Sites, or CISs, indicate the location of putative main players: oncogenes and tumour suppressor genes. cancer genes because, in the vicinity of CISs, the mu- Mutation in either of them or in both may increase the tation may alter the expression of genes. When the af- chance that the cell will grow uncontrollably and hence fected gene is a cancer gene, its expression can cause develop into a tumour. Gene expression profiling of uncontrolled proliferation of cells. Eventually this may tumours has been widely used to reveal tumour charac- give rise to a tumour. Applying the kernel convolution teristics. However, when retrieved from tumour tissue framework has proven to be suitable for successfully that has undergone the many steps of tumorigenesis it detecting statistically significant CISs in insertional is hard, if not impossible, to deduce the initial events mutagenis screens. that are causal for the development of the tumour. The Detecting CISs, however, is only the beginning of the in- cause of most cancers is a genetic mutation. In fact, tended goal of finding oncogenic pathways. The compu- biologists use viruses to deliberately induce mutations tational analysis should be extended for finding coop- in the DNA of model organisms to study which genes erating cancer genes. Jeroen de Ridder, PhD student at in the DNA need to be mutated in order to acquire a both Reinders’ and Wessels’ research groups, explains: tumour. “Therefore, retroviral insertion mutagenesis, “To find cooperation between virally targeted genes, we whereby the event initiating the tumour can be recov- proposed a two dimensional co-occurrence space and SECTION interface 6 Coverstory Issue 1 | 2008 we developed the 2D kernel convolution algorithm.” cooperation and exclusions. However, the oncogenic Subsequently two dimensional analysis of insertional capacity of these candidate genes still needed further mutagenesis screens resulted in detecting statistically experimental validation; in particular, larger data sets significant co-mutations, the so-called Common Co- are necessary in order to obtain not only theoretical occurrence of Insertions (CCIs). Since oncogeneses may confidence, but also significant biological knowledge.” cooperate with several members of a parallel pathway, Since the proof of the pudding is in the eating, extensive the data engineers combined the co-occurrence data validation experiments are currently underway. The in- with gene family information to find significant coop- sertional mutagenesis project is a joint effort between eration between oncogenes and families of genes. “As a Professor Anton Berns’ group from the Department of proof of principle we show for instance the well-known Molecular Genetics at the NKI and Professor Maarten interchangeable cooperation of Myc insertions with in- van Lohuizen’s group from the Wellcome Trust Sanger sertions of Pim family,” explains De Ridder. Institute, who performed the ‘wet’ laboratory experi- After having detected CISs and CCIs, the next step ments and a portion of the computational analyses. comprises detecting association between viral triggers The kernel convolution analyses were performed by and downstream effects. To this end it is necessary to Reinders’ and Wessels’ groups. couple the insertion data with gene expression data. The joint efforts resulted in a recent publication in Cell Consequently, tumours induced by retroviral insertional (see reference 1). The research is based on insertional mutagenesis have been expression profiled and the re- mutagenesis screens in mice, with either a p53 or p19ARF sults from the kernel convolution algorithm have been deficiency, and wild mice. The aim of the study was to used for integrally analyzing the combination of inser- identify genes that collaborate with loss of either p53 tion data and expression data from the same tumour. or p19 ARF. This study consisted of over 500 tumours The bioinformaticians incorporated Boolean AND and from which no fewer than 10,000 insertions were re- XOR in the 2D kernel algorithm. Boolean ‘AND’ mod- trieved, which represents a six-fold increase in screen els the behaviour of a downstream effect that is only size in respect to the largest previous study. This scale expected to be observed when both loci are hit in the of analyses has allowed identification of a high number same tumour. Alternatively, the mutually exclusive loci of new candidate cancer genes and their collaborative are combined using the Boolean XOR in which case any networks (see figure). of the loci may be hit to produce a downstream effect. With regard to the ultimate goal of unravelling molecu- Finally, the researchers showed that the kernel convo- lar oncogenic pathways, Wessels formulates a clear lution framework has a wide applicability and success- concluding response: “To improve cancer treatments fully employed it to analyse aCGH data. The algorithm we need a better understanding of how normal cells had to be tailored to this data since smoothing aCGH work and what goes wrong when they turn into cancer data is different from smoothing insertional mutagene- cells. Discoveries made in the laboratory are the keys sis data and the probes on the array have a non-uniform that can unlock new therapies.” However, at the same distribution across the genome. As a consequence, the time he warns: “Translating new knowledge into proto- adapted algorithm (kernel regression) proved to result type therapies and carrying out clinical trials has still in identification of known and putative cancer genes. got a long way to go, with many, many steps.”

Proof of the pudding Wessels adds: “So far, we have seen that the developed algorithms and ap- References plied biological experiments result in the identifica- 1.  Uren, A.G., Kool, J. et al. (2008) Large-scale muta- tion of many candidate cancer genes as well as mutual genesis in p19 - and p53- deficient mice identifies cancer genes ant their collaborative networks. Cell 133, 727-741 2.  De Ridder, J. et al. (2006) Detecting statistically significant common insertions sites in retroviral insertion mutagenis screens. Plos Computational biology 2, 12, e166. 3.  De Ridder et al. (2007) Co-occurrence analysis of insertional mutagenesis data reveals cooperat- ing oncogenes. Bioinformatics, vol. 23, no. 13, pp. i133-41. 4.  De Ridder, J. et al. (2008) Inference of logic networks from insertion and expression data, in preparation. 5.  Klijn, C. et al (2008) Identification of cancer genes using statistical framework for multiexperiment analysis of undiscretized array CGH data. Nucleic Interaction network representing the co-occur- Acids Research 36, 2, e 13 rence and mutual exclusivity of the 20 most sig- nificant common insertion sites CI( Ss). Co-occur- ring CISs are connected by green lines and mutual exclusive CISs are connected with red lines. The thicker the lines, the higher the probability of interaction.

Source: Cell 133, 727-741, May 16, 2008 SECTION interface Interview Issue 1 | 2008 7

INTERVIEW WITH Carole Goble Computer scientist Carole Goble makes BY Marga van Zundert clever tools to link together all kinds of bioinformatics “The Dutch have the potential to progress in leaps and bounds” SECTION interface 8 Interview Issue 1 | 2008

omputer scientist Carole Goble of “Bioinformaticians don’t want Manchester University designs soft- ware for bioinformaticians. Tools that miracle answers, they want allow them to link together all kinds tools that enable them to find oCf datasets and algorithms to discover inter- esting genes, proteins and biomarkers. She the answers themselves.” involves bioinformaticians from the start in every project. “We have built it for the typical user and that is the premise we follow.” Nowa- days astronomers and musicians also use her What aspect of your job do you like the most? clever tools. The creativity. The ability to say: let’s try to do some- thing novel and interesting. Where does your interest in bioinformatics come from? And what is the hardest part? I had worked in medical informatics for a long while I feel a constant sense of guilt about failing. There sim- when one day in the early 1990s a Manchester bioinfor- ply isn’t enough time to meet all the demands and com- matican called Andy Brass turned up in my office. He mitments. Last week, for example, we were referenced had a data integration problem and asked for my help. It in Nature and you would think ‘Wow, that’s great’, but turned out to be a long and very happy union. Bioinfor- the rest of the day I felt guilty because I had not yet fin- maticians are far more interested in doing real science ished a bundle of reviews. The balance has gone a bit in novel technologies. In the first years, Andy tried to wrong. teach me all about our DNA and these little protein ma- chines called ribosomes. Fascinating but I truly couldn’t What part of your work are you the most proud of? retain this information. We’ve agreed that my role is to Taverna and myExperiment, because they are really build software; his role is to use it. being used. In my very first bioinformatics project, we built a system that went to different databases to solve You are probably best known among bioinformati- all kinds of problems in bioinformatics. When you input- cians as the builder of the Taverna workbench. What ted your question, the answer came back miraculous- can one find there? ly. It was clever computer science and we wrote many Bioinformatics is basically all about linking databases, papers about it, but no bioinformaticians ever used it. algorithms and analytical tools together. Many times Why? Because bioinformaticians don’t want miracle an- these originate from different sources and weren’t swers, they want tools that enable them to link together designed to be plugged together. They differ in the for- datasets to find out the answers themselves. We solved mats they use. Taverna allows bioinformaticians to link the wrong problem. over three thousand datasets, algorithms and other In all my later projects bioinformaticians were involved services together by taking care of all the inter-opera- right from the start. Now people who I have never met tional processes needing a flow of data between them. come up to me at conferences and say, ‘I used your Furthermore, Taverna keeps an automated record of tools to do my PhD, may I have my photograph taken what you have done in a workflow. So you know exactly with you?’ That is an incredible recognition, particularly were the data came from and which version of which because we have been through a lot of battles to create software you have used. These workflows can be ex- these services. In Taverna we did web services at a time changed with colleagues and peers all over the world when that was considered a bad thing. And we were told with myExperiment. to do portals for myExperiment, but we choose a Face- book type application. And do they exchange workflows? Actually, when designing myExperiment we assumed What is your favourite example of the use of that people would be very reluctant to share. We built a Taverna? complete framework that allows a workflow author to One my favourites is the work by Paul Fisher, a Ph.D. in precisely control if and who may look at the workflow my group, on the resistance of cattle towards sleepi- and who may download it. If someone allows reuse, he ness sickness. Paul wanted to know why certain cat- or she builds up a credit and reputation catalogue. You tle are resistant to the parasite Trypanosoma and why can see how many times a workflow has been viewed others are not. He generated a set of workflows for this and has been downloaded. You can even track who problem, which include statistical analysis, microarray downloaded it and the original author can put the cred- cluster analysis, text-mining and linked all these to- its on his or her CV. Reputation, credit, fame, that is gether. Through this he discovered a gene that is impli- what scientists want. cated in the resistance. Experiments in the laboratory confirmed it and the results were published in Nucleic Acids Research1 last year. The gene had been missed SECTION interface Interview Issue 1 | 2008 9 before because biologists were triaging the data too Carole Goble early. The crucial information turned out to be in a com- pletely different area than where they had been looking. Therefore, this discovery couldn’t possibly have been done by hand, the datasets were simply too big. Luck- ily, computers don’t get tired, they don’t get bored and don’t cut corners; they just keep on going. Moreover, the same workflows turn out to apply equally well to other datasets. Joanne Pennock, a Manches- ter biologist, had a similar question on the resistance of mice to whipworm. We identified a potential candi- date resistance gene in just one afternoon. Paul and Jo met one another in a Manchester pub and that is the reason why we built myExperiment: to create a virtual pub where biologists and bioinformaticians scattered around the world can meet and exchange workflows.

Will there be spin-offs from Taverna? Actually, we are thinking about how to commercialise Taverna. At the moment, I have a bigger software en- gineering team than a research team. Taverna has two years of software engineering in it to build the proto- type into an accessible web service, but I can’t write 2008 Launch of myExperiment papers on how we rebuilt and hardened it. When you 2003 Launch of the Taverna workbench successfully develop software or an infrastructure that 2000 Full professorship at the School of Computer many people want to use for real science there is al- Science, ways the problem of sustainability. Who funds it in the 1997 Co-founder and co-director of the long term? Biology doesn’t want to fund it, because it Information Management Group isn’t biology. Computer science doesn’t want to fund 1992 First bioinformatics project with biologist it because it isn’t computer science research. We re- andy Brass ceived Open Middleware Infrastructure Institute money 1985 Member of faculty staff, School of Computer for Taverna, because the UK decided to fund an organi- Science, University of Manchester sation that hardens successful software. A very sensi- 1982 Bachelor in computer science at Manchester ble idea. Software has to have constant attention. It is University like a garden which will run wild if you don’t look after it. I believe that any organization or country who eventual- ly sorts out the long term sustainability of its key tools example, is marvellous. We work very closely with quite will win. a few Dutch people, with Gert Vriend, for example, and with Marco Roos whose knowledge driven text-mining The Dutch bioinformaticus Gert Vriend says bio- techniques are really very novel. And the work of Mor- informatics is a communication science. Do you ris Swertz is also impressive. Rather than building yet agree? another kind of tool, he built a factory for building tools. Yes, I very much agree. In fact it is communication in The Dutch are very open to novelties. What would be two dimensions. There is communication between biol- really good, though, is to encourage people to open up ogy and computer science, and between the scientific their technologies, their datasets and their tools. I can and the technical world. Bioinformaticians are medi- understand that people may hesitate to open up tools ating between these different communities. It is also that give them a critical lead, but they are using every- communication science because bioinformatics link body else’s material. Scientists really ought to behave together various datasets: proteomic data with tran- like good citizens, who not only take from the commu- scription data, with phenotypical data, with text-mining nity but also give back. environments etc. Also in that sense, bioinformatics is an exercise in mapping between different communities. NOTE 1 What do you think of the Dutch contributions to bio- Fisher P, Hedeler C, Wolstencroft K, Hulme H, Noyes H, Kemp S, Stevens R, Brass A. A systematic informatics? strategy for large-scale analysis of genotype phe- You may have been slow starters, but that allows you to notype correlations: identification of candidate genes involved in African trypanosomiasis, Nucleic jump into bioinformatics without making all the mis- Acids Research, 20 A u g u s t 20 07, p u b m e d ID = 17 70 9 3 4 4. takes others made before. You have the potential to progress in leaps and bounds. The NBIC programme, for SECTION interface 10 Progress Issue 1 | 2008

Data refinement by Robbie Joosten Approaches to improve the quality of protein structures solved in the past

rystal structures in the how well the 3D atom coordinates We set out working on the subset of protein data bank are often in a PDB file fit the X-ray data. The the PDB that had an X-ray resolution Cregarded as completely accu- Ramachandran plot Z-score2 looks of 2.0Å with a complete set of rate representations of the ‘real’ 3D at a crystal structure from a differ- experimental data. Unfortunately, structures. The tendency to over- ent perspective. It is based on our because experimental data deposi- estimate the power of experimental current knowledge of what a protein tion was not mandatory until early methods and the quality of their should look like in terms of the tor- this year, that left us with just 1200 results may introduce serious risks. sion angles of the peptide back- PDB entries. Using the experimental bone. Other tests are based on our data straight from the PDB is very As bioinformaticians, we are used to knowledge of atomic bond lengths difficult because parts of it come in working with other people’s data. It and angles, hydrogen bonds, atomic at least five different formats. is easy to forget that behind each en- packing, etc. Various fixes and reformats were try in Uniprot, the protein data bank required to make this input data (PDB), or one of the many other Thanks to these validation tests uniform. From that point on we databases used in our field lie many we can now find the strengths and could start the interesting work, days of hard lab work. In fact, many of weaknesses of a particular PDB the actual re-refinement. us cannot really remember what do- entry when we want to use it. More ing lab work was all about. We are just importantly, the validation tests can The protocol We used three happy with the results. Macromo- also be applied to check the quality types of refinement of increasing so- lecular X-ray crystallography is a good of crystal structures before they are phistication. The first one, rigid body example of this potential problem. deposited at the PDB. Problematic refinement, treats every amino acid We must bear in mind that crystal structures can then be fixed before chain in a PDB file as a rigid body structures are models that try to they reach the general public. and tries to fit this to the experimen- explain the results of an experiment. tal data. This had very little effect They are extremely useful for drug Re-refinemenT As methods in general but a few crystal struc- design, homology modelling and improve, so does the quality of the tures that were rotated or translated other forms of structural biology but new entries in the PDB. Great as this slightly with respect to their experi- they should be handled with care. may be, it does not help the existing mental data benefited a lot. After all, measurements have er- entries in the PDB. But what if we The second type of refinement rors and modelling requires (human) apply the latest methods on the entailed moving around individual interpretation of the experimental existing PDB entries? This is exactly atoms to improve the fit with the data that may introduce new errors. what we tried: a re-refinement of experimental data. Geometric This leaves us with the problem of the crystal structures in the PDB in restraints were applied to make sure picking the most reliable crystal a fully automated way. The goals of the bonds lengths and angles in the structure(s) for our research. the project were clear: find a structure remained sensible. This is refinement protocol that can be required in macromolecular crystal- Validation Fortunately, crystallo- used on all entries of the PDB and lography because the risk of over- graphers and bioinformaticians have that improves the fit of the atom fitting data is very large. created many measures to assess coordinates with the experimental The third and final type added an the quality of crystal structures. For crystallographic data without extra model for the movement of instance, the R-free1 is a measure of decreasing other validation scores. groups of atoms. In the crystal, SECTION interface Progress Issue 1 | 2008 11 atoms move around a bit which Does it work? Of the 1200 struc- References: causes noise in the measurement. tures evaluated, the fit with the data 1. Brünger, A.T. (1992) Free R value: a novel statistical quantity for This movement is modelled as the in terms of R-free could be improved assessing the accuracy of crystal so called ‘B-factor’. Because of the 78% of the time using the TLS mod- structures. Nature 355, 472-475. 2. Hooft, R.W.W., Sander, C. and way atoms are bonded and packed, els. The geometric quality of the Vriend G. (1997) Objectively judg- this movement is not the same in structures (expressed as the ing the quality of a protein struc- ture from a Ramachandran plot. every direction so ideally the Ramachandran Z-score) also im- CABIOS 13, (4), 425-430. B-factor should be anisotropic. But proved for the majority of structures. 3. http://www.cmbi.ru.nl/pdb_redo/ (The d e t a i l s o f t h e r e -r e fi n e m e n t p r o ce - again the risk of over-fitting the data You can download the re-refined d u r e a n d f u l l v a l i d a t i o n r e p o r t s) lurks around the corner. A nifty trick structures (we have now done over 4. http://www.nbic.nl/support/apps/ and http://www.nbic.nl/support/ circumvents this problem by 20,000) from our website (see refer- apps/27579/ modelling the anisotropic transla- ences). But beware: most of the re- Contact tion, libration and screw rotation refined structures can be improved Robbie Joosten M. Sc. (TLS) of large groups of atoms further by hand. Getting the most CMBI, NCMLS UMC Nijmegen (think: domain motion). The risk of from your data is still hard work! P.O. box 9101 over-fitting becomes much smaller 6500 HB Nijmegen W e b: w w w.c m b i.r u.n l this way. E-mail: [email protected]

≥ Key concepts – X-ray analysis has proven to be a power- ful tool for elucidating crystal structures of proteins. – Since the introduction of general data- bases accessible in public domains, the solved structures are usually deposited at the joint protein data bank (PDB). – Unfortunately, re-analysing the stored x-ray structures by applying present A available statistical tools revealed many anomalies. Failings vary from minor im- perfections to entirely wrong structures that should not be used for design or explanation of biological experiments. – Bioinformaticians of the Centre for Mo- lecular and Biomolecular Informatics (CMBI) from the Nijmegen Centre for Mo- lecular Sciences at the Radboud Univer- sity Medical Centre, focus research on PDB improvement by developing soft- ware tools to assess and improve the quality of structures solved in the past. – In the accompanying article the author B describes re-refinement procedures and the obtained results. The article is based on a recent Science publication Improvement of structure quality from the Nijmegen CMBI research group (R. P. Joosten and G. Vriend, PDB Im- A  Change in fit to the experimental data. ΔRfree = Rfree,org - Rfree,re-refined where Rfree,org is taken from the provement starts with data disposition, PDB file header andR free,re-refined the calculated fit to the Science, vol 317, 13 July, 2007 p 195) data after re-refinement withTL S models. Positive values denote an improved fit. –  Of the 1200 structures evaluated, the fit with the data proved to be improved 78% B Distribution of Ramachandran Z-scores for 1170 PDB entries (the other contained no protein) before and after re-refine- of the time. The details of the re-refine- ment with TLS models. Higher values are better. ment procedure and full validation reports are available on http://www.cmbi.ru.nl/pdb_redo/.  By the editors SECTION interface 12 Progress Issue 1 | 2008 v Visualizing biological data by Michel Westenberg Mapping dynamic data by graphs and colors

isualization provides a way to consists of a gene expression value ratio is above some positive thresh- understand large amounts of and an associated statistical value, old and, in the second category, it Vdata. It facilitates understand- a way to simultaneously visualize is active if this ratio is just positive. ing of the data on various scales. In these attributes is required. A graph- In the last category, regulators are this way features may be revealed ical object that conveys multiple considered active always. The sec- that were not anticipated. data attributes is called a glyph. ond step marks non-regulator genes In designing a glyph, it is important active if they show a strong change DNA microarrays are used to meas- to have interesting data features in expression level. The final step ure the expression levels of thou- stand out to facilitate visual identi- marks all edges active that connect sands of genes simultaneously. To fication. A glyph that works well is a two active nodes. In the visualization, study dynamical biological proc- coloured rectangle, where the col- the active sub-network is highlighted esses, time series experiments are our is determined by the expression by an increased border width and all conducted, in which gene expres- value and the height by the reliability inactive nodes and edges are made sions are measured as a function of value: the more reliable, the higher semi-transparent. We draw the active time. For analysis, it is necessary to the rectangle. To support analysis of network in the context of the com- link the measurements at the tran- the network at a single time point, plete network to support the user in scriptional level (gene expression the second task, the gene expres- maintaining a mental map of the en- approximated from microarray data) sion value can also be used as a fill tire network. to observable characteristics of an colour for the gene box. organism at the functional level. Metabolic pathways Visualization Traditional analysis methods make Active substructures The of the time series at the transcription this integration intricate, since they transcription network contains many level facilitates the identification of are not only laborious, but also error nodes and edges which makes it dif- regulons (genes under regulation of prone. Information visualization, on ficult to understand which substruc- the same protein) within the active the other hand, elevates the power of tures of the network are active and network. However, it is also inter- the human visual system to obtain in- inactive over time. Therefore, the time esting to explore the time series at a sight in such abstract data by the use series data are also used to detect higher level, namely, that of the met- of interactive visual representations. active substructures in the regulatory abolic pathway. A well-established A visual exploration of microarray network algorithmically. This algo- annotation system of pathways is data from a time series experiment rithm proceeds as follows. The first provided by the Kyoto Encyclopaedia involves two main tasks. Firstly, the step finds the active regulators by in- of Genes and Genomes (KEGG), which detection of global trends and sec- specting the expression level and the also provides pathway drawings. For ondly, detailed comparison of gene change in expression level – the ex- each pathway, a statistical test (Fish- expressions for a number of genes pression ratio– at a given time point. er’s exact test) is performed to obtain at specific time points. The first task Expression levels are distinguished the probability that an enrichment can be performed efficiently by look- in three categories: low, medium, of active genes in that pathway can ing at the entire time series. As in and high. In the first category, a be attributed to chance. Pathways in each time point the measurement regulator is active if the expression which this probability is below a cer- SECTION interface Progress Issue 1 | 2008 13 v

tain threshold are considered sig- proach and application both facili- 2. Westenberg, M.A., Van Hijum, S. A. F. T. , Kuipers, O. P. and Roer- nificant. Pathway visualization con- tates and speeds up data analysis dink, J. B. T. M. (2008)Visualizing sists of superimposing gene boxes tremendously in comparison to the Genome Expression and Regulatory Network Dynamics in Genomic and containing time series glyphs onto more traditional list-based approach. Metabolic Context Computer Graph- a KEGG pathway drawing. We plan to address dynamic visuali- i c s F o r u m, 27, (3), in press. 3. Chang, D. E., Smalley, D. J. and The various visualizations are pro- zation of pathways in future work. Conway, T. (2002) Gene expres- vided as linked views by the soft- This will provide more freedom in sion profiling ofE scherichia coli growth transitions: an expanded ware, which also has a rich set of drawing a pathway, in combining a stringent response model. Molecu- user interaction techniques to sup- number of pathway drawings into one l a r M icrobiology 45, (2), 289-306. 4. http://www.nbic.nl/support/apps/ port data exploration. This is de- picture and in visualizing the actual and http://www.nbic.nl/support/ scribed in more detail in a research network of metabolic compounds apps/27351/

paper (see reference 2), which also and associated chemical reactions. Contact contains a case study in which do- Dr. Michel A. Westenberg Institute for Mathematics and main experts perform an analysis Computing Science of a time series experiment involv- References University of Groningen P.O. Box 407 ing the E. coli bacterium. The case 1. Ware, C. (2004) Information Visu- alization: Perception for Design, 9700 AK Groningen study demonstrates that our ap- Morgan Kaufmann Publishers, San Web: http://www.cs.rug.nl/~michel Francisco, USA, 2nd edition. E-mail: [email protected]

≥ Key concepts – Modern DNA analysis tools generate huge volumes of data and may lead to information overload. This is challeng- ing an increasing number of scientists attempting to glean relevant information from such data collections. A – To meet the need for improved visualiza- tion of biological data, bioinformaticians are researching a variety of updated ap- proaches that could enhance the abil- ity to rapidly collect, share, and analyze crucial information. Visualization of bio- logical data by graphical representation A of information is a rapidly growing field B within bioinformatics. – The research group Scientific Visualiza- tion and Computer Graphics (part of the Department of Mathematics and Com- puting Science of the University of Gron- ingen) focuses strongly on computational science and visualization research in the life sciences, both fundamental and ap- plied. –  The author describes a promising ap- B proach to translate huge data tables generated from genome expression time

series experiments into graph-based A Visualization of the transcription network of E. c o l i and a and colour-based representations. 17-points time series (data from [3]). During the experiment, the bacterium is grown on a mixture of glucose and lactose. These are used to elucidate regulatory It grows preferentially on glucose, until that energy source networks and metabolic pathways. is depleted. Cell growth then stops, while the organism ad- justs to growing on lactose. The active subnetwork at that – The described technique is demonstrat- time is shown above. Red colors indicate upregulation, and ed by visualization of the transcription green colors indicate downregulation. The shift to a differ- ent energy source causes upregulation of the LacI regulon, network of E. coli genome illustrating a which is shown in more detail in the inset. growth experiment. B Many genes active at the time of the shift to growing on lac- By the editors tose are part of the ribosome pathway (the protein factory). Most members of this pathway are downregulated, because cell growth has stopped. SECTION interface 14 Hands on Issue 1 | 2008

Vbyisu Laililizingan Vermeer biological data accelerates life scientist’s research

Life sciences research questions which take three years of computing time can be answered now within a week, when handled on the Life Science Grid (LS-Grid). Maurice Bouwhuis from SARA (Amsterdam) and Victor de Jager (NBIC, Nijmegen) explain what Grid is. PhD student Miaomiao Zhou illustrates how the LS-Grid helped in speeding up her research on bacteria.

The Grid seems to be the next leap Amsterdam. “Taking the concept computer facility. To log in, a ‘pass- in computer interconnectivity. In from the Grid we started rolling out port’ is needed in order to be recog- 2006 its power was demonstrated the Dutch Life Science Grid nized as a valid user. by a striking example: a collabora- (LS-Grid).” The choice for many small nodes tion of many laboratories analyzed The project is financed by the 28 instead of one big data and comput- 300,000 possible drug components million euro BIG GRID1 project ing centre had a practical reason. against the avian flu virus using aimed at developing a nation-wide Bouwhuis explains: “Local small 2000 computers for four weeks. In grid infrastructure to enable e- nodes can be attached to local fa- order to compute simultaneously, in science. “The LS-Grid is actually a cilities. For example, if a sequencer parallel, on the sometimes thou- set of computer nodes, situated at or an fMRI scanner is present it can sands of computers which make up different life sciences departments be attached to the LS-Grid no mat- a Grid, a special kind of software, and medical hospitals,” explains ter where it is physically located. ‘middleware’, was developed. Bouwhuis. Each node consists of a Instead of having to move all these “Today there are many different cluster of computers and equipment facilities to one big centre, one can disciplines that need storage space for data storage. All the nodes are create a local node attached to the and computer power, and one of connected by a high speed network. LS-Grid. By logging in on the Grid them is the Life Sciences,” says The people who are logged in on the these machines can be used by all Maurice Bouwhuis from SARA com- LS-Grid have access to all the nodes the users throughout the Nether- puting and networking services in simultaneously and to the central lands for generating research data.” SECTION interface Hands on Issue 1 | 2008 15

DEVELOPER Maurice Bouwhuis on a local server. When the results ing job to a free processor. As soon leads a small group at SARA called were satisfactory, de Jager, togeth- as a node is too busy, the job will be ‘e-science support’. Their aim is to er with the e-science-team, con- redirected to another free proces- work together with (bio)scientists to verted it in such a way that it could sor. Initially, for Zhou these proc- solve their enhanced science (e-sci- run in a standard way on all the grid esses were happening in a black box ence) problems. “We want to find out nodes in parallel. because she couldn’t see how her what they need, to build interfaces “Some tools which are installed computing experiment was doing for them and to transfer expertise.” at the node in Nijmegen must be and where it was being processed. Victor de Jager works for the NBIC added to Zhou’s software package Together with the e-science support BioAssist program and discusses because they may not be installed team, de Jager provided a solution the ICT needs of life science at other nodes,” explains de Jager. by writing an extra interface to track researchers with the e-science “This added tool-package enables the job and give information about support group from SARA. He ad- making use of all computing capac- its status. vised bioinformatics PhD student ity in a flexible way.” Miaomiao Zhou to use the LS-Grid “Compare it to a backpack with food “We aim to fill the when she showed him the size of and equipment, so you can survive the calculations and comparisons in enemy territory and not have to gap between avail- she wanted to do. He assisted her in rely on any local resources,” says able ICT support and getting her software package ready Bouwhuis. At every node (cluster of for Grid computing. Zhou wrote the computers) middleware has been what researchers initial software and first tested it installed which directs a comput- actually need”

USER Miaomiao Zhou is a PhD 4000 protein sequences. To search ones she created a pipeline of eight student in Professor Roland with one protein of 100 amino acids different SCL-predictors, called Siezen’s group (bacterial genomics, on the whole Pfam database would LocateP. Zhou now uses LocateP CMBI, Nijmegen), and one of their take around an hour. To do a com- to predict the SCL of the proteins research interests is to study the plete search on Pfam on the 300 ge- encoded by completed bacterial genomes of food-relevant bacteria. nomes on a standard server would genomes. She also uses the LS-Grid “We focus on the genomes of bac- take three years. By using the LS- for this calculation. “To analyze the teria in dairy food. By looking at the Grid I was able to use 300 different 300 genomes with LocateP takes sequences we hope to explain the machines/servers at the same time one month on a single server and functional differences of bacterial doing these searches. Now - at least by using the LS-Grid it could take proteins,” Zhou says. in theory - I have to wait one week only one day if enough resources are The protein encoding parts of the and I have all the data. That makes a available. At present we are work- genomes are analyzed for similarity world of difference!” ing on combining Pfam searches with known proteins to find pos- The other part in Zhou’s research with my LocateP pipeline, but this is sible functions. One of the simi- consists of the prediction of subcel- not ready yet.” larity searches Zhou performed is lular localization (SCL) of bacterial with the Pfam protein database. At proteins based on the sequence “One hour or three that point she encountered for the features that she can detect. That is first time the limitations of using important since the protein function years, that makes a only a single server. “I already had is usually related to its localization. world of difference” genomes of 300 bacteria and each Zhou developed some new SCL pre- genome contains about 3000 to dictors2 and together with existing

Notes 1 The three requesters of this 2 Zhou, M., Boekhorst, J., p r o je c t a r e n c f (N a t i o n a l C o m - Fr a n c k e, C. a n d S i e z e n, puter Facility), NIKHEF (Na- R.J.(2008) LocateP: genome-scale tional institute for subatomic subcellular-location predic- physics) and NBIC (Netherlands tor for bacterial proteins. BMC Bioinformatics Centre). Bioinformatics,9:173. SECTION interface 16 Thesis Issue 1 | 2008

By Bastienne Wentzel Dynamic software infrastructures for the life sciences

The Groningen Bioinformatics Centre (GBiC) is a joint initiative of the faculty of Mathematics and Natural Sciences and the University Medical Centre Groningen UMCG. The Bioinformatics Centre , led by Prof. Ritsert Jansen, is special- ized in genetical genomics. Genetical genomics is a new strategy to study the effect that genetic variations may have on individuals, such as diseases in animals or humans, or what networks of genes are switched on in plants. Researchers from diverse disciplinaes work to- gether to develop novel analyti- Centre for information technology at University of Groningen cal methods, models, tools to unravel how biological systems work and, like Morris Swertz, to magine life scientists having to write about forty develop novel methods and tools times fewer lines of code to produce software to produce and evolve software I for their data analysis system. This is what Morris infrastructures that advance the Swertz from the Groningen Bioinformatics Centre has life sciences. accomplished with his MOLGENIS software generator. He recently received his PhD degree for his research.

Progress in life sciences research has produced enor- but such a system will grow very large very quickly. Larg- mous amounts of data but the production of suitable er systems are exponentially more difficult and time- software infrastructure to manage and process these consuming to maintain or update to new research.” data has not been able to keep up. Part of the problem is that each research group has their own needs for bio- Blueprint Therefore, Swertz has devised a mini- technologies, protocols and specific species on which mal programming language to quickly assemble these research is done. Usually, bioinformaticians write much reusable blocks of code, a so-called Domain Specific software codes by hand to adapt existing software in- Language (DSL). The ‘domain’ in his case is biology. The frastructures to local needs. ‘specific’ part means standardisation to make a number “Software developers have a luxury problem,” says of operations easy. This may be adding a ‘sequence’ Swertz. “Virtually anything you want can be changed and input box on the user interface or a database query for adapted. However, doing this is extremely time-consum- new types of biomolecular data. A bioinformatician set- ing.” Inevitably, the wheel is reinvented over and over ting up a system for a proteomics experiment for exam- again, he says. “It is fairly easy to copy reusable blocks ple only has to tell the system, using the DSL, what data of code from earlier systems into your own new program types, data relations, screens and queries he needs for SECTION interface Thesis Issue 1 | 2008 17 this type of experiment without having to ‘tinker under access next to graphical user interfaces, bioinformati- the hood’; how biological features translate into soft- cians will be able to answer questions on many genes in ware code is too much detail. “It is the blueprint of your much less time.” system,” says Swertz. Of course, these short DSL sentences need to be trans- Beyond borders Swertz is looking beyond the lated into a code the computer understands. Swertz borders of The Netherlands in more than one way. The devised a computer programme called a generator Groningen Bioinformatics Centre where Swertz now which can do this automatically. His generator, MOL- works as a post-doc cooperates with Carol Goble’s GENIS, quickly translates the high-level DSL instruc- leading eScience group at the University of Manchester. tions from the bioinformatician into low-level instruc- MOLGENIS is not the only software generator for bio- tions in Java, the language his systems uses. “The big informatics. A number of other generators exist, each advantage is that the code generated does not require with its own specialty. MOLGENIS is specifically aimed maintenance by hand. Should you wish to change the at data management. The Manchester group devised system you only need to change the blueprint and run the Taverna system aimed at processing. Taverna has the generator again.” a DSL to describe a workflow of operations by clicking on blocks of existing ‘processors’ and linking input and Synergy Swertz had already made a prototype of output together. “Integrating MOLGENIS and Taverna MOLGENIS before he even started as a PhD student in would give researchers these two functionalities in Groningen. He ran a company that sells tools for creat- one,” Swertz explains. ing web applications. MOLGENIS started as such a tool. Additionally, he works on the standardization of blue- The department of Molecular Genetics (MolGen) at prints in EU-CASIMIR. If we could set up a number of Groningen University asked him to create a system to standard blueprints, for example a data model for trace microarray experiments. Following its success, genetical genomics, as we are doing with EU-CASIMIR, the newly formed Groningen Bioinformatics Centre research groups would not only be able to communicate decided to offer him a PhD position. “The bioinformat- on a technical level but also actually understand each ics world is very dynamic, completely different from the other.” insurance world I was used to from my MSc; a kind of Wild West. I found it a lot more challenging and decided to stay.” The first MOLGENIS system was relatively simple. The Name: Morris Swertz MolGen lab needed to perform microarray experiments. U n i v e r s i t y : University of Groningen The first part involved the design and production of PROMOTORS:  Prof. Dr. R.C. microarray slides. The second part involved extrac- Jansen, Prof. dr. E.O. d e B r o c k tion of RNA from their bacteria, adding a colour label THESIS: Dynamic software and applying these to the microarray. Finally, a scan is infrastructures for t h e l if e s c i e n c e s made of the slides and the contrast in colour analysed, indicating what genes are switched on or off. Swertz’ P h D o b t a i n e d o n 1 5 F e b r u a r y 20 0 8 system enabled the administration and tracing of all these steps in the lab. “The challenge was that they had Additional information just started designing their experiments while we were - www.molgenis.org already building their data management system. It was - www.rug.nl/gbic - Swertz, M and Jansen, R.C. a great synergy.” (20 07) B eyond standardization: Since the first project in 2002, the MOLGENIS software d y n a m i c s o f t w a r e i n f r a s t r u c- tures for systems biology, has been technically completely re-engineered to sup- Nature Reviews Genetics 8, port new types of experiments. “The great thing is that 235-243. - Swertz, M. et al (2004) it is very easy for life scientists to update their systems molecular Genetics Information with new software tools. Just run the updated genera- System (MOLGENIS): alternatives in developing local experimen- tor with an existing blueprint to have the tools add- tal genomics databases, ed,” explains Swertz. New additions to the MOLGENIS Bioinformatics 20, 2075-8 3. toolkit include the ability to handle much larger data- sets and to retrieve data using the R statistics-suite. He even works on a MOLGENIS version which not only generates new databases for new types of experiments but can also automatically import existing databases to share their data in a standardized way. The request came from EU-CASIMIR, the European consortium of mouse genetics researchers. “CASIMIR wanted to open up existing mouse internet databases for integration on a large scale. By generating programmatic ‘web service’ SECTION interface 18 Course Issue 1 | 2008

By Astrid van de Graaf Getting started with web services and workflows

t the end of last year bio- Taverna tool had already been on his are available for this on the internet. informaticians were sub- task list for quite a while, but it had Taverna connects you automatically Amerged in the ins and outs no high priority. In the daily routine with these web services and it facili- of web services and workflow tools at work there were always more im- tates the analytical process. “With during a three days NBIC-course portant problems to solve. Meijer: Taverna you can put all this informa- at the Wageningen University. Cas “The workshop was the opportunity tion together and annotate and ana- Meijer, bioinformatician at Orga- to fully concentrate on the workflow lyse your data sets step-by-step to non, was one the participants. He tools and web services and learn find the answer to your question” discovered the use of Soaplab, a how to use them. Before I can offer The art is to find the right web ser- Sesame door for analytical tools. it the scientist as a facilitating tool vices to solve the analytical prob- for their research, I myself first have lem. “So it was really an eye-opener “On the European Bioinformatics to learn how it works and test it. The that dictionaries such as Soaplab Institute mailing list I followed the workshop fulfilled these needs and exist,” says a relieved Meijer. “In discussion about Taverna, a work- was very useful.” these dictionaries you can look up flow management tool that is freely the web services. It guides you to available,” explains bioinformati- At Organon, Meijer is responsible the databanks for gene identifica- cian Cas Meijer of the department for the infrastructure for microar- tions and also specifies the input of Molecular Design & Information ray data analyse. Since microarray needed and the output to be at Organon, part of the Schering- experiments generate an enormous expected. That makes life a lot Plough Corporation. “The same amount of data, workflow tools easier. The drawback is that not all names always answered the ques- that help to compare and analyse the dictionaries are that good yet. tions. When I saw the announce- the gene expression profiles of dif- Some web services have already ment of the NBIC-workshop, I rec- ferent treated cells are more than disappeared or just moved because ognised the names from the list welcome. It is very important to give the URL has changed and the web and was curious about the faces a biological interpretation to the services became untraceable again. belonging to those names.” results. Meijer explains: “A stand- But Soaplab was a real discovery.” Of course this was not the main ard way of describing the molecular reason for Meijer to subscribe to function of gene products is called the course. The introduction of the gene ontology. All sorts of databases

NBIC Practical Course on Web Ser­ formatics. The next course on ‘web services vices and Workflow Management Participants are required to have a and workflow management’ will The aim of this course is to pro- good working knowledge (hands-on be presented in Autumn 2008. The vide the participants with a work- experience / “fluency”) in Java or focus then will shift slightly onto ing knowledge of web services and Perl (optionally Python). ‘mash ups’, REST and workflow re- workflow management tools. There- The course is organised by Jack positories like MyExperiment. fore the course offers a number of Leunissen, Harm Nijveen and Pieter lectures followed by hands-on ses- Neerincx from Wageningen Universi- More information sions. Each day will be closed with a ty Research and Maurice Bouwhuis Course programme: scientific talk on a practical real-life from SARA computing and network- www.bioinformatics.nl/events/ application of web services in bioin- ing services. webservices/ SECTION interface Portrait Issue 1 | 2008 19 Eight questions for Jayne Hehir-Kwa BY BASTIENNE WENTZEL

Name: Jayne Hehir-Kwa Date of Birth: 4th July 1977 Place of Birth: Narrogin, Western Australia Nationality:  Australian, Dutch Resident for 9 years Status: married Study: computer and Mathematical Sciences, Australia Career: allshare, PhD at Radboud UMC, Nijmegen, Hobbies: music; the trombone and the trumpet

1 Who is Jayne Hehir-Kwa? 5 How does your work fit in the research of the Human I come from a small farming town in Western Australia. I Genetics Department? studied Computer and Mathematical Sciences in Perth I am one of four bioinformaticians who all work in the because I was curious about computers and was fasci- microarray facility. My work is focused on biological nated by them. I was offered a job with the Dutch soft- problems, and not informatics. As I work in the biol- ware development company Allshare in the Netherlands ogy department, this comes with the territory. Part of so I moved here. At the back of my mind I always wanted my job is to offer support for the microarray facility, to do a PhD so after a few years I applied for a job with streamlining all data that comes from patients and de- the Radboud University in Nijmegen, starting as a sci- termining the best way to analyse data from the micro- entific programmer for the microarray facility in the Hu- arrays. I then siphon off the interesting results for my man Genetics Department and then slowly evolving to a own research. research project which officially started on January 1st. 6 With what kind of researchers or specialists do you 2 What is your definition of bioinformatics? cooperate especially? It is using computers to solve biological problems. I Within the microarray facility, some bioinformaticians don’t think it is the other way around just yet. I mean bi- are specialized in the expression arrays and I am spe- ology can’t solve computer problems yet. cialized in the genomics arrays. I regularly cooperate with biologists and clinical geneticists. Everyone is gen- 3 What competences do you need to have as a bioin­ erally quite open in this group. Cooperation with them formatician? runs quite well. I like investigating things, I am a problem solver. I also like the applied research, being close to the diagnostics. 7 You chair the RSG Netherlands, a network of PhD- But the computer programming part is the part I enjoy students. What kind of network does it concern and most. A bioinformatician needs to be flexible. Commu- why do you participate? nication is also important. I have to be able to talk with The International Society of Computational Biology the biologists, I have to make sure they know what I do (ISCB) has a student council which has regional student and that I understand what they want from me. groups (RSG) throughout the world. I am chairman of the Dutch group. Some bioinformatics PhD students not 4 About your research: what do you expect to discover working in dedicated bioinformatics groups can feel very or deliver? isolated. I was one of them. We organize events for these I specialize in genomic microarrays. A patient may have students to bring them together and set up a network. a deletion or duplication on their chromosome which Lots of people, including the NBIC, recognize the need causes mental retardation, but we discovered that for such a network and support our efforts. Since the healthy people may also have these. I try to determine start in March we have organized a well received course the difference between benign variations which do not and lunch meeting. We are planning more social events have any consequences and those which cause illness. and we have a hyves site (bioinformatics.hyves.nl). The advantage of working within the hospital is that we have access to a vast cohort of patients. We hope to run 8 With which message would you like to finish this a thousand patients this year. portrait of a ‘bioinformatician in practice’? A general misconception is that all bioinformaticians are nerds that can’t be understood. I’m afraid that some- times this is true, although it shouldn’t have to be. SECTION interface 20 Spin off Issue 1 | 2008

Putting bioinformatics on the market By Lilian Vermeer InteRNA Genomics offers customized tools for analyzing microRNAs

hrough the years Edwin Cup- time to actually start the company. these very short sequences is com- pen and Eugene Berezikov But then the NBIC Venture Chal- plicated. Due to their small size they Thave gained a lot of knowl- lenge Workshop initiative came will fit many places in a vertebrate edge about small RNAs and bioin- along. With this initiative the NBIC genome.” For the discovery and formatical approaches for RNA data intends to stimulate the valorisa- analysis of small RNA it is important analysis. It resulted in their start-up tion of bioinformatics knowledge to know if RNA hairpins are present, company InteRNA Genomics, spe- present among scientists/bioinfor- how conserved the sequence is and cialized in converting raw sequenc- maticians. “We were asked to par- whether it contain SNPs. ing data into interpretable formats. ticipate and that made us decide to Another important research ques- spend more time on our business tion is expression profiling of small In 2006 Edwin Cuppen, PhD and pro- plans,” says Cuppen. In the work- RNA: how do expression levels fessor of Genome Biology, and Eu- shops the participating teams were change under different circum- gene Berezikov, PhD, both from the offered handles to refine and evalu- stances? For example, levels may Hubrecht Institute, had already set ate their business ideas under the be different in samples from cancer up a company called InteRNA Tech- guidance of experienced coaches. patients, compared to healthy indi- nologies, which aims at developing Cuppen and Berezikov were award- viduals. “We have accumulated all microRNAs (a type of small regu- ed 30,000 Euros, which they used the tools for these types of analy- latory RNA) as a therapeutic and to actually set up a new business sis and, depending on the research diagnostic tool. In their research on called InteRNA Genomics in 2007. question, we can construct a cus- small RNAs, Cuppen and Berezikov The new service company, which is tomized pipeline for every individual designed several (bioinformatical) distinct from the already mentioned project,” Cuppen says. “Although tools to analyze RNA data generated InteRNA Technologies, offers bioin- the essential tools are not novel, by small RNA sequencing efforts formatical solutions for analysis of and other researchers with the and to pick up new microRNA family small RNA. proper biological and informatics members. New ultra-high-through- background would be able to build put sequencing technologies are Customized pipeline such a pipeline, it would cost a lot of perfectly suited for small RNA Small RNA consists mostly of time and money. Researchers from analysis but the enormous increase around 20 -23 nucleotides. “The academic institutions as well as in the amount of data generated by generation of RNA sequencing data commercial organizations can have these next generation sequencing is not a big problem because it can their raw sequencing data analyzed techniques requires different tools be outsourced to sequencing serv- by us, so that they save time and and a dedicated computational in- ice providers,” says Cuppen. “How- can concentrate on their core bio- frastructure. Gradually more and ever, computational analysis of logical research question.” more people from academic institu- tions came to Cuppen and Berezikov with questions about their system InteRNA Genomics B.V., a subsidiary of InteRNA Technologies, is for small RNA analysis. co-founded by Eugene Berezikov, PhD, and Edwin Cuppen, PhD, lead- At that time Cuppen and Berezikov ing scientists in ultra high-througphut sequencing data genera- tion and analysis. Roel Schaapveld is COO (Chief Operating Offic- already had in mind the idea of set- er) at InteRNA Technologies. He manages the company’s day-to-day ting up a company that could help operations and develops the company’s business strategy. researchers with these questions. More information: However, their very busy academ- www.interna-technologies.com www.interna-genomics.com ic jobs made it difficult to find the SECTION interface In the picture Issue 1 | 2008 21

news NBIC Consortium Meeting 2008

Each year the Netherlands Bioinfor- standards. Christian Conrad (EMBL, developed in a market-like setting. matics Centre organizes a consor- Heidelberg) presented in his lecture A jury nominated three applications tium meeting for the scientists who a fully automated platform for for the final: a plenary presentation cooperate in the NBIC programmes: genome wide RNAi screens in live in front of the conference audience. BioRange (research), BioAssist (sup- cell-based assays. The final lecture The audience voted for the prize port) and BioWise (education). This was given by Piero Carninci from winner: Thomas Kelder (University year, the meeting was organised as Japan (RIKEN, Genome Science of Maastricht) and his tool for com- a two-day conference in ‘De Werelt’ Laboratory). He showed how the munity-based pathway creation, conference hotel in Lunteren, where complexity of transcriptome data visualization and online collabora- 175 researchers with a background can be analyzed by high-throughput tion: www.wikipathways.org. Prizes in bioinformatics, mathematics, sequencing analysis. were also awarded for the best computer science and life sciences oral presentation (Miaomiao Zhou, gathered to discuss the latest de- Wikipathways Lectures and Radboud University Nijmegen, on velopments in bioinformatics in a oral presentations were inter- LocateP, a software tool to predict lively setting. changed with interactive poster the subcellular-location of bacterial sessions and three parallel work- proteins on a genome-wide scale) Seven sins Antoine van Kam- shops in diverse topics: Taverna and best poster (Jeroen de Rid- pen, scientific director of NBIC, (workflow management), Usability der, Delft University of Technology; opened the conference with an & Valorisation and Speed reading. development of logic networks to overview of NBIC projects, which In the evening researchers dem- derive associations between muta- was followed by thematic ses- onstrated the software they have tions causing tumour development). sions with oral presentations from bioinformatics researchers who are working in these projects. To broaden the scope of the pro- gramme, three key-note speakers were invited from scientific dis- ciplines adjacent to bioinformat- ics. Carole Goble (University of Manchester) presented ‘The seven deadly sins of bioinformatics’ tak- ing the perspective of information and workflow management. Key points in her presentation were: share, re-use and use (open source)

NBIC is preparing phase II

The Netherlands Bioinformatics ond phase of the Netherlands Bioin- ence questions in the five domains: Centre (NBIC) was set up in 2003 formatics Centre. agro, food, health, pharma, and in- and is one of the technology centres dustrial fermentation. The main goal of the Netherlands Genomics Initia- In the first phase NBIC focussed on of the second phase (NBIC-II) will tive (NGI). On April 1st 2008 NBIC generating a Dutch bioinformatics be to ensure a strong focus on using handed over the 25 million Euro network and infrastructure con- this infrastructure for actually solv- NBIC-II business plan 2009-2012 to sisting of the people, software, and ing the questions raised by the NGI NGI as a proposal to launch the sec- hardware needed to solve life-sci- genomics centres and other major SECTION interface 22 In the picture Issue 1 | 2008 research consortia. The focus will be BioRange II - high-throughput sequencing on the utilisation of bioinformatics - Sequence-based bioinformatics - Systems bioinformatics expertise and tools in life science - Genotype – phenotype model- research, including opportunities for ling BioWise II exploitation. NGI has formed an in- - Proteomics and metabolomics - Research school for bioinformat- ternational committee to review the - Systems bioinformatics ics PhD students and post-docs NBIC-II business plan and NGI will - track to increase the exchange make a decision about continuation BioAssist II – support platforms of MSc students between uni- of NBIC towards Summer 2008. - Integrated analysis of functional versities genomics data - activities to raise interest of NBIC II Business plan 2009-2012: - Proteomics data management MSc, BSc and high school stu- research, support and education and analyses dents for bioinformatics NBIC runs a series of programmes - Metabolomics data management - Links to international education which will be continued and fo- and analyses initiatives cussed in the second phase. - Biobanking

NBIC Research: BioRange highlights

BioRange is the bioinformatics - Prof. Ritsert Jansen’s group also facilitate metabolomics-based research programme coordinat- (University of Groningen) has de- biomarker research. IBM is collabo- ed by NBIC, currently covering 25 veloped a novel approach towards rating in this project by building a projects. BioRange groups develop software development called Mol- parallel implementation of the algo- novel bioinformatics methods and genis, which can be used to quickly rithms. Accepted for publication in tools that lead to new scientific in- produce customized software infra- Analytical Chemistry. sights in biology. Recent highlights structures that “systems biologists of this programme are: really want to have”. It has been - Dr. Hans van Beek and his col- published in Nature Reviews Genet- leagues recently published the de- - In the new tradition of Wikipe- ics and is part of Morris Swertz’s velopment of their software pack- dia, Dr. ’ group (Eras- thesis. In addition, Molgenis will be age FluxSimulator, which can help in mus MC/Leiden UMC) has recently integrated with Taverna. the simulation of the distribution of launched Wikiproteins (http://www. isotopes in metabolic networks. The wikiproteins.org), which provides a - Prof. Rainer Bisschof’s group tools are easy to use by biologists in method for community annotation (University of Groningen) has re- the field. Furthermore the software of proteins on a large scale. Using cently developed the algorithms package FluxEs was developed for the proprietary term “knowlet” and COW-CODA and 2Dwarp, which al- the estimation of fluxes in meta- containing quantitative and quali- low the time alignment of complex bolic models from NMR data, with tative information on the relation- chromatograms resulting from LC- successful validation on experi- ship between different concepts, MS analysis of complex samples, mental data. Published in Journal of WikiProteins calls on a million such as prepared serum or urine. Statistical Software (January 2007, minds to improve the global knowl- 2Dwarp facilitates automatic data Volume 18, Issue 7). edge base on protein function. Pub- processing and statistical analy- lished in the open access journal sis of proteomics data for biomar- More information: http://www.nbic. Genome Biology, May 2008. ker research. It has the potential to nl/research/biorange/

Results and progess of BioRange projects are presented at the yearly NBIC consortium meeting. Life scientists and computer scientists meet each other, sometimes while enjoying a game of chess in the NBIC café. SECTION interface In the picture Issue 1 | 2008 23

Bioinformatics Support: develop- ment of BioAssist

NBIC groups are currently preparing The platforms are built upon an The scientific programmers at the the implementation of six support ICT-infrastructure for life sciences, collaborating institutes have re- platforms to provide the necessary which has been developed in close cently started the process of creat- bioinformatics support for the fol- cooperation with the e-science/ ing bioinformatics support plat- lowing scientific domains: ICT-programs Virtual Laboratory for forms and the first tangible results 1. Integrated analysis of functional e-science (VL-e) and BiG-Grid. Par- will become available in the course genomics data ticipants in a support platform bring of this year. Please check the Bio- 2. Proteomics data management together existing tools, databases Assist-pages of the NBIC-website and analysis and the expertise developed within (http://www.nbic.nl/support/bio- 3. Metabolomics data management various projects and institutes, and assist/). Here you can also find an and analyses make them available to researchers overview of Applications developed 4. Biobanking in bioinformatics and life scientists. in BioRange/BioAssist: 5. high-throughput sequencing http://www.nbic.nl/support/apps/ 6. Systems bioinformatics

Bioinformatics education: BioWise

The NBIC Education programme formatics that provides state-of- schools and graduate schools in the BioWise aims to initiate, stimulate the-art and high-level education Netherlands. and organize bioinformatics educa- for bioinformatics PhD students. tion initiatives in the Netherlands These PhD students will, in turn, - NBIC participates in the na- and serves as a stepping stone for provide education to academic tional high-school project (“trav- the best training in bioinformatics master and bachelor students and elling DNA-labs”) with a bioinfor- and related disciplines. Together will contribute to the training of matics practical called “Surfing with the expert groups in the field, biologists. through your genes”, developed by NBIC is setting up activities for the Radboud University of Nijmeg- PhD-students, BSc/MSc-students - the website http://www.nbic. en. In three years, more than 5000 and even high school students. nl/biowise/courses/ provides in- high school pupils have become formation on courses of interest acquainted with bioinformatics The following gives an illustration of for students and researchers in the techniques, i.e. for DNA decoding, the activities in BioWise. field of bioinformatics. The over- database searching and drug de- view includes courses at BSc-level, sign. See: www.bioinformatica-in- - NBIC has started the implemen- MSc-level and PhD/post-doc-level, de-klas.nl and www.bioinformatics- tation of a Research School Bioin- offered by universities, polytechnic at-school.eu

Colofon

Interface is published by the Netherlands Joost Kok, Leiden Institute of Advanced Printing Bioinformatics Centre (NBIC). The magazine Computer Science Bestenzet, Zoetermeer aims to be an interface between developers Marco Roos, Institute for Informatics and users of bioinformatics applications. Harold Verstegen, Keygene N.V. Disclaimer Although this publication has been prepared Netherlands Bioinformatics Centre Editorial Board NBIC with the greatest possible care, NBIC 260 NBIC Antoine van Kampen cannot accept liability for any errors it P.O. Box 9101 Ruben Kok may contain. 6500 HB Nijmegen Marc van Driel t: +31 (0)24 36 19500 (office) Karin van Haren To (un)subscribe to ‘Interface’ please send f: +31 (0)24 36 19395 an e-mail with your full name, organisation e: [email protected] Coordinating and editing and address to [email protected] w: http://www.nbic.nl Marian van Opstal Bèta Communicaties, The Hague Copyright NBIC 2008

Editorial Advisory Board Design and lay-out Maurice Bouwhuis, Sara Computing and CLEVER°FRANKE, Utrecht Networking Services Timo Breit, Swammerdam Institute for Life Photography Sciences Thijs Rooimans Roeland van Ham, Plant Research Victor de Jager International, Wageningen UR Dick van Aalst Martijn Huynen, UMC St Radboud Nijmegen Stockphotos from Istockphoto SECTION interface Column Issue 1 | 2008 Column Collaborate or perish! Let’s be honest about it: I am a biologist, not a bioinformatician. About nine years ago I came back from the USA, fairly frustrated that Timo even life sciences researchers at Harvard and MIT hardly used the then already available and abundant DNA-sequencing data. So I set out to become a bioinformatician and to do biological research Breit by means of computer-aided data integration. Unfortunately, my lack of computer skills and affinity proved to be insurmountable obstacles. Well, if you can’t do it yourself, persuade someone who can to do it for you. So I gathered a team of multidisciplinary experts around me. They were quite a collection of scientific species: biologists, omics- technologists, (theoretical) physicists, computer scientists, infor- maticians, statisticians, ecologists etc. I learned a great deal from them, mainly that I know incredibly little and that any discipline has a – for me unattainable – depth and specialization. For instance, (theoretical) physicists told me that biology is not a real science. Computer scientists confused me with their lingo of re-interpreted common scientific terminology. Statisticians juggling with numbers showed me the true value of ‘statistical significance’. Informati- cians convinced me of the dangers of non-standardized data/ vocabularies. And with omics-biologists, I learned that if we apply the classical research approach/designs in omics experiments, we will get lost due to lack of resolution/data and scientific igno- rance. Above all, they showed me that it are not the solutions to the omics-based research problems that lay in those enabling do- mains, but the expertise to investigate these problems collectively. Therefore, starting a multidisciplinary e-BioScience group ended Head of MicroArray Department & up being merely the beginning, a seed that has to grow into an e- Integrative Bioinformatics Unit, Science network of expert scientific and infrastructure groups. University of Amsterdam But collaborating in word is a far cry from collaborating in practice. Besides working up respect for one another’s expertise (and idio- syncrasies), and acceptance of mutual dependence, there are quite a few obstacles to overcome. For instance, it is invariably impos- sible for multidisciplinary scientists to publish their results in high- indexed (mono-disciplinary) journals. Additionally, a major part of the e-(Bio)Science approach consists of building infrastructure, not exactly highly valued in academic circles. Multidisciplinary collabo- rations are initially also time consuming. Together these factors limit the individual scientist to stand out, and as we all know, it’s “publish or perish”. Yet, since omics technologies have fundamentally and permanently changed life-sciences research, we have to start taking multidis- ciplinary collaboration seriously or we are merely wasting a great deal of time plus money and hampering scientific progress. With this in mind, we should learn from the omics drug-discovery pipe- line disillusion. Maybe the success of the physicists and their ongo- ing mega-experiments at CERN presents a better example. There are some hopeful signs in the Netherlands: the community building effort of NBIC BioRange, (bio)informatics infrastructure building in BioAssist and BigGrid; foundation of the Netherlands Institute for Systems Biology; and transformation of the Virtual-Laboratory for e-Science project into a National e-Science Centre. But the main change has to be in the hearts and minds of the researchers. Thus I foresee an exciting future for life-sciences research, if we all genu- inely adopt this new motto: Collaborate or perish!