EMBL-EBI Annual Scientific Report 2008

EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SD
United Kingdom
Tel. +44 (0)1223 494444, Fax +44 (0)1223 494468
www.ebi.ac.uk

EMBL Heidelberg
Meyerhofstraße 1
69117 Heidelberg
Germany
Tel. +49 (0)6221 387 0, Fax +49 (0)6221 387 8306
www.embl.org

EMBL Grenoble
6, rue Jules Horowitz, BP181
38042 Grenoble, Cedex 9
France
Tel. +33 (0)4 76 20 72 69, Fax +33 (0)4 76 20 71 99

EMBL Hamburg
c/o DESY
Notkestraße 85
22603 Hamburg
Germany
Tel. +49 (0)40 89902 0, Fax +49 (0)40 89902 104

EMBL Monterotondo
Adriano Buzzati-Traverso Campus
Via Ramarini, 32
00015 Monterotondo (Rome)
Italy
Tel. +39 06 90091285, Fax +39 06 90091272

Texts: EMBL-EBI Group and Team Leaders
Layout, editing and cover design: Vienna Leigh, EMBL Office of Information and Public Affairs; Louisa Wright, EMBL-EBI Scientific Outreach Officer

Contents

SECTION 1: INTRODUCTION
  Foreword
  EMBL-EBI in 2008
  Facts and Figures
  Outreach and Training
  Industry Support
  Systems and Networking

SECTION 2: SERVICES IN 2008
  External Services Team Scientific Report
  The Activities of the PANDA Group
  Vertebrate Genomics
  The Ensembl Genomes Team
  The Proteomics Services Team
  The InterPro Team
  Chemoinformatics and Metabolism
  Database Research and Development Group Activities
  The GO Editorial Office
  The Microarray Informatics Team
  The Microarray Software Development Team
  The Macromolecular Structure Database Team
  Grid and e-Science Research and Development
  Literature Resource Development

SECTION 3: RESEARCH IN 2008
  The Bertone Group: differentiation and development
  The Goldman Group: evolutionary tools for sequence analysis
  The Huber Group: functional genomics
  The Le Novère Group: computational systems neurobiology
  The Luscombe Group: genome-scale analysis of regulatory systems
  The Rebholz-Schuhmann Group: exploitation of scientific literature for new biomedical discoveries
  The Thornton Group: computational biology of proteins – structure, function and evolution

INDEX

SECTION 4: APPENDICES
  Available online at www.ebi_SR08_appendices

Section 1: Introduction

Foreword

Welcome to EMBL-EBI’s 2008 Annual Scientific Report. Biological research has broken new ground in 2008: enabled by impressive advances in DNA sequencing technology, life scientists are now generating petabytes of data on a daily basis. EMBL-EBI’s mission, to provide bioinformatics services, research, training and industrial support, has therefore never been more important.

New DNA sequencing methods will drive a second revolution in biology, with impacts not just in basic biological research, but for personalised medicine and for measuring the biodiversity of the planet. With the start of the 1000 Genomes Project, the fundamental nature of human variation will be revealed. To address this, the EBI launched the European Genotype Archive (EGA) during 2008 to store individual genomes; the EBI has also become the custodian of the associated Trace Archive, which holds the raw sequence data. Both will play a major role in understanding human variation.

The maturing fields of systems and chemical biology have highlighted the importance of the ‘small’ molecules of life and their interactions, which also relate closely to the design of novel therapeutics and new approaches to crop management. To this end, we are developing a suite of open source chemistry resources. As well as building on ChEBI, a database of small biomolecules, we are creating ChEMBL, which will hold information on structure-activity relationships.

Research at EBI is diverse and flourishing, producing both exciting discoveries and the development of powerful new tools to handle the flood of data. Highlights this year include a new phylogenetic-aware method for multiple sequence alignments, sophisticated tools to handle tiling array and image data, and new discoveries in the control of gene expression and cellular division. We have also recruited a new research group leader, to champion the increasingly important RNA field.

Our services are being used ever more frequently by scientists in Europe and worldwide. We provide simple web access for at least 300,000 independent users every month, but increasingly scientists are using programmatic access to large amounts of data through web services technologies. Web services now account for close to a million jobs per month. The computational and storage needs of all these projects are tremendous, and this year has seen an enormous growth in both the computer power and the storage needed to fulfil our commitments.

As the data resources develop, our commitment to providing training for users increases. With a growing team of trainers, we have run many new workshops and courses this year, both at Hinxton and throughout Europe. The main focus of our training is to empower biologists to make the most of their data, but we also hold more technical workshops for computational specialists. Our Industry Programme has also grown, invigorated by the new computational chemistry developments at the EBI and bringing new ideas both for services and workshops.

With increased funding from EMBL this year, we have been able to consolidate part of Europe’s core set of data resources, but longer-term funding is still precarious. To address this, the preparatory phase of the ELIXIR project is in full swing.
The project aims to develop a plan to construct and operate a sustainable infrastructure for biological information in Europe, to support life science research and its translation to medicine and the environment, the bio-industries and society. This infrastructure will increasingly become the life-blood of life science research in Europe, allowing scientists to combine information from many sources to understand and model complex biological systems and their interactions with the environment. It is notable that all six of the other European Research Infrastructure projects in the biomedical sciences are looking to ELIXIR to support and interact with their own informatics efforts.

All our efforts at the EBI rely on extensive interaction with colleagues in Europe and throughout the world. The deposition of new data, the daily exchange of information between data resources, the joint development of software tools, the sharing of curation tasks and the challenges of collaborative research have built an extensive community of collaborators. It remains our privilege and pleasure to work with them.

Janet Thornton, Director
Graham Cameron, Associate Director

EMBL-EBI in 2008

SERVICES

Things seldom slow down for the service projects of the EBI, and 2008 was no exception. New technology in the lab has caused a huge surge in the rate of DNA sequencing, and has again revolutionised genomics and transcriptomics (see Facts and Figures, Figure 1 on page 13). The European Nucleotide Archive has accepted 10Tb of next-generation sequence data, and ArrayExpress now takes data from ultra high-throughput sequencing (UHTS) experiments. This has challenged our compute and storage systems, with our total disk space reaching 2.5 petabytes in the course of the year. This has been accompanied by a steady increase in the need for compute power, and by the end of the year the EBI’s computer facility included close to 7,000 cores.
To accommodate these increases within the available space and without saturating the power supply to the campus we have had to include more compact ‘blade’ systems. There have also been significant changes in the ‘flavour’ of our technology, with Linux-based database servers finding favour, and Apple Macs increasingly being the desktop machine of choice.

Janet Thornton, Director
PhD 1973, King’s College & National Inst. For Medical Research, London.
Postdoctoral research at the University of Oxford, NIMR & Birkbeck College, London.
Lecturer, Birkbeck College 1983–1989.
Professor of Biomolecular Structure, University College London since 1990.
Bernal Professor at Birkbeck College, 1996–2002.

The sheer volume of data stored creates substantial anxiety about back-up and disaster recovery systems, and the EBI is addressing this through a recently signed agreement to utilise a remote ‘replication’ site ten kilometres from the campus. Naturally, all of this imposes new demands on our network capacity, and we have now ordered a 10GB/second connection to London. Geography has worked in our favour and influenced our choice of location for our remote data replication centre, which sits on the route of this new network connection.

Aside from dealing with the sheer quantity of data, our nucleotide sequence efforts have been consolidated and refreshed with the imminent launch of Ensembl Genomes as a natural home for the DNA data from all organisms that have completed ‘reference genomes’. This will greatly enhance the utility of those data by exploiting the software of the Ensembl system.

On the protein sequence side, a major development has been the provision of a first draft of the complete human proteome available in UniProtKB/Swiss-Prot. The InterPro resource, which is an integrated resource for protein families and domains, has grown to over 16,500 entries, providing annotations for almost 80% of proteins in the
