Annual Scientific Report 2009

European Bioinformatics Institute

EMBL-European Bioinformatics Institute
Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SD
United Kingdom
Tel. +44 (0)1223 494444, Fax +44 (0)1223 494468
www.ebi.ac.uk

EMBL Heidelberg
Meyerhofstraße 1
69117 Heidelberg
Germany
Tel. +49 (0)6221 387 0, Fax +49 (0)6221 387 8306
www.embl.org
[email protected]

EMBL Grenoble
6, rue Jules Horowitz, BP181
38042 Grenoble, Cedex 9
France
Tel. +33 (0)4 76 20 72 69, Fax +33 (0)4 76 20 71 99

EMBL Hamburg
c/o DESY
Notkestraße 85
22603 Hamburg
Germany
Tel. +49 (0)40 89 902 110, Fax +49 (0)40 89 902 149

EMBL Monterotondo
Adriano Buzzati-Traverso Campus
Via Ramarini, 32
00015 Monterotondo (Rome)
Italy
Tel. +39 06900 91402, Fax +39 0690091406

Texts: EMBL-EBI Group and Team Leaders
Layout, editing and cover design: Vienna Leigh, EMBL Office of Information and Public Affairs
Louisa Wright, EMBL-EBI Outreach Programme Project Leader

Contents

SECTION 1: INTRODUCTION 5
Foreword 7
Highlights of 2009 9

SECTION 2: SERVICES IN 2009 15
The Activities of the PANDA Group 17
The European Nucleotide Archive Team 31
Vertebrate Genomics 37
The Ensembl Genomes Team 45
The Proteomics Services Team 51
The InterPro Team 57
Computational Chemical Biology: the ChEMBL Team 61
Chemoinformatics and Metabolism 65
Database Research and Development 71
The GO Editorial Office 75
The Microarray Informatics Team 79
The Microarray Software Development Team 85
The Protein Data Bank in Europe (PDBe) Team 89
Developing and Integrating Tools for Biologists 93
Literature Resource Development 97

SECTION 3: RESEARCH IN 2009 103
The Bertone Group: differentiation and development 105
The Enright Group: functional genomics and analysis of small RNA function 111
The Goldman Group: evolutionary tools for sequence analysis 117
The Le Novère Group: computational systems neurobiology 123
The Luscombe Group: genome-scale analysis of regulatory systems 129
The Rebholz-Schuhmann Group: semantic standardisation of the scientific literature 135
The Thornton Group: computational biology of proteins 141

SECTION 4: SUPPORT IN 2009 147
Outreach and Training 149
Industry Support 159
Systems and Networking 163
External Services Team 165

SECTION 5: FACTS AND FIGURES 169
Services and Research 171
Publications 177
Major Database Collaborations 185
Scientific Advisory Boards 187
External Seminar Speakers 189

INDEX 191

Section 1 Introduction

Foreword

Janet Thornton, Director
Graham Cameron, Associate Director

Welcome to EMBL-EBI’s 2009 Annual Scientific Report. With the emergence of ever-more powerful sequencing and the need for improved translation of biological discoveries into applications, EMBL-EBI’s mission, to provide bioinformatics services for biomolecular data, to perform basic bioinformatics research, to provide user training and to support industry, has never been more important.

This year has seen major developments in capturing and storing data from next-generation sequencing machines. These are powering many international projects, including the 1000 Genomes Project to reveal the molecular basis of human variation. With our international colleagues, the EBI is providing the public archive for the data generated. The computational and storage needs of all these projects are tremendous, and require an enormous growth in both the computer power and the storage needed to fulfil these commitments.

This year has also seen ChEMBLdb go live. This is a freely available web resource for chemical biology and drug discovery research. ChEMBLdb provides worldwide access to a large number of medicinal chemistry lead optimisation experiments (usually known as Structure Activity Relationship, or SAR data) reported in the primary literature.
The ChEMBLdb resource is highly complementary to existing resources such as UniProt, PDBe and ChEBI, but adds an important new capability to address the needs of the pharmaceutical and biotechnology industries, and the academic chemical biology communities with chemical data.

The EBI’s research is diverse and flourishing, producing both exciting discoveries and the development of powerful new tools to handle the flood of data. We have contributed to understanding how genome structure affects its function; to defining the repertoire of transcription factors, which provide regulatory mechanisms underlying biological processes in humans; and to elucidating the effect of deleting a component of the nutrient-responsive mTOR (mammalian target of rapamycin) signalling pathway, which led to increased life span and resistance to age-related pathologies in mice. In parallel we have developed new resources, including leading the worldwide effort to develop a Systems Biology Graphical Notation, which is a new ‘language’ to describe and visualise all kinds of biological knowledge, from gene regulation to metabolism and cellular signalling.

Our services are being used ever-more frequently by scientists in Europe and worldwide, with around 3 million web hits per day on the EBI website and over one million compute jobs per month run at the EBI. As the data resources develop, our commitment to providing training for users increases. This year we have run many new workshops and courses, both at Hinxton and throughout Europe, and participated in many conferences, in total reaching an estimated 24,000 researchers.

Our Industry Programme has also grown, invigorated by the new computational chemistry developments at the EBI and bringing new ideas both for services and for workshops.
The Innovative Medicines Initiative (IMI), run by the European pharmaceutical industry and the European Commission, offers a welcome opportunity to address the needs of this sector more directly with explicit grant funding, involving the EBI in several new projects.

The additional funding made available by EMBL Council in this indicative scheme is being used to consolidate the core set of data resources provided by EMBL-EBI, but longer-term funding is still precarious. The ELIXIR preparatory phase, which aims to develop a plan to construct and operate a sustainable infrastructure for biological information in Europe, is beginning to bear fruit. However, although two countries have already committed to this project, there is still much work to be done to define the scope and secure stable funding for this infrastructure for the future.

All our efforts at the EBI rely on extensive interactions with colleagues in Europe and throughout the world. The deposition of new data, the daily exchange of information between data resources, the joint development of software tools, the sharing of curation tasks and the challenges of collaborative research have built an extensive community of collaborators. It remains our privilege and pleasure to work with them and together we will continue to aim for excellence in all that we do.

Janet Thornton, Director
Graham Cameron, Associate Director

Janet Thornton, Director
PhD 1973, King’s College & National Inst. for Medical Research, London. Postdoctoral research at the University of Oxford, NIMR & Birkbeck College, London. Lecturer, Birkbeck College 1983–1989. Professor of Biomolecular Structure, University College London since 1990. Bernal Professor at Birkbeck College, 1996–2002. Director of the Centre for Structural Biology at Birkbeck College and University College London, 1998–2001. Director of EMBL-EBI since 2001.

Graham Cameron, Associate Director
Applications Programmer, EMBL Data Library, 1983–1983.
Database Administrator, EMBL Data Library, 1983–1986. Manager, EMBL Data Library, 1986–1992. Project Leader overseeing the creation of EMBL-EBI Outstation, 1993–1994. Head of Services, EMBL-EBI, 1994–1998. Joint Head of EMBL-EBI, 1998–2001. Associate Director of EMBL-EBI since 2001.

Highlights of 2009

SERVICES

Enabling optimal exploitation of biomolecular information is at the heart of EMBL-EBI’s mission. The EBI services include the provision of biological databases and tools to explore them. Our constituency includes academic and commercial researchers throughout Europe and the world, and we form a European node in many global data-sharing collaborations (figure 1). These service activities are accompanied by extensive outreach and training (concentrated mostly in Europe) with a dedicated team (see page 149), and industry users receive targeted support through the EBI’s Industry Programme (page 159).

2009 has been an interesting year for these services. The familiar ever-increasing data flow rates perpetuate exponential database growth curves, which are already etched on the retinas of service providers and funders. However, in 2009 striking qualitative changes have accompanied the familiar quantitative changes. This has resulted in new data resources, restructuring of existing resources and the demise of some obsolete resources.

Figure 1. Map showing location of EMBL-EBI’s collaborations for the major databases (EMBL-Bank, Ensembl, Ensembl Genomes, ArrayExpress, UniProt, InterPro, IntAct, PRIDE, PDBe, Reactome, GO).

One of the most striking new resources is ChEMBLdb, a database of bioactive compounds (drugs and drug-like molecules) and their quantitative properties and bioactivities. Computer methods for the storage and exploitation of chemical information – now called cheminformatics – pre-date the existence
