Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

European Bioinformatics Institute EMBL-EBI Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by PickeringHutchins Ltd www.pickeringhutchins.com EMBL member states: Austria, Croatia, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, United Kingdom. Associate member state: Australia EMBL-EBI is a part of the European Molecular Biology Laboratory (EMBL) EMBL-EBI EMBL-EBI EMBL-EBI EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD United Kingdom Tel. +44 (0)1223 494 444, Fax +44 (0)1223 494 468 www.ebi.ac.uk EMBL Heidelberg Meyerhofstraße 1 69117 Heidelberg Germany Tel. +49 (0)6221 3870, Fax +49 (0)6221 387 8306 www.embl.org [email protected] EMBL Grenoble 6, rue Jules Horowitz, BP181 38042 Grenoble, Cedex 9 France Tel. +33 (0)476 20 7269, Fax +33 (0)476 20 2199 EMBL Hamburg c/o DESY Notkestraße 85 22603 Hamburg Germany Tel. +49 (0)4089 902 110, Fax +49 (0)4089 902 149 EMBL Monterotondo Adriano Buzzati-Traverso Campus Via Ramarini, 32 00015 Monterotondo (Rome) Italy Tel. +39 (0)6900 91402, Fax +39 (0)6900 91406 © 2012 EMBL-European Bioinformatics Institute All texts written by EBI-EMBL Group and Team Leaders. This publication was produced by the EBI’s Outreach and Training Programme. Contents Introduction Foreword 2 Major Achievements 2011 4 Services Rolf Apweiler and Ewan Birney: Protein and nucleotide data 10 Guy Cochrane: The European Nucleotide Archive 14 Paul Flicek: Vertebrate genomics 16 Paul Kersey: Ensembl genomes 18 Jane Lomax: The Gene Ontology editorial office 20 Henning Hermjakob: Proteomics services 22 Sarah Hunter: InterPro 24 Maria Martin: UniProt development 26 Claire O’Donovan: UniProt content 28 John Overington: ChEMBL 30 Christoph Steinbeck: Cheminformatics and metabolism 32 Alvis Brazma: Functional genomics 34 Misha Kapushesky: Functional genomics: Atlas 36 Helen Parkinson: Functional genomics: production 38 Ugis Sarkans: Functional genomics: software development 40 Gerard Kleywegt: Protein Data Bank in Europe (PDBe) 42 Tom Oldfield: PDBe databases and services 44 Sameer Velankar: PDBe content and integration 46 Johanna McEntyre: Literature services 48 Peter Rice: Developing and integrating tools for biologists 50 Service team members 52 Research Paul Bertone: Pluripotency, reprogramming and differentiation 56 Anton Enright: Functional genomics and analysis of small RNA function 58 Nick Goldman: Evolutionary tools for genomic analysis 60 Nicolas Le Novère: Computational systems neurobiology 62 Nicholas Luscombe: Genomics and regulatory systems 64 John Marioni: Computational and evolutionary genomics 66 Dietrich Rebholz-Schuhmann: Literature research 68 Julio Saez-Rodriguez: Systems biomedicine 70 Janet Thornton: Computational biology of proteins: structure, function and evolution 72 The EMBL International PhD Programme at the EBI 74 Research group members 76 Support Cath Brooksbank: Outreach and Training 80 Dominic Clark: The EMBL-EBI Industry Programme 82 Lindsey Crosswell: External Relations 84 Mark Green: Administration 86 Petteri Jokinen: Systems and Networking 88 Rodrigo Lopez: External Services 90 Support team members 92 Facts and Figures 2011: A year in numbers 94 Scientific advisory boards and committees 100 Major database collaborations 103 External seminar speakers 106 Publications 108 Introduction Janet Thornton Foreword Director Graham Cameron Associate Director Foreword Welcome to EMBL-EBI’s 2011 Annual Scientific Report! capture the words used to describe biological processes or track the evolution of enzymes and their ability to evolve new functions. 2011 has been a challenging and good year. The flood of data continues to rise on all fronts, and we have to We have continued to provide a safe haven for biological run ever faster just to stand still. We have developed a novel way data, adding value by careful curation and annotation and to compress and store the vast quantities of data pouring out of making data easily accessible to all scientists worldwide. high-throughput genomic experiments every day. To optimise We launched several new resources with powerful tools for our delivery of services to the community, we have continued to analysis, for example ‘Metabolights’ for metabolomic data and transfer our resources to data centres in London: a demanding a portal for metagenomics data. Our new Enzyme Portal now task that has progressed well. integrates extensive information on enzymes from several EMBL-EBI resources. The scientific literature remains the Our resources are increasingly in demand: recently, our website greatest repository of information and knowledge and, in the received over 7 million web hits in one day and almost 4 million spirit of open access, EMBL-EBI is now leading a collaboration jobs were run in one month. This reflects an ever-changing to provide UK PubMedCentral, a free online literature resource landscape of users throughout the world, and an associated rise in for life science researchers. the need for bioinformatics training. More and more of our users are translational scientists working in medicine and industry. Biological information underpins discovery. Our research To supplement our busy hands-on training and roadshow scientists are deeply involved in many collaborative projects, programmes, we launched Train online, a new resource that helps developing new algorithms and tools to extract robust, validated users learn in their own time and at their own pace. knowledge from new and existing data. Our basic research addresses a wide range of biological questions, from the We have continued to grow, and have strengthened our evolution of species to understanding synaptic plasticity and cell leadership by adding two new Team Leaders to the Protein Data differentiation. One study of note this year combined genetic and Bank in Europe (PDBe) team; we expect to bolster leadership genomic approaches to reveal how one protein helps protect the in our sequence variation resources in 2012. Our new External genome during duplication. Another shed light on how a large Relations programme, which got underway in September 2011, protein complex contributes to stem cell differentiation during has already made substantial progress in building relationships embryonic development. Through the development of new and communications with key stakeholders. models and algorithms, other efforts have elucidated the role of a We continue to coordinate the preparatory phase of ELIXIR, kinase involved in memory. Our researchers continue to generate the nascent pan-European research infrastructure for biological powerful resources for exploring biology, for example by helping information in Europe. It is our hope that ELIXIR will provide users create logic-based models of cellular signalling pathways, the facilities necessary for life science researchers to access and 2 Introduction exploit biological data in the coming decades. Following the We are very happy to welcome the many visitors who come to signing of a Memorandum of Understanding by EMBL and 11 EMBL-EBI to work with our scientists, use the data resources, nations, ELIXIR’s member states can now work together with attend training courses and participate in the stimulating EMBL to develop the International Consortium Agreement scientific environment that characterises our campus. Foreword necessary for the construction of ELIXIR. With the emergence of ELIXIR, we expect this number of visitors to increase as we work together to build a distributed From a funding perspective EMBL-EBI’s scientists have been bioinformatics infrastructure. very successful in raising funds (€23 million in 2011) for services and research through competitive grants – almost always in Recent announcements on the next generation of sequencing collaboration with others in the UK, mainland Europe or machines promise to usher in a completely new scale of data internationally. We thank our major funders (the EU, the NIH, production and analysis. The impact of these technologies the Wellcome Trust and the UK Research Councils). Clearly, on medicine and the environment is only just beginning, and without these funds we would not be able to deliver the high- requires even closer collaboration with our colleagues in the quality services and research for which we are known. clinic and in industry. We were delighted when the UK government committed Such close international collaborations are not new to EMBL-EBI. capital funding of €90 million to support EMBL-EBI and the Indeed, all our efforts rely on such interactions. The deposition establishment of the ELIXIR infrastructure hub. Over the next of new data, the daily exchange of information between data eight years this will fund computing equipment for the delivery resources, the joint development of software tools, the sharing of of all our services, including ELIXIR. In addition it will fund a curation tasks and the challenges of collaborative research have new building on the Wellcome Trust Genome Campus that will allowed us to build an extensive network of colleagues. We look provide accommodation for staff; an Industry and Innovation forward to strengthening these links in the coming year and, as Suite to promote the use biological information in applications always, to creating new collaborations. in medicine, biotechnology and the environment; and two new training rooms that will allow us to train more users. Janet Thornton Director

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

Applied Category Theory for Genomics – an Initiative

Gene Prediction: the End of the Beginning Comment Colin Semple

SD Gross BFI0403

The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe

Bioinformatics Study of Lectins: New Classification and Prediction In

Toolboxes for a Standardised and Systematic Study of Glycans

Maternally Inherited Pirnas Direct Transient Heterochromatin Formation

File Formats Exercises

UNIVERSITY of CALIFORNIA SAN DIEGO Making Sense of Microbial

Standardised Benchmarking in the Quest for Orthologs

EMBL-EBI Powerpoint Presentation

An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads