<<

The EMBL-European Institute Guide to data resources Copyright © 2012 EMBL-European Bioinformatics Institute Cover image courtesy of Samuel Kerrien The European Bioinformatics Institute Guide to data resources

Contents

About this guide 2 and 2 expression 3 and 4 Macromolecular structures 4 Small 5 and reactions 5 Interactions, pathways and networks 6 Patent sequences 6 Literature and text mining 7 Ontologies 7 Tools 8 Genes and genomes About us Next-generation has transformed the task of collecting DNA The EMBL-European Bioinformatics Institute (EMBL-EBI) maintains one sequence data. We no longer sequence just single genomes but also sets of of the world’s most comprehensive related genomes in structured studies. This has made possible the 1000 collections of freely available Genomes Project, an international initiative to build a comprehensive biological databases. Our free tools catalogue of all common human variation. We also capture genetic variation allow you to analyse this information of medical relevance. Such data require careful attention to access control, and share it with the research ensuring that only authorised scientists can use them, and that they do so in a community. way that reflects the spirit of consent and confidentiality. About this guide Ensembl Ensembl produces and maintains both automatic This document provides an overview www.ensembl.org and manually curated annotation on selected of EMBL-EBI’s core resources. You eukaryotic genomes. Automatic annotation can find overviews of these resources is based on mRNA and information. in Train online. Ensembl provides valuable insights into variation www.ebi.ac.uk/training/online within and between species, and allows you to compare whole genomes to identify conserved Support elements. It is integrated with several other Detailed help documentation important molecular resources, for example is available through each of the UniProt, and can be accessed programmatically. resources described here. You can Ensembl is developed as a joint project between also find answers to some of your EMBL-EBI and the questions through our homepage: Sanger Institute. www.ebi.ac.uk/help Ensembl Genomes is a portal to -scale Training www.ensemblgenomes.org data from , , fungi, and metazoa, through a unified set EMBL-EBI offers training at all of interactive and programmatic interfaces. levels. Our hands-on training courses Domain-centric resources (focused on particular are held regularly on the Genome areas of scientific interest) are developed in Campus in Hinxton, UK and at host institutions throught the world. collaboration with the appropriate scientific Train online guides you through communities, and integrated in the broad many bioinformatics resources, in taxonomic context through comparative analysis your own time and at your own pace. and standardised annotation. www.ebi.ac.uk/training European Nucleotide ENA, a member of the International Sequence Stay in touch Archive Database Collaboration, contains all the www.ebi.ac.uk/ena nucleotide sequences in the public domain You can find EMBL-EBI news, service and consolidates data from EMBL-Bank, the updates and training opportunities on European Trace Archive and the Sequence Read our website, Twitter and Facebook. Archive. The EBI Search returns results from www.ebi.ac.uk selected annotated sections of the ENA. @emblebi European Genome- EGA allows you to explore datasets from numerous genotype experiments – including /EMBLEBI phenome Archive www.ebi.ac.uk/ega case-control, population and family studies – that are supplied by a range of data providers.

Metagenomics Portal Our Metagenomics service is an automated www.ebi.ac.uk/metagenomics pipeline for the analysis and archiving of metagenomic data, and provides insights into the functional and metabolic potential of a sample.

Search www.ebi.ac.uk Our search service is your gateway to a world of information spanning DNA, genes, functional genomics, proteins, structures, small molecules, enzymes, interactions, pathways, scientific publications and patent sequences. 2 Leanne, discovery

Genome-wide expression assays, originally using microarrays and more recently high-throughput sequencing, can answer specific biological questions and provide reference data sets. The associated large-scale expression datasets can be used to answer questions that might not be related to the study for which the data were originally generated. For example, a gene expression study that reveals differentially expressed genes characteristic of a particular type of cancer may also reveal candidate genes for therapeutics development. EMBL-EBI’s functional genomics resources provide easy access to gene expression data and related information. Our visualisation tools collate information from multiple data sets and present it in an intuitive way. Leanne works on discovery and validation of candidate drug targets in a pharmaceutical company. She explains: ‘I do a lot of cloning and expression analysis. I need to find published cDNA sequences and their orthologues, so that I can find appropriate models in which to validate my targets. For discovery, I’d like a simple way to search across all public-domain biological information. A colleague mentioned a gene to me in passing recently and I only caught the name of it. I’d like to be able to find out more about it without having to forage around in lots of different publications.’ ArrayExpress The MIAME-standard compliant ArrayExpress www.ebi.ac.uk/arrayexpress Archive stores functional genomics experiments EBI resources for performed using RNA-Seq/ChIP-Seq and Leanne: array-based technologies. EBI Search for quick summary Expression Atlas The Expression Atlas allows you to search for reports on specific genes. www.ebi.ac.uk/gxa gene expression changes measured in various cell BLAST to root out types, parts, disease states and many vector contamination. other biological and experimental conditions. It represents a curated subset of the ArrayExpress The Expression Atlas for Achive experiments, and can be downloaded for profiles and local use. functional genomics. UniProt to learn more about R-Workbench The R-Workbench provides access to R and protein function. www.ebi.ac.uk/Tools/rcloud Bioconductor in the EBI cloud, and rapid access to data stored in EMBL-EBI databases such as ChEMBL for druggability of ArrayExpress. You can upload your own data and target molecules. use EMBL-EBI’s infrastructure to perform your data analyses.

3 Francesco, marine biologist Proteins and proteomics

Our protein and nucleotide data are extensively linked, both at the level of underlying data and through the coordination of web resources. This helps us capture, organise and interpret sequence-related data, providing information freely in a variety of formats including user-friendly web sites. Wherever possible, we collaborate with other groups worldwide with similar aims.

UniProt This comprehensive resource organises protein www..org sequence information so that you can find splice variants, functional information and links to related Francesco is Head of Department resources from a single entry. UniProt is a joint in a new collaborative centre effort with the Swiss Institute of Bioinfomatics and for marine biology. His group is Georgetown University. It comprises four components: doing metagenomic studies of the UniProt Knowledgebase (UniProtKB), extreme marine environments but Reference Clusters (UniRef), Archive (UniParc) his department has just secured and Metagenomic and Environmental Sequences funding to look for novel wound- (UniMES). UniProtKB is the hub for the collection healing compounds in marine of functional information on proteins, with accurate, algae. “We’re about to scale up our consistent and rich annotation. sequencing efforts severalfold. We need to put some pipelines in place InterPro InterPro classifies proteins into families and predicts that will enable us to make sense of www.ebi.ac.uk/ the presence of important domains and sites. all the data that we’re generating. InterProScan allows you to query your sequence The extremophile project is a against the database. collaborative effort, and we want to be able to share information with PRIDE The Proteomics Identifications Database is a our research network.” www.ebi.ac.uk/pride centralised, standards-compliant, public repository for proteomics data. It contains protein and EBI resources for identifications and their associated supporting Francesco: evidence. PRIDE also captures details of post-translational modifications. European Nucleotide Archive for DNA barcodes to identify in his samples. His team could set up a pipeline Macromolecular structures to submit their data to ENA and Three-dimensional structures give us mechanistic insight into how make it publicly available. macromolecules work, and help explain how their functions are disrupted by Ensembl and Ensembl Genomes mutation or interaction with small molecules. Our structural data resources to characterise organisms by are engineered to reflect the needs of a growing and diverse structural comparing them with a wealth of biology community. information on the genomes of many fully sequenced organisms. PDBe The Protein Data Bank in Europe (PDBe), part of the InterProScan to help identify www.ebi.ac.uk/pdbe Worldwide Protein Data Bank (wwPDB), allows you domains and motifs in the novel to search all the macromolecular structures in the sequences he is generating that public domain. Its tools help you perform sophisticated may imply specific functions. analyses of the data, including sub-structure searches and structural comparisons. You can analyse detailed UniMes, a repository specifically molecular interactions and correlate them with developed for metagenomic and sequence or structure patterns. environmental data. Metagenomics Portal for PDBsum This pictorial database provides an overview of powerful analysis of samples www.ebi.ac.uk/ macromolecular structures deposited in the Protein containing a variety of organisms. Data Bank archive. Web Services to help create pipelines. ProFunc ProFunc allows you to identify the likely biochemcial www.ebi.ac.uk/profunc function of a protein from its 3D structure.

4 John, Small molecules medicinal chemist

EMBL-EBI’s cheminformatics resources help bridge the gap between small molecules and the macromolecules with which they interact in living systems. In addition to the resources listed below, PDBe offers small services that are associated with macromolecular structures.

ChEBI Chemical Entities of Biological Interest (ChEBI) www.ebi.ac.uk/ is a manually annotated dictionary of small molecular entities, for example any constitutionally or isotopically distinct , molecule, or radical. It provides a wide range of related chemical John is a medicinal chemist information such as formulae, links to other in a large, multinational databases and a controlled vocabulary that describes pharmaceutical company. the ‘chemical space’. “I’m constantly looking for new information, so searchable ChEMBL ChEMBL, a database of bioactive compounds, databases of existing chemical www.ebi.ac.uk/ focuses on interactions between small molecules and literature are very useful,” he says. their macromolecular targets, including medicinal “For hit and lead optimisation, chemistry, clinical development and therapeutics new ideas that could get me data. Pharmaceutically important gene families in into novel chemical space would ChEMBL can be viewed in the GPCR and Kinase be great. Information on my SARfari web portals. pharmacophore would be really useful too; things like: is it RESID This collection of annotations and structures for likely to be toxic? Or, can we www.ebi.ac.uk/RESID protein modifications includes amino-terminal, predict if analogues will be carboxy-terminal and peptide chain biologically active?” post-translational modifications. EBI resources for John: MetaboLights MetaboLights is a database for metabolomics experiments and derived information. It is ChEMBL for detailed www.ebi.ac.uk/ information about over a million metabolights cross-species, cross-technique and covers metabolite structures and their reference spectra as well as their bioactive compounds. biological roles, locations and concentrations, and Kinase SARfari, an integrated experimental data from metabolic experiments. workbench linking kinase sequences with their structures, compounds that Enzymes and reactions target them and screening data. GPCR SARfari for G-protein- The Portal helps you explore the biology of enzymes and proteins coupled receptors. with enzymatic activity. It integrates publicly available information about enzymes, such as small-molecule chemistry, biochemical pathways and drug ChEBI for information about compounds. The Enzyme Portal summarises information from: small molecular entities and links to the macromolecules with •• The UniProt knowledge base; which they interact. •• The Protein Data Bank in Europe; PDBe for 3D structures of •• Rhea (www.ebi.ac.uk/rhea), a database of enzyme-catalysed reactions; macromolecules, including •• Reactome, a database of biochemical pathways; protein–ligand interactions. •• IntEnz (www.ebi.ac.uk/IntEnz), a resource with enzyme nomenclature information; •• ChEBI and ChEMB (see above); •• Cofactor (www.ebi.ac.uk/thornton-srv/databases/CoFactor) and MACiE (www.ebi.ac.uk/thornton-srv/databases/MACiE) for highly detailed, curated information about cofactors and reaction mechanisms. The Enzyme Portal covers a large number of species including the key model organisms, and provides a simple way to compare orthologues. EMBL-EBI also hosts the Catalytic Site Atlas, a resource of catalytic sites and residues that have been identified in enzymes using structural data: www.ebi.ac.uk/thornton-srv/databases/CSA 5 Leon, postdoc Interactions, pathways & networks

Computational systems biology takes us beyond the identification of molecular ‘parts lists’ for living organisms and towards synthesising information to generate and test new hypotheses about how biological systems work.

IntAct The IntAct molecular interaction database and analysis www.ebi.ac.uk/intact system provides a central, standards-compliant repository of molecular interactions, including protein–protein, protein–small molecule and protein–nucleic Leon is a postdoc working on acid interactions. quorum sensing in bacteria. He wants to understand what makes Reactome Reactome is a database of human biological pathways built a normally harmless bacterium www.reactome.org from connected reactions that encompass all biological pathogenic in the lungs of people events as well as classical biochemical events. Content with cystic fibrosis. “I’m using a is authored by expert and linked to scientific combination of transcriptomics, literature that experimentally validates the steps in a proteomics and metabolomics pathway. The goal is to represent the current biological to understand these pathogenic consensus of human pathways, as a free online reference changes better,” he says. “I end up and a downloadable core dataset. Reactome offers tools for with big spreadsheets of protein or searching and viewing pathways as interactive diagrams, gene IDs and I’m trying to piece and for analysing datasets by comparing them with other together which signalling pathways pathways and overlaying expression or interaction data. are involved in flipping to the pathogenic state. When I come BioModels Biomodels is a database of published mathematical models across a gene or protein that’s not www.ebi.ac.uk/ describing biological processes. You can simulate the familiar to me, I want to be able to biomodels models online and use them to develop your own. Biologists find out about it quickly and put can store, search and retrieve published and annotated it in the context of other genes or mathematical models of biological interest. proteins that are expressed at the same time.” PATENT ­Patent sequences EBI resources for Leon: EBI Search for quick summary We offer free, unrestricted access to dedicated patent sequence databases reports of genes or proteins. that contain valuable information about patent family data, patent Enzyme Portal to learn more equivalents and ‘earliest priority’ dates. These databases also offer access to about a particular enzyme that full-text patent documents. switches a pathway on or off. Our non-redundant patent sequence databases help you search efficiently InterPro for uncharacterised and return de-duplicated results, saving valuable time and effort. proteins or novel protein families Patent nucleotide and protein sequences are contributed by the European and domains. Patent Office (EPO), US Patent and Trademark Office, and the Japanese UniProt for the function of (JPO) and Korean (KIPO) patent offices, including World Intellectual proteins in his lists of IDs. Property Organization (WIPO)-approved patents from these offices. Further agreements with patent offices from Brazil, Canada, China and to find out Mexico have been achieved or are in progress. which ones are involved in the same processes. www.ebi.ac.uk/patentdata/nr Reactome to map gene and protein lists to pathways. Leon could also submit his transcriptomics and proteomics experiments to ArrayExpress and PRIDE to analyse his results in the context of other similar experiments.

6 Judith, Literature and text mining geneticist

We develop free public resources based on the scientific literature, and lead the development of UK PubMed Central. Our resources provide links to full-text articles and maintain cross-references to biological databases such as UniProt, ENA, PDBe and InterPro. The CiteXplore search engine integrates text-mining functions to build links to biological concepts and the underlying databases of proteins, genes and gene ontology.

CiteXplore CiteXplore covers 25 million abstracts inlcuding www.ebi.ac.uk/citexplore PubMed, Agricola and patents. It combines literature search with text-mining tools, Judith works for a plant-breeding citation networks and cross references to other institute that is using genomics to bioinformatics resources. identify new crop strains that are resistant to drought, salt and fungal UK PubMed Central UKPMC contains over 2 million full text life diseases. “We’re doing linkage http://ukpmc.ac.uk science research articles, of which around 250 000 studies to find out which genes are are . Combined with the CiteXplore involved in resistance to different content and functions listed above, it provides types of stress,” she explains. “Some integrated text-mining tools as well as of the crops we work on haven’t grant-reporting services. been sequenced yet. We’ve got a lot of data, both on genomic variants and on differences in expression Ontologies levels, that we need to map on to well-characterised plants to make The wide variations in terminology between different research communities sense of it. We also need to find out pose a challenge both for the researcher conducting a search and the whether any of our sequences have computer executing it. Ontology services help identify the most useful associated patents so that we don’t information by identifying functionally equivalent terms. waste time and effort.”

Gene Ontology The Gene Ontology Consortium provides a controlled EBI resources for vocabulary to describe gene and gene product attributes www.geneontology.org Judith: in any organism. Ensembl Genomes for comparative analysis and GOA provides Gene Ontology assignments for proteins GO Annotation orthologue prediction. for all species in the UniProt Knowledge Base. www.ebi.ac.uk/QuickGo ArrayExpress to find related ChEBI The ChEBI chemistry ontology provides a controlled functional genomics data. Judith vocabulary to describe small molecules, including their could also submit her expression www.ebi.ac.uk/ChEBI biological and chemical roles. data to ArrayExpress. Expression Atlas to see which Systems Biology The SBO project develops controlled vocabularies and genes are expressed under Ontologies ontologies for problems in systems biology. SBO is part similar conditions. of BioModels.net. www.ebi.ac.uk/sbo Gene Ontology for matching processes and functions to plant Experimental EFO is used to describe experimental variables in genes and their products. Factor Ontology functional genomics experiments, and is cross- referenced to more than 20 species or domain-specific Reactome to map genes www.ebi.ac.uk/EFO terminologies. It provides a way to integrate databases to pathways. using different terminologies and is deployed in The EPO-EBI Non-redundant searching for the ArrayExpress Archive and Expression patent databases to search for Atlas databases. prior art.

Ontology Lookup OLS is a centralised query interface for ontology and Service controlled vocabulary lookup. It features interactive graphics that clarify the relationships between terms. www.ebi.ac.uk/ ontology-lookup

7 Ola, clinical researcher Tools

We provide a comprehensive range of bioinformatics tools. Here we highlight a selection of those used most frequently, but you can find a more extended list on our website (www.ebi.ac.uk/Tools). For programmatic access, take a look at our Web Services (SOAP or REST APIs): www.ebi.ac.uk/Tools/webservices

Nucleotide sequence search Protein sequence search You can use these tools to search You can use these tools to search nucleotide sequence databases, protein sequence databases including Ola is an MD/PhD. When she’s not including ENA and resources for coding UniProtKB, sequences derived practising as an oncologist, she’s sequences, immunoglobulins and high- from macromolecular structures, working on her research project. throughput sequences: immunoglobulins and sequences “I’m looking for proteomics-based from patents: biomarkers for the early detection •• BLAST Nucleotide of bladder cancer, specifically •• ENA Sequence Search •• BLAST Protein from transitional-cell (Exonerate) •• PSI-Search carcinomas that are shed into the urine,” she explains. “I do mass spec Multiple Pairwise sequence alignment of samples from patients coming in for biopsies. I need to identify the For alignment of three or more For alignment of two sequences to proteins whose production is up- or sequences to identify regions of identify regions where the sequence is down-regulated. I’ve started looking conservation, which may indicate conserved: at post-translational modifications functional constraints and infer •• Needle evolutionary relationships: on these proteins, and have found •• LAlign a phosphoprotein that seems to be •• Omega upregulated in some patients. I’d •• ClustalW2 Functional genomics tools like to know more about the cellular •• WebPRANK processes leading to this change. To explore data from gene expression This might help us find novel •• T-Coffee experiments: therapies, too.” •• MUSCLE •• Expression Atlas •• EFO Tools EBI resources for Ola: Protein functional analysis •• EBI R-Workbench PRIDE for proteomics For identification of potential protein experiments that have identified families, domains and functional sites proteins from carcinoma cells. Text mining allowing the the function of a protein Ola could also submit her own sequence to be predicted: For analysis of text documents to research results to PRIDE. identify important terms and relate them •• InterProScan UniProt to find out about the to database entries and ontologies: •• Pratt phosphorylation sites on her •• EBIMed •• Phobius new biomarker and what •• Whatizit type of kinase might catalyse •• RADAR these changes. Data retrieval and ID mapping IntAct to find evidence of other Molecular structure analysis •• dbfetch proteins that interact with her These tools help you determine and protein of interest. visualise macromolecular structures: •• ENA / SRA •• PDBeFold •• ENA / EMBL-SVA •• QuaternaryStructure •• UniProt / UniSave •• SRS •• PICR

8 The European Molecular Biology Laboratory EMBL-European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD, UK

Tel: +44 (0) 1223 494 444 Fax: +44 (0) 1223 494 468 E-mail: [email protected]

Web: www.ebi.ac.uk Twitter: @emblebi Facebook: /EMBLEBI

Find this brochure online: www.ebi.ac.uk/information/brochures