Advanced Genomic Data Mining
Total Page:16
File Type:pdf, Size:1020Kb
Education Advanced Genomic Data Mining Xose´ M. Ferna´ndez-Sua´rez*, Ewan Birney European Bioinformatics Institute, Ensembl Group, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom Introduction trieve information through different plat- pathways in the Reactome database). forms such as Galaxy [5] and the biomaRt Click on the ‘count’ button at the top to As data banks increase their size, one of package of BioConductor [11,12]. Many obtain this number. the current challenges in bioinformatics is of these tools also interact with the UCSC Next, we can enrich our search for to be able to query them in a sensible way. Table Browser and have similar approach- enzymes in the UniProt database. This Information is contained in different es using the UCSC system. We also will require the ‘linked’ or secondary databases, with various data representa- address programmatic access using Bio- dataset. Follow this description, or view tions or formats, making it very difficult to Mart’s own implementation of Web ser- the tutorials for use of the linked database use a single query tool to search more than vices (MartService). For local deployment at http://www.ensembl.org/common/ a single data source. of BioMart, see Table 1. Workshops_Online?id = 117. Data mining is vital to bioinformatics as Click on the second ‘Dataset’ option at it allows users to go beyond simple BioMart Web Interface the left of the page. Select ‘UniProt browsing of genome browsers, such as proteomes’ as the database. In this Ensembl [1,2] or the UCSC Genome Browser First we will focus on BioMart’s Web instance, we will add as a filter the Gene [3], to address questions; for example, the interface (http://www.biomart.org) to il- Ontology (GO) [15] term ‘GO:0005975’ biological meaning of the results obtained lustrate how to join two different datasets: (associated with carbohydrate metabolic with a microarray platform, or how to Reactome [13], a database of metabolic processes); this will be under ‘EXTER- identify a short motif upstream of a gene, pathways, and UniProt [14], a catalogue NAL IDENTIFIERS’, ‘Limit to pro- amongst others. There are a number of of protein information. In this example, teins…GO ID(s)’ in the secondary data- integrated approaches available, some of we need to obtain a catalogue of enzymes set. Also select, under ‘External which are described below (Figure 1). involved in carbohydrate metabolism in references’: ‘Entries with EC ID(s)’, to limit The Table Browser at UCSC [4] sup- humans, as we are interested in a congenic our query to enzymes only, and ‘eukaryota’ ports text-based batch queries to the UCSC disorder in this pathway. To ask this Genome Browser, limiting the output to along with ‘Homo sapiens’under‘SPE- question without an integrated data min- CIES’ (Species and Proteome Name, entries meeting the selected criteria. A ing tool, one would have to start with disadvantage of this tool is that users need respectively). This will give a count of Reactome to find enzymes involved in 257 in the secondary dataset. The to be familiar with the underlying database reaction pathways in human and then schema in order to know where their data is genome location can be displayed in the compare those enzymes to a list of entries output by choosing the following Attri- stored. Similarly, performing complex que- in UniProt. However, BioMart allows us ries might require multiple steps that can be butes:‘‘Genome component name’’ for the to join the two databases. chromosome, ‘‘Start Position’’ and ‘‘End burdensome with this tool. Galaxy [5] We can start our query by clicking on provides a set of tools that can retrieve data Position’’ for the coordinates. Click ‘MartView’ from the Web interface at from the Table Browser (Table Browser ‘Results’ for the table in Figure 2. http://www.biomart.org, and selecting and BioMart will be explained below), Now you have a list of enzymes in the Reactome database. Now, select the facilitating complex queries that require UniProt involved in carbohydrate metab- reaction dataset. Filters applied will be multiple joins (Figure 2). olism in humans. simply ‘Limit to Species’ Homo sapiens. BioMart provides a query-oriented data Attributes can be selected as ‘‘Reaction management system to interact with differ- BioConductor name’’ and ‘‘Gene ENSEMBL ID’’. At this ent datasets (Ensembl [2], RGD [6,7], and stage, 2,432 entries meet our criteria (i.e. BioConductor is open source software WormBase [8], among many others). This we have asked for all human reaction for the analysis of genomic data. It is based data ‘‘warehouse’’ was originally developed for Ensembl, creating EnsMart [9,10]. From there, it was first deployed across the European Bioinformatics Institute Citation: Ferna´ndez-Sua´rez XM, Birney E (2008) Advanced Genomic Data Mining. PLoS Comput Biol 4(9): (EBI), and now it has become a joint project e1000121. doi:10.1371/journal.pcbi.1000121 between EBI and Cold Spring Harbor Editor: Fran Lewitter, Whitehead Institute, United States of America Laboratory (CSHL). The generic query Published September 26, 2008 system has shifted toward a federated approach that has been deployed for several Copyright: ß 2008 Ferna´ndez-Sua´rez, Birney. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in biological databases, and has become a any medium, provided the original author and source are credited. component of the Generic Model Organ- Funding: The Ensembl project receives primary funding from the Wellcome Trust. Additional funding is ism Database (GMOD) project. provided by EMBL, NHGRI, NIH-NIAID, BBSRC, MRC, and the European Union. BioMart receives additional In this contribution, we provide some funding from the EMBRACE project (FP6 Programme, European Commission, contract number LHSG-CT-2004- solutions for data mining; we focus 512092). Ensembl supports BIOMART. on advanced ways of interacting with Competing Interests: The authors have declared that no competing interests exist. BioMart using other applications to re- * E-mail: [email protected] PLoS Computational Biology | www.ploscompbiol.org 1 September 2008 | Volume 4 | Issue 9 | e1000121 Figure 1. Diagram depicting the way different applications interact with data mining tools. The boxed area shows the topics discussed in this paper. BioMart’s Central Server coordinates access to different databases: on the left-hand side, Galaxy interacts with BioMart by means of URL- based queries, allowing interaction with the UCSC Genome Browser and its Table Browser; while on the right interaction with Ensembl and its API is shown. Other platforms such as BioConductor rely on Web services (MartService) to retrieve information from the BioMart system. Additionally, BioMart is compliant with the DAS protocol [34] and can be queried by means of a Perl API, biomart-perl, as well as a Java API, martj. doi:10.1371/journal.pcbi.1000121.g001 on the R language [16] (which is an packages/release/Software.html. The bio- database for Caenorhabditis elegans and related implementation of the S language, a maRt package provides an API (Application nematodes RGD [6,7]; and Reactome [13], statistical programming language original- Programming Interface) in the scripting a curated knowledgebase of biological ly developed at Bell Laboratories to language R, allowing interaction with pathways, amongst others. support research and data analysis of large biomaRt databases. These include Ensembl, R can be installed on different plat- statistical projects [17]). which produces and maintains automatic forms; there are binaries available for R is an integrated software environment annotation on selected eukaryotic genomes; Unix, Windows, and Macintosh. For a list for data manipulation, which can be used as VEGA [18], the manually annotated Ver- of the Comprehensive R Archive Network a statistics system (throughout many differ- tebrate Genome Annotation database; (CRAN), go to http://cran.r-project.org/ ent packages). There are a large number of dbSNP [19], the Single Nucleotide Poly- mirrors.html. Once obtained, the source biologically relevant modules in BioCon- morphism database of NCBI; Gramene should be unpacked and installed follow- ductor, some of which are described in [12] [20], a resource for comparative grass ing the instructions provided. There is a and at http://www.bioconductor.org/ genomics; WormBase [9], the canonical built-in help facility invoked with help, Figure 2. BioMart can join different datasets, in this case Reactome and UniProt to identify enzymes involved in carbohydrate metabolism. doi:10.1371/journal.pcbi.1000121.g002 PLoS Computational Biology | www.ploscompbiol.org 2 September 2008 | Volume 4 | Issue 9 | e1000121 Table 1. URLs for additional information. BioMart documentation http://www.biomart.org/user-docs.pdf BioMart tutorials http://www.ensembl.org/info/using/website/tutorials/ http://www.ensembl.org/common/Workshops_Online BioMart Central Server http://www.biomart.org/ BioConductor http://www.bioconductor.org/ http://www.bioconductor.org/packages/release/Software.html R http://www.r-project.org/ Installing R http://cran.r-project.org/doc/manuals/R-admin.html R Archive Network http://cran.r-project.org/mirrors.html biomaRt documentation http://www.bioconductor.org/packages/1.8/bioc/vignettes/biomaRt/inst/doc/biomaRt.pdf Galaxy http://main.g2.bx.psu.edu/ doi:10.1371/journal.pcbi.1000121.t001 for instance ‘help(debug)’ will provide With biomaRt installed, load the relevant Below are two commands to query the documentation about the debug function. library with the library(biomaRt) library to see the currently available marts Furthermore, following the installation of command, and then connect to any public and datasets on the central server a package, a pre-built help search index is BioMart database. The listMarts function (Figure 3). created. To know what commands are will show which BioMart services are avail- 4library(biomaRt) available in biomaRt use ‘help.