Overview of Biological Database Mapping Services for Interoperation Between Different ‘Omics’ Datasets Shweta S

SOFTWARE REVIEW Overview of biological database mapping services for interoperation between different ‘omics’ datasets Shweta S. Chavan,1,2 John D. Shaughnessy Jr2 and Ricky D. Edmondson2 1University of Arkansas Little Rock (UALR)/University of Arkansas Medical Sciences (UAMS) Joint Bioinformatics Program, AR 72204, USA 2Myeloma Institute for Research and Therapy, UAMS, Little Rock, AR 72205, USA *Correspondence to: Tel: þ1 501 454 2764; Fax: þ1 501 686 6442; E-mail: [email protected] Date received: 17th May 2011 Abstract Many primary biological databases are dedicated to providing annotation for a specific type of biological molecule such as a clone, transcript, gene or protein, but often with limited cross-references. Therefore, enhanced mapping is required between these databases to facilitate the correlation of independent experimental datasets. For example, molecular biology experiments conducted on samples (DNA, mRNA or protein) often yield more than one type of ‘omics’ dataset as an object for analysis (eg a sample can have a genomics as well as proteomics expression dataset available for analysis). Thus, in order to map the two datasets, the identifier type from one dataset is required to be linked to another dataset, so preventing loss of critical information in downstream analysis. This identifier mapping can be performed using identifier converter software relevant to the query and target identifier databases. This review presents the publicly available web-based biological database identifier converters, with comparison of their usage, input and output formats, and the types of available query and target database identifier types. Keywords: omics, genomics, proteomics, identifier mapping, biological database identifier converter Introduction interoperation between databases. Thus, enhanced mapping between these databases is required to Many primary biological databases are dedicated to facilitate the correlation of independent experimen- providing annotation for a specific type of biologi- tal datasets, which can be provided by Identifier cal molecule such as a clone, transcript, gene or (Id) mapping services. protein (eg the National Center for Biotechnology Id mapping services are tools to connect one Information [NCBI] Entrez Gene1 database pro- type of database Id to the corresponding Id in vides annotation for genes, whereas UniprotKB2,3 another database. This mapping includes three provides this for proteins). Other types of second- types of relationships: one-to-one, one-to-many ary databases provide relevant information about and many-to-many. One-to-many and the attributes of these molecules, such as pathway, many-to-many relationships are required to account function(s) or structural information (eg the Kyoto for biological processes such as alternative splicing, Encyclopedia of Genes and Genomes [KEGG]4 resulting in one gene giving rise to multiple tran- and Protein Data Bank [PDB].5 Often, these data- scripts, the presence of several isoforms of a single bases provide limited cross-references for protein and other similar processes occurring in a # HENRY STEWART PUBLICATIONS 1479–7364. HUMAN GENOMICS. VOL 5. NO 6. 703–708 OCTOBER 2011 703 SOFTWARE REVIEW Chavan et al. cell. Also, gene expression datas such as microarray input; (ii) target output databases; (iii) species; (iv) datas are known to have multiple probes targeting a data sources—and therefore coverage of the genome single transcript and vice versa (eg Affymetrix6 or proteome of a particular species; (v) ease of use; probes, which can be described by many-to-many (vi) database update frequency; (vii) possible conver- relationships). Thus, mapping Ids of multiple data- sion types (eg protein-to-gene, gene-to-transcripts bases to one another facilitates the correlation of etc.); (viii) speed of conversion; (ix) a detailed help different types of ‘omics’ datasets which, in turn, section or tutorial describing the intended use of might provide meaningful insights into the biologi- the application; and (x) an algorithm for mapping cal processes occurring in a cell. database Ids. In general, these services establish Several Id mapping services are publicly available mapping links using existing cross-references or by (Table 1). The seven Id converters that are discussed using sequence alignment information to determine in detail in this review were selected to represent the the match. Fewer Id converters have their own majority of Id converters—as well as major biological published algorithm for establishing the mapping, in databases—used in high-throughput genomics and addition to using the existing cross-references. proteomics datasets. They have many common fea- Many of the Id converters provide a web-based tures, such as: (i) supporting one-to-many and many- intuitive user interface, generally having three com- to-many relationships; (ii) providing mapping counts, ponents: input types (ie query databases), output and the details of the database Ids used as a ‘bridge’ to types (ie target databases) and the species under link to target database Ids; (iii) providing a web-based consideration. One such web application is Clone/ graphical user interface (GUI) which allows sub- Gene ID Converter.7 It provides the option for mission of a single Id or a batch search for multiple several query and target databases for human, Ids; and (iv) the output database Ids are hyperlinked mouse and rat species. The output can be custo- to their original database for reference. mised by selecting from a number of output data- These Id converters also differ in a number of bases which are divided into several logical levels, ways, such as in the availability of (i) type of query such as gene, gene clone, protein and functional annotation. Further, detailed references are pro- Table 1. Link to web interface of various Id converters vided for the resultant output Ids by hyperlinking them to their original data sources. The output can Mapping services Link also be obtained in a spreadsheet or text format. Gene/Clone ID converter http://idconverter. A detailed list of input/output databases, availability bioinfo.cnio.es/ and other pertinent features can be found in Table ID mapping by UniProt http://www.uniprot.org/ 2. A useful piece of information provided on the ?tab=mapping interface itself includes the specific version that was MatchMiner http://discover.nci.nih. used as the data source for individual databases. gov/matchminer/ This is of importance, considering the frequent MatchMinerLookup.jsp updates of sequence databases and the increasing DAVID gene ID conversion http://david.abcc.ncifcrf. novel findings about the biological molecules in tool gov/conversion.jsp the respective research areas. Another similar type of Id converter is ID g:Convert http://biit.cs.ut.ee/ mapping hosted by UniProt.3 It supports almost all gprofiler/gconvert.cgi organisms, with monthly updates, and maps approxi- CRONOS http://mips.helmholtz- mately 90 database sources, including primary muenchen.de/genre/ sequence databases and secondary functional/ proj/cronos/index.html structural annotation databases. Thus, the input and bioDBnet:db2db http://biodbnet.abcc. output database option is divided into Uniprot, other ncifcrf.gov/db/db2db.php sequence database, three-dimensional structure, 704 # HENRY STEWART PUBLICATIONS 1479–7364. HUMAN GENOMICS. VOL 5. NO 6. 703–708 OCTOBER 2011 Overview of biological database mapping services Table 2. Comparison of various Id converters Features of Gene/Clone ID ID mapping MatchMiner DAVID gene ID g:Convert CRONOS bioDBnet:db2db mapping converter by UniProt conversion tool # services HENRY STEWART PUBLICATIONS 1479–7364. Interface Web-based Web-based Web-based Web-based Web-based Web-based Web-based GUI form GUI form GUI form GUI form, GUI form GUI form GUI form command line Output Html, text, Html, text Html, text, Html, text, Html, text, Html, email for Text, spreadsheet format spreadsheet spreadsheet spreadsheet spreadsheet, minimal batch mode (no header) Organisms Human, Human, mouse, Human, mouse Human, about Human, mouse, rat Human, mouse, A specified list could mouse, rat rat and many another 90,000 and 31 other rat, cow, dog, and not be found other species species Ensembl-supported fruit fly genome species HUMAN GENOMICS Input/output Clone Ids, GenBank, Affymetrix Ids, Affymetrix Id, Agilent Affymetrix, Agilent, Ensembl/FlyBase Affymetrix, Agilent, clone or Affymetrix Ids, EMBL, DDBJ GenBank Id, Illumina Id, CCDS Ids, Ensembl Transcript ID, EMBL, GenBank, RefSeq transcript GenBank Accession, EST, GenBank Accession, transcript, Illumina, Affymetrix, Agilent, Genomic, RefSeq Accession IMAGE Clone Id, Gene symbol, RefSeq DNA/ CCDS Nucleotide (Additional FISH-mapped GenPept Accession, Genomic output: EMBL)* BAC Clone Id NCBI GI, RefSeq . VOL 5. NO 6. 703–708 OCTOBER 2011 RNA/Genomic accession Input/output HUGO gene Entrez Gene, Gene Symbol Entrez gene Id, Ensembl Gene, Gene Name, Entrez Gene ID, gene names, Entrez HGNC, HUGO/Alias, Ensembl gene/ Entrez Gene, Ensembl/FlyBase Ensembl Gene ID, gene Ids, Ensembl Ensembl, Name, UniGene transcript Id, RefSeq mrna, Gene ID, GI, UniGene gene Ids, UniGene UniGene, Cluster Id, Entrez RefSeq mRNA UniGene GeneID, HGNC, cluster Ids, RefSeq TIGR (JCVI) Gene Id, RefSeq accession, RefSeq mRNA RNAs (Additional RNA UniGene Id output: CCDS)a Input/output RefSeq peptides, UniProtKB, RefSeq protein PIR accession, PIR Id, Ensembl Protein, Protein Name, UniProt Accession, protein SwissProt names RefSeq, PIR NREF Id, RefSeq IPI,

Overview of Biological Database Mapping Services for Interoperation Between Different ‘Omics’ Datasets Shweta S

Maternally Inherited Pirnas Direct Transient Heterochromatin Formation

The Biogrid Interaction Database

Flybase: Updates to the Drosophila Melanogaster Knowledge Base Aoife Larkin 1,*, Steven J

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

The Drosophila Melanogaster Transcriptome by Paired-End RNA Sequencing

Rnacentral: Progress in the Development of the Non-Coding RNA Sequence Database

Basic Bioinformatics

Flybase 2.0: the Next Generation Jim Thurmond1, Joshua L

Phylogenetic Profiling of the Arabidopsis Thaliana Proteome

Arabidopsis Thaliana Functional Genomics Project Annual Report 2008

Bioinformatics Approaches for Functional Predictions in Diverse Informatics Environments. Paula Maria Moolhuijzen, Bsc This Thes

Abstracts Selected for Platform Presentation