BIOINFORMATIC DATABASES Sequence, Phylogenetic, Structure and Pathway, and Micro- Array Databases

B BIOINFORMATIC DATABASES sequence, phylogenetic, structure and pathway, and micro- array databases. It highlights features of these databases, At some time during the course of any bioinformatics pro- discussing their unique characteristics, and focusing on ject, a researcher must go to a database that houses bio- types of data stored and query facilities available in the logical data. Whether it is a local database that records databases. The article concludes by summarizing impor- internal data from that laboratory’s experiments or a tant research and development challenges for these data- public database accessed through the Internet, such as bases, namely knowledge discovery, large-scale knowledge NCBI’s GenBank (1) or EBI’s EMBL (2), researchers use integration, and data provenance problems. For further biological databases for multiple reasons. information about these databases and access to all hyper- One of the founding reasons for the fields of bioinfor- links presented in this article, please visit http:// matics and computational biology was the need for manage- www.csam.montclair.edu/~herbert/bioDatabases.html. ment of biological data. In the past several decades, biological disciplines, including molecular biology and bio- SEQUENCE DATABASES chemistry, have generated massive amounts of data that are difficult to organize for efficient search, query, and Genome and protein sequence databases represent the analysis. If we trace the histories of both database devel- most widely used and some of the best established biological opment and the development of biochemical databases, we databases. These databases serve as repositories for wet- see that the biochemical community was quick to embrace lab results and the primary source for experimental results. databases. For example, E. F. Codd’s seminal paper, ‘‘A Table 1 summarizes these data repositories and gives their Relational Model of Data for Large Shared Data Banks’’ (3), respective URLs. published in 1970 is heralded as the beginning of the relational database, whereas the first version of the Protein GenBank, EMBL, and the DNA Databank of Japan Data Bank (PDB) was established at Brookhaven National Laboratories in 1972 (4). The most widely used biological data bank resource on the Since then, especially after the launching of the human World Wide Web is the genomic information stored in genome sequencing project in 1990, biological databases the U.S.’s National Institutes of Health’s GenBank, the have proliferated, most embracing the World Wide Web European Bioinformatics Institutes’ EMBL, and Japan’s technologies that became available in the 1990s. Now there National Institute of Genetics DNA Databank of Japan (1, are hundreds of biological databases, with significant 2, 7). Each of these three databases was developed sepa- research efforts in both the biological as well as the data- rately, with GenBank and EMBL launching in 1980 (4). base communities for managing these data. There are Their collaboration started soon after their development, conferences and publications solely dedicated to the topic. and DDBJ joined the collaboration shortly after its creation For example, Oxford University Press dedicates the first in 1986. The three databases, under the direction of the issue of its journal Nucleic Acids Research (which is freely International Nucleotide Sequence Database Collabora- available) every year specifically to biological databases. tion (INSDC), gather, maintain, and share mainly nucleo- The database issue is supplemented by an online collection tide data, each catering to the needs of the region in which it of databases that listed 858 databases in 14 categories in is located (4). 2006 (5), including both new and updated ones. Biological database research now encompasses many The Ensembl Genome Database topics, such as biological data management, curation, qual- The Ensembl database is a repository of stable, automati- ity, integration, and mining (6). Biological databases can be cally annotated human genome sequences. It is available classified in many different ways, from the topic they cover, either as an interactive website or downloadable as flat to how heavily annotated they are or which annotation files. Ensembl annotates and predicts new genes, with method they employ, to how highly integrated the database annotation from the InterPro (9) protein family databases is with other databases. Popularly, the first two categories and with additional annotations from databases of genetic of classification are used most frequently. For example, disease [OMIM (10)], expression [SAGE (11,12)] and gene there are archival nucleic acid data repositories [GenBank, family (13). As Ensembl endeavors to be both portable and the EMBL Data Library, and the DNA Databank of Japan freely available, software available at Ensembl is based on (7)] as well as protein sequence motif/domain databases, relational database models (14). like PROSITE (8), that are derived from primary source data. GeneDB Database Modern biological databases comprise not only data, but also sophisticated query facilities and bioinformatic data GeneDB (15) is a genome database for prokaryotic and analysis tools; hence, the term ‘‘bioinformatic databases’’ is eukaryotic organisms. It currently contains data for 37 often used. This article presents information on some pop- genomes generated from the Pathogen Sequencing Unit ular bioinformatic databases available online, including (PSU) at the Welcome Trust Sanger Institute. The GeneDB 1 Wiley Encyclopedia of Computer Science and Engineering, edited by Benjamin Wah. Copyright # 2008 John Wiley & Sons, Inc. 2 BIOINFORMATIC DATABASES Table 1. Summary of Genome and Protein Sequence Databases Database URL Feature GenBank http://www.ncbi.nlm.nih.gov/ NIH’s archival genetic sequence database EMBL http://www.ebi.ac.uk/embl/ EBI’s archival genetic sequence database DDBJ http://www.ddbj.nig.ac.jp/ NIG’s archival genetic sequence database Ensembl http://www.ensembl.org/ Database that maintains automatic annotation on selected eukaryotic genomes GeneDB http://www.cebitec.uni-bielefeld.de/ Database that maintains genomic information groups/brf/software/gendb_info/ about specific species related to pathogens TAIR http://www.arabidopsis.org/ Database that maintains genomic information about Arabidopsis thaliana SGD http://www.yeastgenome.org/ A repository for baker’s yeast genome and biological data dbEST http://www.ncbi.nlm.nih.gov/dbEST/ Division of GenBank that contains expression tag sequence data Protein Information http://pir.georgetown.edu/ Repository for nonredundant protein sequences and Resource (PIR) functional information Swiss-Prot/TrEMBL http://www.expasy.org/sprot/ Repository for nonredundant protein sequences and functional information UniProt http://www.pir.uniprot.org/ Central repository for PIR, Swiss-Prot, and TrEMBL database has four key functionalities. First, the database sequence repository (17), but a collection of DNA and stores and frequently updates sequences and annotations. protein sequences from existing databases [GenBank (1), Second, GeneDB provides a user interface, which can be EMBL (2), DDBJ (7), PIR (18), and Swiss-Prot (19)]. It used for access, visualization, searching, and downloading organizes the sequences into datasets to make the data of the data. Third, the database architecture allows inte- more useful and easily accessible. gration of different biological datasets with the sequences. Finally, GeneDB facilitates querying and comparisons dbEST Database between species by using structured vocabularies (15). dbEST (20) is a division of GenBank that contains sequence data and other information on short, ‘‘single-pass’’ cDNA The Arabidopsis Information Resource (TAIR) sequences, or Expressed Sequence Tags (ESTs), generated TAIR (16) is a comprehensive genome database that allows from randomly selected library clones (http://www.ncbi. for information retrieval and data analysis pertaining to nlm.nih.gov/dbEST/). dbEST contains approximately Arabidopsis thaliana (a small annual plant belonging to 36,843,572 entries from a broad spectrum of organisms. the mustard family). Arabidopsis thaliana has been of Access to dbEST can be obtained through the Web, either great interest to the biological community and is one of from NCBI by anonymous ftp or through Entrez (21). The the few plants whose genome is completely sequenced (16). dbEST nucleotide sequences can be searched using the Due to the complexity of many plant genomes, Arabidopsis BLAST sequence search program at the NCBI website. thaliana serves as a model for plant genome investigations. In addition, TBLASTN, a program that takes a query amino The database has been designed to be simple, portable, and acid sequence and compares it with six-frame translations efficient. One innovate aspect of the TAIR website is of dbEST DNA sequences can also be useful for finding MapViewer (http://www.arabidopsis.org/servlets/mapper). novel coding sequences. EST sequences are available in the MapViewer is an integrated visualization tool for viewing FASTA format from the ‘‘/repository/dbEST’’ directory at genetic, physical, and sequence maps for each Arabidopsis ftp.ncbi.nih.gov. chromosome. Each component of the map contains a hyper- link to an output page from the database that displays all The Protein Information Resource the information related to this component (16). The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic and SGD:

BIOINFORMATIC DATABASES Sequence, Phylogenetic, Structure and Pathway, and Micro- Array Databases

Bioinformatics 1

13 Genomics and Bioinformatics

Use of Bioinformatics Resources and Tools by Users of Bioinformatics Centers in India Meera Yadav University of Delhi, [email protected]

Bioinformatics Study of Lectins: New Classification and Prediction In

1 Supporting Information for a Microrna Network Regulates

A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

In Silico Analysis of Single Nucleotide Polymorphisms

HMMER User's Guide

Apply Parallel Bioinformatics Applications on Linux PC Clusters

Supporting Information

Impact of the Protein Data Bank Across Scientific Disciplines.Data Science Journal, 19: 25, Pp

HMMER User's Guide