Human Genome Center [150-177.Pdf]
Total Page:16
File Type:pdf, Size:1020Kb
150 Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野 シークエンスデータ情報処理分野 Professor Minoru Kanehisa, Ph. D. 教 授 理学博士 金 久 實 Research Associate Toshiaki Katayama, M. Sc. 助 手 理学修士 片山俊明 Research Associate Shuichi Kawashima, M. Sc. 助 手 理学修士 川島秀一 Lecturer Tetsuo Shibuya, Ph. D. 講 師 理学博士 渋谷哲朗 Research Associate Michihiro Araki, Ph. D. 助 手 薬学博士 荒木通啓 Owing to continuous developments of high-throughput experimental technologies, ever-increasing amounts of data are being generated in functional genomics and proteomics. We are developing a new generation of databases and computational technologies, beyond the traditional genome databases and sequence analysis tools, for making full use of such large-scale data in biomedical applications, espe- cially for elucidating cellular functions as behaviors of complex interaction systems. 1. Comprehensive repository for community to make the system consistent with the existing genome annotation open standards. Thus, the contents of the KEGG DAS server can be accessed programatically by Toshiaki Katayama and Minoru Kanehisa the DAS protocol and graphically in a web browser using GBrowse. BioDAS, which is an KEGG DAS is a DAS (Distributed Annotation XML over HTTP data retrieving protocol, en- System) service for all organisms in the ables the user to write various kinds of auto- GENOME and GENES databases in KEGG mated programs for analyzing genome se- (Kyoto Encyclopedia of Genes and Genomes). It quences and annotations. For example, by com- started as part of the standardization efforts bining KEGG DAS with KEGG API, a program along with KEGG API and KGML, which are a to retrieve upstream sequences of a given set of SOAP based KEGG web service and XML repre- genes with similar expression patterns on the sentation of KEGG pathways, respectively. In same pathway can be written very easily. Fur- addition to the genome sequences, the KEGG thermore, GBrowse provides a graphical web in- DAS server also provides KEGG annotations for terface for the KEGG DAS server to browse, each gene linked to the PATHWAY and search, zoom and visualize a particular region LIGAND databases, as well as the SSDB data- of the genome. Thanks to the powerful feature base containing paralog, ortholog and motif in- of the GBrowse, users are also able to add their formation. We have been developing the server own annotations onto the KEGG DAS view by basedonopensourcesoftwareincluding uploading their data. This enables researchers to BioRuby, BioPerl, BioDAS and GMOD/GBrowse annotate the genome by themselves, and by 151 sharing the annotations with the community ing the KEGG system, such as searching and they can continuously refine the annotation. computing biochemical pathways in cellular This is so-called “community annotation.” processes or analyzing the universe of genes in KEGG DAS is available at http://das.hgc.jp/. the completely sequenced genomes. The users access the KEGG API server using the SOAP 2. Automatic assignments of orthologs and technology over the HTTP. The SOAP server paralogs in complete genomes also comes with WSDL, which makes it easy to build a client library for a specific computer lan- Toshiaki Katayama and Minoru Kanehisa guage. This enables the users to write their own programs for many different purposes and to In addition to the human intensive efforts in automate the procedure of accessing the KEGG the community databases, we are developing a API server and retrieving the results. Over the computational method for automating genome first year after the initial release of the KEGG annotations. It is based on a graph analysis of API, we have received numerous feedbacks be- the KEGG SSDB database, containing sequence yond our expectations. Making use of these sug- similarity relations among all the genes in the gestions, we are continuously developing the completely sequenced genomes. The nodes of KEGG API to refine the interface to the KEGG the SSDB graph are genes (currently about system. Recent key changes include the follow- 680,000 genes in over 200 genomes) and the ing: (1) Methods for retrieving the pathway in- edges are the Smith-Waterman sequence similar- formation have been re-organized for consis- ity scores computed by the SSEARCH program tency. (2) Wrapper methods for the DBGET/ (currently over 300 million edges above the LinkDB system have been fully incorporated to threshold score of 100). The edges are not only enable complicated database search. DBGET/ weighted but also directed, indicating the best LinkDB serves as a database retrieval system be- (top-scoring) hit when a gene in an organism is hind the KEGG system. (3) Several new meth- compared against all genes in another organism. ods (e.g. methods for retrieving a set of ho- Thus, a highly connected cluster of nodes con- molog genes) have been added. (4) To control taining a number of bidirectional best hits might the number of results, ‘start’ and ‘max_results’ be considered an ortholog cluster consisting of arguments were introduced to the methods that functionally identical genes. Such a cluster can return vast amounts of results at once. The be found by our heuristic method for finding KEGG API is available at http://www.genome. “quasi-cliques”, but the SSDB graph is too large jp/kegg/soap/. to perform quasi-clique finding at a time. There- fore, we introduce a hierarchy (evolutionary re- 4. High performance database entry retrieval lationship) of organisms and treat the SSDB system graph as a nested graph. The automatic decom- position of the SSDB graph into a set of quasi- Kazutomo Ushijima, Chiharu Kawagoe, Toshi- cliques results in the KEGG OC (Ortholog Clus- aki Katayama, Shuichi Kawashima, Hideo ter) database, which is regularly updated and Bannai, Kenta Nakai, Minoru Kanehisa made avaiable at http://www.genome.jp/kegg- bin/OC/count/lookup. Recently, the number of entries in biological databases is exponentially increasing year by 3. SOAP/WSDL interface for the KEGG sys- year. For example, there were 10,106,023 entries tem in the GenBank database in the year 2000, which has now grown to 43,918,359 (Release 144). Al- Shuichi Kawashima, Toshiaki Katayama and though we have provided a simple entry re- Minoru Kanehisa trieval system covering several major databases for several years at the Human Genome Center, We have started developing a new web serv- searches have come to require quite a long time. ice, named KEGG API, to facilitate usability of Therefore, in order to continue to utilize the the KEGG system. KEGG is a suite of databases rapidly growing databases, it has become a criti- and associated software, integrating our current cal issue to develop a new data retrieval system knowledge of molecular interaction networks in that can scan the whole database at a high biological processes (PATHWAY database), the speed. Furthermore, as the number of matched information about the universe of genes and entries by a simple keyword search increases proteins (GENES/SSDB databases), and the in- along with the growth of the database size, a formation about the universe of chemical com- new mechanism is needed to narrow down the pounds and reactions (LIGAND database). The results by applying a field specific search. To KEGG API provides valuable means for access- meet these needs, we have developed a new 152 high performance database entry retrieval sys- ous types of biological functions of given com- tem, named HiGet. The HiGet system is con- pounds. However, exisiting structure compari- structed on the HiRDB, a commercial ORDBMS son methods for organic compounds, not to say (Object-oriented Relational Database Manage- methods for similarity search on chemical struc- ment System) developed by Hitachi, Ltd., and is ture databases, are not well-suited for under- publicly accessible at http://higet.hgc.jp/. standing biological functions. Hence we are de- HiGet can execute full text search on various veloping new algorithms and data structures for biological databases. In addition to the original indexing and/or searching structures of various plain format, the system contains data in the kinds, including compounds, polyketides, nonri- XML format in order to provide a field specific bosomal peptides, and proteins. For polyketides search facility. When a complicated search con- and nonribosomal peptides, we already devel- dition is issued to the system, the search proc- oped algorithms for searching enzymes that are essing is executed efficiently by combining sev- used to synthesize them. For proteins, we are eral types of indices to reduce the number of re- developing a new data structure based on geo- cords to be processed within the system. Cur- metric hashing and suffix trees to achieve fast rent searchable databases are GenBank, UniProt, search for structurally similar proteins. Prosite and OMIM. We are planning to include other valuable databases (such as PDB and Ref- 6. Chemogenomic dissection of the biosyn- Seq) and also planning to develop an interdata- thetic process of medicinal natural prod- base search interface and a complex search facil- ucts ity combining keyword search and sequence similarity search. Michihiro Araki, Tetsuo Shibuya, and Minoru Kanehisa 5. Development of algorithms for searching similar patterns in molecular structures Medicinal natural products are enzymically synthesized as secondary metabolites for specific Tetsuo Shibuya, Michihiro Araki and Minoru biological purposes, which have been the major Kanehisa sources of bioactive compounds with diverse pharmacological activities. In order to make full Finding similar patterns in biological se- use of the potential of natural products as re- quences of proteins and DNAs is a critical step search tools as well as drug leads on the context in identifying biological functions of these mole- of metabolic engineering, it is of great impor- cules. Algorithms to search sequence similarities tance to understand the biosynthetic strategies were developed in conjunction with the avail- with the integrative computational analyses of ability of DNA and protein databases.