KEGG and DBGET/LinkDB: Integration of Biological Relationships in Divergent Molecular Biology Data

Wataru† Fujibuchi (wataru@kuicr.kyoto-u.ac.jp)
Kazushige Sato (kazs@scl.kyoto-u.ac.jp)
Hiroyuki Ogata (ogata@kuicr.kyoto-u.ac.jp)
Susumu Goto (goto@scl.kyoto-u.ac.jp)
Minoru Kanehisa (kanehisa@kuicr.kyoto-u.ac.jp)

Institute for Chemical Research, Kyoto University
Uji, Kyoto 611-0011, Japan

Abstract

A simple formulation to integrate various biological data is presented based on the concept of links, which are classified into three types: factual, similarity, and biological. Factual links are cross-reference information of entries among molecular biology databases. Similarity links are neighbor information of sequence entries computed by sequence similarity search programs. Biological links are the novel and powerful relations that are being organized in KEGG, including molecular interactions on the metabolic and regulatory pathways, physical closeness of genes on the genome, and the rest of binary relations in which divergent biological phenomena can be accommodated. DBGET/LinkDB was originally designed to handle factual and similarity links in the existing molecular biology databases, but it is now being extended to include biological links as well. The process of logical reasoning, for example, for functional assignment of newly sequenced genes, can often be decomposed into a sequence of combining different types of links; thus, we expect it can be automated by the extension of the DBGET/LinkDB system and the database efforts of KEGG.

Introduction

The power of making links is evident not only for navigation in the World Wide Web, but also for integration of knowledge in molecular biology that is based on various types of biological data. A tightly-coupled database, such as a relational database containing all types of data, is reliable in its framework and retrieval ability. However, because of its rigidity, it cannot follow the rapid changes of database formats and contents. In contrast, a loosely-coupled approach that combines any database at the level of its entry has been successful in a number of practical systems, such as our DBGET/LinkDB (Fujibuchi et al. 1997) as well as Entrez (Schuler et al. 1996) and SRS (Etzold & Argos 1993).

The links provided by the original databases as cross-references are useful for retrieving related entries in other databases, especially in the Web-based systems. In addition, the sequence neighbors computed by sequence similarity search programs such as BLAST and FASTA are extremely useful for finding functionally related sequences. However, we consider that the knowledge of biological relationships, such as the information of biological partners in molecular pathways and molecular assemblies, is under-represented in the existing molecular biology databases. Such information may occasionally exist in text as comments, but it is not readily utilizable for computation.

Under the project named KEGG, Kyoto Encyclopedia of Genes and Genomes (Kanehisa 1997a), we are computerizing biological relationships of molecular interactions and genetic interactions in living cells. Our aim is to integrate such biological knowledge together with the cross-references and similarity neighbors based on the common concept of links between two items, or what we call "binary relations". Here we report the current status of the DBGET/LinkDB system, the KEGG databases, and the KEGG computational tools as well as our plan for future extensions.

Integration of Public Databases

Data Representation

In DBGET a database is simply considered a collection of entries stored in a flat file or multiple flat files. Our definition of flat files includes text files and other multimedia files such as GIF files for the pathway diagrams. Because an entry can be uniquely identified by an entry name or an accession number in each database, any entry of a flat file database can be retrieved uniformly in DBGET by the combination of the "database name" and the entry "identifier". The cross-reference data among a number of molecular biology databases can be represented as:

    database1:identifier1 -> database2:identifier2
This binary relation is the basis of our LinkDB system and its extension:

    organism:gene1 -> organism:gene2

is the basis of representing biological relationships in KEGG.

†Permanent Address: Nihon Silicon Graphics Cray K.K., Umeda, Kita-ku, Osaka 530-0001, Japan
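The two binary relations above share one shape, a pair of "name:identifier" items joined by an arrow. The following is a minimal sketch of how such relations could be parsed and stored. It is not the LinkDB implementation: the function names, the in-memory table, and the example entries are all hypothetical and serve only to illustrate the representation.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    """An item addressed as 'database:identifier' (or 'organism:gene')."""
    database: str
    identifier: str


def parse_entry(text: str) -> Entry:
    # Split on the first colon only, so an identifier may itself contain colons.
    database, identifier = text.strip().split(":", 1)
    return Entry(database, identifier)


def parse_link(line: str) -> Tuple[Entry, Entry]:
    # A binary relation written as 'source -> target'.
    source, target = line.split("->")
    return parse_entry(source), parse_entry(target)


def build_link_table(lines):
    # Map each source entry to the set of entries it links to.
    table = defaultdict(set)
    for line in lines:
        source, target = parse_link(line)
        table[source].add(target)
    return table


# Hypothetical example data, not taken from any real database release.
example_lines = [
    "swissprot:P00001 -> embl:X00001",
    "swissprot:P00001 -> pdb:1ABC",
    "eco:geneA -> eco:geneB",
]
links = build_link_table(example_lines)
print(sorted(links[Entry("swissprot", "P00001")]))
```

Because cross-references between databases and relations between genes use the same pair structure in this sketch, a single table type can hold both, which is the point of treating every link as a binary relation.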

Field Indexing

One of the advantages of the DBGET system is that it uses the original database files as they are, requiring less disk space for storage and less computation time for daily updates. In order to accomplish rapid access and search of entries, a small number of auxiliary files are created by the indexing program seqnew during the update procedure, as shown in Table 1.

Table 1: The list of auxiliary files created by seqnew.

    filename      file type   contents
    db.pag        dbm         hash table by entry and accession keys for the position offset and the size of each entry
    db.acc        flat        primary and secondary accessions of each entry
    db.tit        flat        title or definition field of each entry
    db.tit.pag    dbm         hash table by entry and accession keys for db.tit
    db.ref        flat        references in each entry
    db.ant        flat        authors in each entry
    db.lnk+.pag   dbm         hash table by entry keys for original links
    db.lnk-.pag   dbm         hash table by entry keys for reverse links

Some of them are hashed by the GNU database manager library. Extracting specific keys and field data from various databases requires lexicographical analysis routines dedicated to individual databases. We have developed a C++ class library for parsing a wide range of database formats by utilizing the advantages of the C++ language, such as the inheritance of base class members.

Network-Distributed Database

In DBGET, the databases do not all have to exist on one server but may reside on several servers in the network. Thus, DBGET can be used as a network-distributed database system as shown in Figure 1.

Figure 1: DBGET can be configured as a network-distributed database system.

The network configuration of multiple databases is defined in the dbtab table file. The example shown in Table 2 describes that genbank is retrieved from the server at dbget.genome.ad.jp, while embl is taken from the server of IP address 133.103.97.7 with port number 3266. The user can incorporate his/her own databases into DBGET as shown on the last line of the Table, which specifies that the database named mydb in the genbank format is taken from the user's directory '/usr/local/db'.

Table 2: An example of network configuration defined in dbtab.

    database   dbtype    dbsite
    genbank    genbank   @dbget.genome.ad.jp
    embl       embl      @133.103.97.7:3266
    mydb       genbank   /usr/local/db/mydbdir

Computation of Links

LinkDB is a repository of all cross-reference information among a number of databases. It contains original links extracted from each database, and also reverse links and indirect links that are computed from original links. Depending on the characteristics of individual databases, the routes of computing indirect links are defined in the linktab file. For instance, OMIM does not have original cross-references to external databases, though it has internal references. In contrast, SWISS-PROT has rich cross-references to other databases, including EMBL, PROSITE, PIR, PDB, Medline, and OMIM. Therefore, OMIM can go out to almost all other databases by defining in the linktab table the route of first using the reverse link to SWISS-PROT.

Conceptually, there are three types of links that are to be utilized in LinkDB:

• Factual Links: links provided by the original databases, e.g., cross-reference of Medline ID and GenBank accession.
• Similarity Links: links produced by similarity search, e.g., the result of BLAST, FASTA and MOTIF programs.
• Biological Links: links by biological relationships, e.g., molecular or genetic interactions in the KEGG pathways.

LinkDB incorporates all three types of links to eventually perform integrated retrieval for biological reasoning, such as functional assignment of newly sequenced genes.

DBGET/LinkDB on GenomeNet

Under the Japanese GenomeNet database service (Kanehisa 1997b) DBGET/LinkDB currently
supports 18 databases and contains links to Medline. Table 3 shows a list of those databases maintained in DBGET/LinkDB, where all the major databases are updated daily or weekly. PATHWAY, LIGAND, GENES, and BRITE are the products of the KEGG project, in which biological knowledge of molecular interactions and pathways is being organized.

Table 3: The DBGET databases on GenomeNet.

    Content                          Database names
    nucleic acid sequences           *GenBank, *EMBL
    protein sequences                *SWISS-PROT, PIR, PRF, *PDBSTR
    3D structures                    *PDB
    sequence motifs                  PROSITE, EPD, TRANSFAC
    enzyme reactions                 *LIGAND
    metabolic/regulatory pathways    *PATHWAY
    regulatory relations             BRITE
    amino acid mutations             PMD
    amino acid indices               AAindex
    genetic diseases                 *OMIM
    literature                       LITDB
    genes and genomes                *GENES

    Those marked by asterisks are updated daily or weekly.

Generic Naming

Figure 2 is a WWW interface to DBGET/LinkDB and KEGG, which are tightly integrated with each other. As indicated in the Figure, DBGET allows generic naming for a composite database, for example, "DNA" database for GenBank and EMBL, and "PROTEIN" database for PRF, PIR, SWISS-PROT, and PDBSTR. This generic naming is very useful in the daily updated databases, where the combination of the daily cumulative (e.g. genbank-upd) and the fixed release (e.g. genbank) is given the generic naming (e.g. genbank-today), which requires us to index only the cumulative portion every day in order to make the latest version of the database. It saves both the computation time and the disk space. Another useful feature of DBGET is aliasing for the databases, such as gb for genbank, and gbt for genbank-today.

Figure 2: DBGET/LinkDB/KEGG database links diagram.

Integration with Other Tools

Because of the loose-coupling integration, the DBGET/LinkDB system can easily be coupled with other Web-based search programs. In GenomeNet, for example, the results of similarity searches by BLAST and FASTA are directly linked to DBGET/LinkDB for retrieval of found entries and related entries, which helps the interpretation of biological meanings. Conversely, the entries retrieved by DBGET can be transferred to helper application programs, such as sequence similarity tools or 3D protein structure viewers.

Integration of Biological Knowledge

KEGG Databases

KEGG (Goto et al. 1996) is our attempt to describe, utilize, predict, and possibly design biological systems based on the genomic information. To accomplish this task, first, the current knowledge of biological functions in molecular and cellular biology is being computerized in terms of the information pathways that consist of interacting genes and molecules. This may be considered the process of making a functional catalog of living organisms. Second, gene catalogs being obtained by genome sequencing projects of different organisms are linked to individual components of the functional catalog. This may be considered the process of mapping a gene catalog to the functional catalog.

KEGG thus provides the linkage between the catalog of molecular components and the network of molecular interactions in living cells and organisms. The latter is organized as the PATHWAY database, which is the primary product of the KEGG project, and the former is organized in the GENES database taken from the existing databases (genome databases, GenBank, and SWISS-PROT) as well as in the LIGAND database of chemical compounds (Goto, Nishioka, & Kanehisa 1998).

The existing sequence databases are organized in such a way that the sequence is the core information with everything else as an attribute or an annotation to the sequence. The KEGG PATHWAY database is at a higher level of abstraction where the network of interacting genes or molecules is the core information. A gene or a molecule is the basic element that forms the network, just like a nucleotide or an amino acid is the basic element that forms the sequence. At the moment, most of the contents in the GENES database are automatically generated from a number of sources, but we
plan to incorporate information on interacting partners that are manually identified in the KEGG project.

KEGG Tools

The functional catalog is represented in KEGG by graphical pathway maps describing molecular interaction networks and by hierarchically organized texts describing classifications of molecules. The gene catalog is represented by graphical genome maps and by hierarchical texts of gene classifications. KEGG provides tools to manipulate these different types of information. The tools allow users to browse, navigate, search, compute, and even modify the information.

KEGG is available in two versions, WWW and local (CD-ROM) versions. For the WWW version of KEGG, we first wrote CGI scripts for browsing pathway map graphics and hierarchical texts. However, it was not possible to implement genome map graphics by the server program alone, for they need to be utilized not only for browsing but also for local manipulations. In the past, we developed genome map handling tools, such as Genomatica (Akiyama et al. 1994) and HyperGenome (Goto et al. 1994), under the client-server mechanism. From the user's point of view this mechanism requires downloading and installation of specialized client software, which often inhibits wide usage in the international genome research community. From the developer's point of view, the necessity of developing different client software for different platforms has also been a nightmare.

In order to solve these problems and to include more capabilities we have decided to use Java and developed a suite of Java applets for KEGG. Java has a potential of transforming the World Wide Web from a static information resource to be retrieved to a more dynamic resource to be computed. This is especially suited for KEGG, which may be considered as a deductive database in the sense that additional information is logically deduced from the stored information. Thus, different types of computations are required in KEGG, and Java has enabled us to implement both computation and graphics handling capabilities to be performed locally on the user's machine.

However, server side CGI programs still have advantages for the WWW version of KEGG. Because CGI programs are executed on the server, fewer system resources, such as memory and CPU power, are required on the user side client machine. This means that the CGI version tools, although their functionality may be limited, can be used more conveniently by most users. In order to accommodate varying requirements, KEGG tools are written in both CGI scripts and Java programs.

Java Map Browsers

KEGG contains two types of graphics data, pathway maps and genome maps, for which map browsers have been developed in Java. In the pathway map browser the user can easily switch to other organisms' maps or to other categories of maps by pop-up menus. Graphical objects on the map, such as a gene product or a compound, are linked to related information. These main features can also be implemented in CGI scripts and the clickable map function of WWW, but Java allows more useful interactive features to be added.

Figure 3: Pathway map browser.

For the genome map browser two versions were developed, one for completely sequenced genomes and the other for human chromosomes. The former consists of two viewer windows for circular genome maps and linear genome maps. For example, the E. coli chromosome is displayed in a circular genome map window, while the 16 S. cerevisiae chromosomes are displayed in a linear genome map window. Organisms that have both circular and linear plasmids, such as B. burgdorferi, are displayed in both circular and linear genome map windows.

Figure 4: Genome map browser (B. burgdorferi).

The genome map browser for human chromosomes is similar to the linear chromosome genome map viewer, but it has different functions to handle low resolution maps (Figure 5).

Each genome map viewer contains a zoom-up window and an overall view. The gene identifiers in the zoom-up window are clickable to retrieve the definition of the genes stored in the hierarchical gene catalog, and then additional information in the existing databases through the DBGET/LinkDB retrieval system. Conversely, each GENES database entry has a
link to the genome map view. Therefore, the user can easily navigate between genome maps and functional catalogs (Figure 6).

Figure 5: Genome map browser (H. sapiens).

Figure 6: Navigation through genome maps and functional catalogs.

Links among KEGG

An example of local handling of the genome map is the following. Suppose that one wishes to see if the tryptophan operon is conserved in E. coli and H. influenzae. Starting from the E. coli gene catalog of known operons, the genes are mapped on the functional catalog of the metabolic pathway, and then the corresponding H. influenzae genes are identified. All these queries are done on the server. The last step of examining if the H. influenzae genes are localized in the genome to possibly form an operon is done locally by searching and marking the genes on the genome map.

In addition to the PATHWAY database, the GENES database is a key database in KEGG. Any gene name in the KEGG pathway or gene catalog has a link to this database and this database has links to the existing molecular biology databases. We also plan to add information of molecular interactions and genetic interactions in the GENES database.

While the KEGG databases provide numerous predefined links that can be used to browse a number of different aspects of genes and genomes, KEGG can also be used as a search tool for finding additional links. For example, genes can be searched on the known pathways in KEGG by EC numbers, by gene names, by compound accessions, or by sequence similarity. Figure 7 shows an example of the search by gene names.

Figure 7: Links made by pathway search.

Biological Reasoning by Computing Links

In LinkDB the link search is configured in the route table; for example, the combination of original/reverse links to derive indirect links is defined here. This mechanism can be extended to specify how the different types of links are to be combined in biological reasoning, as shown in Table 4. In this example, the processing starts with a human sequence by similarity search against mouse sequences, finds interacting molecules in the KEGG pathways by biological links, and again by similarity search, finds human homologues of these interacting molecules.

Table 4: An example of defining integrated link search.

    database    route
    h.sapiens   +s(m.musculus):+b(pathway):+s(h.sapiens)

An illustration of this reasoning process is shown in Figure 8, where biological links in mouse can be utilized to identify human counterpart molecules.
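The route in Table 4 can be read as a chain of link lookups applied one after another. The sketch below illustrates that reading only. It assumes that "+s(...)" denotes a similarity link and "+b(...)" a biological link, which is inferred from the description of Table 4 rather than from any linktab documentation. The link tables, the gene identifiers, and the function names are hypothetical.

```python
import re

# Hypothetical toy link tables, keyed by (link type, target named in the route).
# 's' = similarity link, 'b' = biological link. The identifiers are illustrative.
LINKS = {
    ("s", "m.musculus"): {"h.sapiens:ADA": {"m.musculus:Ada"}},
    ("b", "pathway"): {"m.musculus:Ada": {"m.musculus:Adk"}},
    ("s", "h.sapiens"): {"m.musculus:Adk": {"h.sapiens:ADK"}},
}

# One route step looks like '+s(m.musculus)': a type letter and a target.
STEP = re.compile(r"\+([a-z])\(([^)]+)\)")


def parse_route(route):
    """Split '+s(x):+b(y)' into [('s', 'x'), ('b', 'y')]."""
    steps = []
    for part in route.split(":"):
        match = STEP.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"unrecognized route step: {part!r}")
        steps.append((match.group(1), match.group(2)))
    return steps


def follow_route(start, route, links=LINKS):
    """Apply each step in turn, carrying forward the set of reached entries."""
    current = {start}
    for step in parse_route(route):
        table = links.get(step, {})
        reached = set()
        for entry in current:
            reached |= table.get(entry, set())
        current = reached
    return current


route = "+s(m.musculus):+b(pathway):+s(h.sapiens)"
print(follow_route("h.sapiens:ADA", route))  # prints {'h.sapiens:ADK'}
```

Each step narrows or expands a set of entries, so an entry with no link at some step simply drops out and the final set may be empty. A real system would draw the three tables from similarity search results and pathway data instead of a literal dictionary.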

"adenosine-deaminase" is known to have a biological relation to "adenosine-kinase" in the metabolic pathway. In this example, starting with the "adenosine-deaminase" gene on human chromosome 20, the suggested genes of "adenosine-kinase" are marked on the human chromosome 10 by following the combination of similarity and biological links.

Figure 8: Biological link search in mouse may provide information about human counterpart molecules. (Label in figure: Biological Link: Adenosine deaminase <-> Adenosine kinase.)

References

Akiyama, Y.; Yakoh, T.; Mori, H.; and Ogasawara, N. 1994. A server-client version of Genomatica integrated genome information browser. In Proc. Genome Informatics Workshop 1995, 202-203.

Etzold, T., and Argos, P. 1993. SRS - an indexing and retrieval tool for flat file data libraries. Comput Appl Biosci 9:49-57.

Fujibuchi, W.; Migimatsu, H.; Uchiyama, I.; Ogiwara, A.; Akiyama, Y.; and Kanehisa, M. 1997. DBGET/LinkDB: an integrated database retrieval system. In Pacific Symposium on Biocomputing '98, 683-694.

Goto, S.; Kuhara, S.; Takagi, T.; and Kanehisa, M. 1994. Extension of the integrated database HyperGenome for genome maps and sequence information. In Proc. Genome Informatics Workshop 1994, 204-205.

Goto, S.; Bono, H.; Ogata, H.; Fujibuchi, W.; Nishioka, T.; and Kanehisa, M. 1996. Organizing and computing metabolic pathway data in terms of binary relations. In Pacific Symposium on Biocomputing '97, 175-186.

Goto, S.; Nishioka, T.; and Kanehisa, M. 1998. LIGAND: Chemical database for enzyme reactions. Bioinformatics accepted.

Kanehisa, M. 1997a. A database for post-genome analysis. Trends Genet. 13:375-376.

Kanehisa, M. 1997b. Linking databases and organisms: GenomeNet resources in Japan. Trends Biochem. Sci. 22:442-444.

Schuler, G.; Epstein, J.; Ohkawa, H.; and Kans, J. 1996. Entrez: molecular biology database and retrieval system. Methods Enzymol 266:141-162.