Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野シークエンスデータ情報処理分野

136 Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野シークエンスデータ情報処理分野 Professor Minoru Kanehisa, Ph.D. 教授（委嘱）理学博士金久實 Research Associate Toshiaki Katayama, M.Sc. 助手理学修士片山俊明 Research Associate Shuichi Kawashima, M.Sc. 助手理学修士川島秀一 Lecturer Tetsuo Shibuya, Ph.D. 講師理学博士渋谷哲朗 Research Associate Michihiro Araki, Ph.D. 助手薬学博士荒木通啓 Owing to continuous developments of high-throughput experimental technologies, ever-increasing amounts of data are being generated in functional genomics and proteomics. We are developing a new generation of databases and computational technologies, beyond the traditional genome databases and sequence analysis tools, for making full use of such large-scale data in biomedical applications, espe- cially for elucidating cellular functions as behaviors of complex interaction systems. 1. Comprehensive repository for community We have been developing the server based on genome annotation open source software including BioRuby, BioPerl, BioDAS and GMOD/GBrowse to make Toshiaki Katayama, Mari Watanabe and Mi- the system consistent with the existing open noru Kanehisa standards. The contents of the KEGG DAS database can be accessed graphically in a web KEGG DAS is an advanced genome database browser using GBrowse GUI (graphical user in- system providing DAS (Distributed Annotation terface) and also programatically by the DAS System) service for all organisms in the protocol. The DAS, which is an XML over HTTP GENOME and GENES databases in KEGG data retrieving protocol, enables the user to (Kyoto Encyclopedia of Genes and Genomes). write various kinds of automated programs for Currently, KEGG DAS contains 6,943,951 anno- analyzing genome sequences and annotations. tations for genome sequences of 441 organisms. For example, by combining KEGG DAS with The KEGG DAS server provides gene annota- KEGG API, a program to retrieve upstream se- tions linked to the KEGG PATHWAY and quences of a given set of genes which have LIGAND databases, as well as the SSDB data- similar expression patterns on the same path- base containing paralog, ortholog and motif in- way,canbewrittenveryeasily.GBrowse,the formation. In addition to the coding genes, in- graphical interface, enables user to browse, formation of non-coding RNAs predicted using search, zoom and visualize a particular region Rfam database is also provided to fill the anno- of the genome. Moreover, users are also able to tation of the intergenic regions of the genome. add their own annotations onto the GBrowse 137 view by providing another DAS server or by edge of molecular interaction/reaction pathways simply uploading their own data as a file. This and other systemic functions (PATHWAY and functionality enables researchers to add various BRITE databases), the information about the annotations on the genome and by sharing their genomic space (GENES database), and informa- annotations with the community they can con- tion about the chemical space (LIGAND data- tinuously refine the genome annotation, so- base). KEGG API provides valuable means to re- called “community annotation.” This year we trieve various kinds of information stored in the have updated the GBrowse genome browser to KEGG and has become an increasingly popular the latest version, which has improved user in- mode of access. Recently, we have introduced terface. We are also responsible to Japanese lo- several new methods to search compounds, calization of the browser. The KEGG DAS is drugs and glycans by their structure, mass, com- weekly updated and freely available at http:// position and annotations. Additionally, the das.hgc.jp/. methods to colorize the PATHWAY diagram is enhanced to control the different elements shar- 2. Automatic assignments of orthologs and ing the same name can be distinguished by their paralogs in complete genomes ID. The KEGG API is available at http://www. genome.jp/kegg/soap/. Toshiaki Katayama and Minoru Kanehisa 4. EGassembler: web server for large-scale Accession of the number of sequenced geno- clustering and assembling ESTs and mes made it difficult to characterize orthologous genomic DNA fragments relationship among organisms. We are developing a computational method for finding appro- Ali Masoudi-Nejad, Shuichi Kawashima, Koi- priate orthologous gene clusters automatically. It chiro Tonomura, Masanori Suzuki, Minoru is based on a graph analysis of the KEGG SSDB Kanehisa database, containing sequence similarity rela- tions among all the genes in the completely EST sequencing has proven to be an economi- sequenced genomes. The nodes of the SSDB cally feasible alternative for gene discovery in graph are genes and the edges are the Smith- species lacking a draft genome sequence. Ongo- Waterman sequence similarity scores computed ing large-scale EST sequencing projects feel the by the SSEARCH program. The edges are not need for bioinformatics tools to facilitate uni- only weighted but also directed, indicating the form ESTs handling. This brings about a re- best (top-scoring) hit when a gene in an organ- newed importance to a universal tool for proc- ism is compared against all genes in another or- essing and functional annotation of large sets of ganism. Thus, a highly connected cluster of ESTs in order to cover the complete transcrip- nodes containing a number of bidirectional best tome of an organism. EGassembler (http://egas- hits might be considered an ortholog cluster sembler.hgc.jp/) is a web server, which provides consisting of functionally identical genes. Such a an automated as well as a user-customized cluster can be found by our heuristic method for analysis tool for cleaning, repeat masking, vec- finding quasi-cliques, but the SSDB graph is too tor trimming, organelle masking, clustering and large to perform quasi-clique finding at a time. assembling of ESTs and genomic fragments. It is Therefore, we introduce a hierarchy (evolution- also designed to serve as a standalone web ap- ary relationship) of organisms and treat the plication for each one of those processes. The SSDB graph as a nested graph. However, the web server is freely available and provides the method still requires large computation time community with a unique all-in-one online ap- along with the number of organism increases, plication web service for large scale ESTs and we are trying to refine the process to make it genomic DNA clustering and assembling, espe- faster and accurate. cially for EST processing and annotation projects. 3. SOAP/WSDL interface for the KEGG system 5. New version of MAGEST database Toshiaki Katayama, Shuichi Kawashima and Shuichi Kawashima and Minoru Kanehisa Minoru Kanehisa We have developed a new version of MAG- We have continued to develop KEGG API, a EST database. Previously MAGEST was de- web service to facilitate usability of the KEGG signed as a database for maternal gene expres- system. KEGG is a suite of databases and associ- sion information for an ascidian, Halocynthia ated software, integrating our current knowl- roretzi. In the new version of MAGEST, it is ex- 138 tended to four embryonic developmental stages developed a high performance database entry including the following additional stages, e.g. retrieval system, named HiGet. The HiGet sys- early cleavage, early gastrula and early neurula. tem is constructed on the HiRDB, a commercial Furthermore, we constructed gene clusters and ORDBMS (Object-oriented Relational Database assembled sequences of ESTs by EGassembler. Management System) developed by Hitachi, These clusters enabled us to cross-refer the gene Ltd. It is publicly accessible on the Web page at expression of the four different stages. The new http://higet.hgc.jp/ or SOAP based web service web site is accessible at http://magest.hgc.jp/. at http://higet.hgc.jp/soap/. HiGet can execute Now we are comparing the MAGEST sequences full text search on various biological databases. with other species of ascidian, Ciona sp. Because In addition to the original plain format, the sys- the phylogenetic position of H. roretzi is a good tem contains data in the XML format in order to outgroup for Enterogonia to which Ciona sp. be- provide a field specific search facility. When a longs, we expect that the result from this com- complicated search condition is issued to the parative analysis lead us to understand the di- system, the search processing is executed effi- versification of gene families among Urocho- ciently by combining several types of indices to data. reduce the number of records to be processed within the system. Current searchable databases 6. SSS: a sequence similarity search service areGenBank,UniProt,Prosite,OMIM,PDBand RefSeq. We are planning to include other valu- Toshiaki Katayama, Kazuhiro Ohi, Minoru able databases and also planning to develop an Kanehisa inter-database search interface and a complex search facility combining keyword search and There are various services in the world to find sequence similarity search. similar sequences from the database, such as the famous BLAST service provided at NCBI. How- 8. Development of algorithms for biosyn- ever, the method to search and the database to thetic process analysis be searched could not be added from outside. To provide our super computer resources at the Kohichi Suematsu, Tetsuo Shibuya, Michihiro Human Genome Center to the research commu- Araki, and Minoru Kanehisa nity, we started to develop a new service for the sequence similarity search, SSS. In SSS, user can We developed algorithms for identifying bio- select the search algorithm from BLAST, FASTA, synthetic process of medicinal products by util- SSEARCH, TRANS and EXONERATE. This va- izing the database of sub-molecular building riety of options is unique among the public blocks in biosynthetic processes. The problem is services. Then user is prompted to select appro- to find the most reasonable decomposition of a priate database depending on the algorithm se- graph into subgraphs which are annotated in lected and the search is executed.

Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野シークエンスデータ情報処理分野

Role of Phytochemicals in Colon Cancer Prevention: a Nutrigenomics Approach

Supplementary Data

Peptide Vaccines for Cancers Expressing MPHOSPH1 Or DEPDC1 Polypeptides

TRANSCRIPTOME ANALYSIS in MAMMALIAN CELL CULTURE: APPLICATIONS in PROCESS DEVELOPMENT and CHARACTERIZATION Anne Kantardjieff We

Association of Cnvs with Methylation Variation

Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野シークエンスデータ情報処理分野

ARTICLE a Genomewide Screen for Late-Onset Alzheimer Disease in a Genetically Isolated Dutch Population

Transdifferentiation of Human Mesenchymal Stem Cells

Profiling Analysis Reveals the Crucial Role of the Endogenous Peptides

Genome-Wide Association Studies in Alzheimer Disease

Modeling Chromosomes in Mouse to Explore the Function of Genes

(12) United States Patent (10) Patent No.: US 7,834,171 B2 Leake Et Al

Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野 シークエンスデータ情報処理分野

Human Genome Center Laboratory of Genome Database Laboratory of Sequence Analysis ゲノムデータベース分野シークエンスデータ情報処理分野