Trends in Bioinformatics Technology

SCIENCE & TECHNOLOGY TRENDS, QUARTERLY REVIEW No.8 / July 2003

ATSUSHI NOGI AND SHOTARO KOHTSUKI
Affiliated Fellows

4.1 Introduction

Through international cooperation, the Human Genome Project has proceeded at a greater-than-expected speed. Having completed draft sequencing by June 2000, the project has now entered its final stage toward complete sequence determination. Japan’s contribution to the project was 6–7%, which corresponded to the budget scale. However, problems have remained, such as the lack of strategy, which forced Japan into a late start in this area. Nevertheless, research projects such as complete sequencing of mouse full-length cDNA (see Footnote) and rice genome sequencing led by The Institute of Physical and Chemical Research (RIKEN), accomplished in August 2002, and ascidian genome sequencing conducted as a joint research project between Japan and the U.S., completed in December 2002, have driven genome science into a new stage.

Now that the human genome and mouse cDNA have been sequenced, the focus should shift from sequence analyses to comprehensive, systematic functional analyses. Japan has advantages in fundamental technologies such as rice genome and full-length cDNA libraries and, therefore, has great potential to take the leading role in both academic and applied areas.

In recent years, the speed-up and automation of analyzers have resulted in the generation of vast amounts of experimental data, which has given growing importance to information technology. In addition to improved data processing efficiency, knowledge of mathematics and informatics is required to cope with the complicated, comprehensive analyses performed on genome information. In order to work on such issues systematically, a new academic area, bioinformatics, has been created at the interface of bioscience and informatics.

Bioinformatics was covered in an article in this journal’s November 2002 issue, which gave an overview of the trends in bioinformatics from a bioscience point of view. This report is a sequel to that article, discussing the methodology for the systematic understanding of genome and life phenomena and the current issue of human resource development.

Footnote: full-length cDNA: cDNAs are DNAs obtained by excluding unnecessary sequences from genomic DNAs. cDNAs are generated by using mRNAs (messenger RNAs, the genetic information-carrying substances that only contain protein-encoding sequences) as templates. Unlike partial cDNA fragments, full-length cDNAs possess all the information necessary for protein synthesis and can therefore be used to synthesize proteins. Efficient synthesis of full-length cDNA requires very high skill, in which Japan takes the initiative among other countries.

4.2 Informatics-based approaches in bioinformatics

If we view the issues in genome research from an informatics point of view and attempt to formulate them, we soon realize that most of these issues are far beyond the capabilities of existing computers. This can be attributed to the extremely long lengths of genomic sequences and to the combinatorial nature of the issues. Therefore, such issues must be analyzed by approximate or heuristic means. The development of practical algorithms is the main research area in bioinformatics, and its achievements have greatly contributed to the success of genomic sequencing. Whole genome shotgun sequencing, an approach employed by Celera Genomics (U.S.), involves the assembly of tens of millions of random sequence fragments, as in a jigsaw puzzle. At first, the approach itself was considered to have great uncertainty, but using top-performance computers and a unique algorithm, Celera Genomics accomplished sequence determination at an astoundingly high speed, which has strongly demonstrated the power and effectiveness of informatics.

Approximate methods do not necessarily yield the optimal solution. Therefore, the analysis results require biological evaluation, based on which the algorithms or parameters must be modified. Bioinformatics can be characterized as a tool for narrowing down the vast search domain, which otherwise requires experimental confirmation. It also enables systematic, comprehensive analyses to elucidate the general view of life phenomena, which could not have been seen by analyzing individual genes.

To date, the main research subjects in bioinformatics have been the construction of genomic databases and the development of data analysis tools and their application techniques. From now on, a higher data-processing capability will be employed in genome research for discovering genes and predicting their functions. As shown in Figure 1, genome research, which started from genome sequencing, has gradually shifted its research subject from DNA to genes, genes to proteins and proteins to individuals, increasing the complexity and diversity of the subjects. The subjects of analyses have shifted from sequences to functions or actions. Research and development in bioinformatics must be in concert with such trends in genome research.

Figure 1: Research trends in bioinformatics

4.2.1 Databases

The majority of the vast amounts of data generated from genome research are registered in public databases, which are available to the public through the Internet. Such data can be viewed as a research foundation, which serves as a starting point for all informatics research. The amount of DNA sequence data registered in public genomic databases (GenBank (U.S.), EMBL (Europe), and DDBJ (Japan)) is growing exponentially (Figure 2). Conventionally, databases were used to accumulate primary data such as DNA sequences and protein structural data, but along with the progress of genome projects, they have diversified, incorporating new types of data such as data on single nucleotide polymorphisms (SNP) or mutations and data on gene or protein interactions. In order to perform analyses employing such diverse types of data, further advances in databases and information technology are required.

Figure 2: Transition in the number of bases registered (Source: Website of DNA Data Bank of Japan)

(1) Annotation of databases

A series of processes for discovering new knowledge from various data is called data mining, an important technique in bioinformatics. However, after the experimental data are registered in databases, they are left unprocessed, obstructing the effective application of data mining techniques. To solve this problem, information on their interpretation must be attached to such data. This process is called annotation. Although computerization of annotation is currently under way, computers alone cannot determine the optimum annotation, and human assistance is required for its validation. A conference on the annotation of mouse full-length cDNA sequences determined at RIKEN was held in August 2000 at RIKEN in Tsukuba. The annotation process involved 2 weeks of heated debates by participating researchers from various countries. Beyond interpretations of individuals, annotation requires the cooperation of researchers from various organizations for its validation.

(2) Integration of ideas and vocabularies

Each gene or protein is given a customary name during its research process. As a result, the same protein may have been given several different names in different research areas, and these names require integration. As genome sequencing has been successively accomplished in human and other model organisms, more and more studies are beginning to focus on the entire living world. Conventionally, databases have been constructed independently for each model organism (E. coli, yeast, mouse, etc.), but the lack of criteria for the integration of terms or molecular names has hindered their efficient use.

In order to solve these problems, organization of non-standardized vocabularies has been initiated to enable their systematic description. This process is called ontology, a process for assigning consistent terms and definitions to ideas. The establishment of ontology has been promoted for each research area, such as gene ontology covering genes, interaction ontology covering molecular and cellular interactions, and signal ontology covering signal transductions. Progress in such integration and systematic classification of vocabularies should increase mutual application of databases, thereby enabling mutual reference among databases of different species.

4.2.2 Homology analysis

Homology analysis is a powerful approach for studying the functions of genes or proteins based on interspecific amino acid sequence similarities. Proteins show higher structural and functional similarities between closely related species. Consequently, if a functionally identified gene with a similar sequence to the target gene can be found in another species, it would serve as a key to predicting the functional characteristics of the target gene.

In a homology analysis [...]

[...]grams, individual laboratories no longer need to purchase expensive computing facilities. Our country must work on the further enrichment of such domestic efforts.

4.2.3 Protein structure analysis

Proteins are functional molecules fundamental to life activity, which are the final products synthesized in vivo based on the genomic information. One of the largest objectives in bioinformatics is to understand the relationship

