
SCIENCE & TECHNOLOGY TRENDS 4

Trends in Technology

ATSUSHI NOGI AND SHOTARO KOHTSUKI Affiliated Fellows

4.1 Introduction

Through international cooperation, the Human Genome Project has proceeded at a greater-than-expected speed. Having completed the draft sequence by June 2000, the project has now entered its final stage toward complete sequence determination. Japan's contribution to the project was 6–7%, which corresponded to the scale of its budget. However, problems have remained, such as the lack of strategy, which forced Japan into a late start in this area. Nevertheless, research projects such as complete sequencing of mouse full-length cDNA (See Footnote) and rice genome sequencing led by The Institute of Physical and Chemical Research (RIKEN), accomplished in August 2002, and ascidian genome sequencing conducted as a joint research project between Japan and the U.S., completed in December 2002, have driven genome science into a new stage.

Now that the human genome and mouse cDNA have been sequenced, the focus should shift from sequence analyses to comprehensive, systematic functional analyses. Japan has advantages in fundamental technologies such as rice genome and full-length cDNA libraries and therefore has great potential to take the leading role in both academic and applied areas.

In recent years, the speed-up and automation of analyzers have resulted in the generation of vast amounts of experimental data, which has given information technology a growing importance. In addition to improved data processing efficiency, knowledge of mathematics and informatics is required to cope with the complicated, comprehensive analyses performed on genome information. To work on such issues systematically, a new academic area, bioinformatics, has been created at the interface of bioscience and informatics.

Bioinformatics was covered in an article in this journal's November 2002 issue, which introduced an overview of the trends in bioinformatics from a bioscience point of view. This report is a sequel to that article, discussing the methodology for the systematic understanding of genome and life phenomena and the current issue of human resource development.

Footnote: full-length cDNA: cDNAs are DNAs obtained by excluding unnecessary sequences from genomic DNAs. cDNAs are generated by using mRNAs (messenger RNAs, the genetic information-carrying substances that only contain protein-encoding sequences) as templates. Unlike partial cDNA fragments, full-length cDNAs possess all the information necessary for protein synthesis and can therefore be used to synthesize proteins. Efficient synthesis of full-length cDNA requires very high skill, an area in which Japan leads other countries.

4.2 Informatics-based approaches in bioinformatics

If we view the issues in genome research from an informatics point of view and attempt to formulate them, we soon realize that most of these issues are far beyond the capabilities of existing computers. This can be attributed to the extremely long lengths of genomic sequences and to the combinatorial nature of the issues. Therefore, such issues must be analyzed by approximate or heuristic means. The development of practical algorithms is the main research area in bioinformatics, and its achievements have greatly

44 QUARTERLY REVIEW No.8 / July 2003

Figure 1: Research trends in bioinformatics

contributed to the success of genomic sequencing. Whole genome shotgun sequencing, an approach employed by Celera Genomics (U.S.), involves the assembly of tens of millions of random sequence fragments, as in a jigsaw puzzle. At first, the approach itself was considered highly uncertain, but by using top-performance computers and a unique algorithm, Celera Genomics accomplished sequence determination at an astoundingly high speed, which strongly demonstrated the power and effectiveness of informatics.

Approximate means do not necessarily arrive at the optimal solution. Therefore, the analysis results require biological evaluation, based on which the algorithms or parameters must be modified. Bioinformatics can be characterized as a tool for narrowing down the vast search domain, which otherwise requires experimental confirmation. It also enables systematic, comprehensive analyses to elucidate the general view of life phenomena, which could not have been seen by analyzing individual genes.

To date, the main research subjects in bioinformatics have been the construction of genomic databases and the development of data analysis tools and their application techniques. From now on, a higher data-processing capability will be employed in genome research for discovering genes and predicting their functions. As shown in Figure 1, genome research, which started from genome sequencing, has gradually shifted its research subject from DNA to genes, genes to proteins and proteins to individuals, increasing the complexity and diversity of the subjects. The subjects of analyses have shifted from sequences to functions or actions. Research and development in bioinformatics must keep pace with such trends in genome research.

4.2.1 Databases

The majority of the vast amounts of data generated from genome research are registered in public databases, which are available to the public through the Internet. Such data can be viewed as a research foundation, which serves as a starting point for all informatics research. The amount of DNA sequence data registered in public genomic databases (GenBank (U.S.), EMBL (Europe), and DDBJ (Japan)) is growing exponentially (Figure 2). Conventionally, databases were used to accumulate primary data such as DNA sequences and protein structural data, but along with the progress of genome projects, they have diversified, incorporating new types of data such as data on single nucleotide polymorphisms (SNP) or mutations and data on gene or protein interactions. In order to perform analyses employing such diverse types of data, further advances in databases and information technology are required.

(1) Annotation of databases

A series of processes for discovering new


Figure 2: Transition in the number of bases registered

Source: Website of DNA Data Bank of Japan

knowledge from various data is called data mining, an important technique in bioinformatics. However, after the experimental data are registered in databases, they are left unprocessed, obstructing the effective application of data mining techniques. To solve this problem, information on their interpretation must be attached to such data. This process is called annotation. Although computerization of annotation is currently under way, computers alone cannot determine the optimum annotation and require human assistance for its validation. A conference on the annotation of mouse full-length cDNA sequences determined at RIKEN was held in August 2000 at RIKEN in Tsukuba. The annotation process involved 2 weeks of heated debates by participating researchers from various countries. Beyond the interpretations of individuals, annotation requires the cooperation of researchers from various organizations for its validation.

(2) Integration of ideas and vocabularies

Each gene or protein is given a customary name during its research process. As a result, the same protein may have been given several different names in different research areas, which require integration. As genome sequencing has been successively accomplished in human and other model organisms, more and more studies are beginning to focus on the entire living world. Conventionally, databases have been constructed independently for each model organism (E. coli, yeast, mouse, etc.), but the lack of criteria for the integration of terms or molecular names has hindered their efficient use.

In order to solve these problems, the organization of non-standardized vocabularies has been initiated to enable their systematic description. Such a process is called ontology, a process for assigning consistent terms and definitions to ideas. The establishment of ontology has been promoted for each research area, such as gene ontology covering genes, interaction ontology covering molecular and cellular interactions, and signal ontology covering signal transductions. Progress in such integration and systematic classification of vocabularies should increase the mutual application of databases, thereby enabling mutual reference among databases of different species.
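The idea of integrating customary names under a consistent vocabulary can be illustrated with a small sketch. The code below is only an illustration under stated assumptions: the synonym table, the database names and the record contents are all hypothetical, and a real ontology carries far richer structure (definitions, hierarchies, relationships) than a flat lookup table.

```python
from collections import defaultdict

# Hypothetical synonym table: customary name (lowercase) -> standardized term.
# The entries are illustrative assumptions, not taken from any real ontology file.
SYNONYMS = {
    "p53": "TP53",
    "trp53": "TP53",
    "tumor protein 53": "TP53",
}


def normalize(name: str) -> str:
    """Map a customary name to its standardized term, if one is known."""
    key = " ".join(name.lower().split())  # ignore case and extra whitespace
    return SYNONYMS.get(key, name.strip())


def cross_reference(*databases):
    """Merge records from several (db_name, records) pairs under standardized terms."""
    merged = defaultdict(dict)
    for db_name, records in databases:
        for name, value in records.items():
            merged[normalize(name)][db_name] = value
    return dict(merged)


if __name__ == "__main__":
    mouse_db = ("mouse_db", {"Trp53": "record A"})
    human_db = ("human_db", {"p53": "record B"})
    print(cross_reference(mouse_db, human_db))
```

Once both databases use the same standardized term, records that were registered under different customary names can be looked up together, which is the kind of mutual reference the text describes.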


4.2.2 Homology analysis

Homology analysis is a powerful approach for studying the functions of genes or proteins based on interspecific amino acid sequence similarities. Proteins show higher structural and functional similarities between closely related species. Consequently, if a functionally identified gene with a sequence similar to the target gene can be found in another species, it would serve as a key to predicting the functional characteristics of the target gene.

In a homology analysis involving an extremely long sequence such as a genomic sequence, speed is required in finding homologous sequence regions. Typical analysis programs for this purpose are FASTA (FAST-All) and BLAST (Basic Local Alignment Search Tool). Of these two, BLAST is more commonly used, due to its higher processing speed, while FASTA is used in detailed analyses due to its high detection sensitivity.

Even BLAST or FASTA requires a long time to search through large-scale genome databases. In such cases, parallelization of the searching process is effective. Examples of parallelization methods are the multiprocessor method SMP (Symmetric MultiProcessor), PC clusters and grids. The multiprocessor method involves installing multiple CPUs in a single computer system to increase the processing performance. PC clustering realizes a parallel computer at low cost by connecting multiple standard PCs through networks. The widely used BLAST program provided by NCBI (National Center for Biotechnology Information, a U.S. center that maintains integrated databases of biological data) is designed for multiprocessor-type parallel calculation, but the multiprocessor method requires expensive hardware. Therefore, versions of BLAST designed for the less expensive PC clustering are available for practical use.

Another alternative is the sharing of the calculation environment, which is represented by the grid. The grid is one of the information technologies that has received great attention in recent years. The BioGrid Project by Osaka University and OBIGrid (discussed later) of the Initiative for Parallel Bioinformatics (IPAB) are examples of domestic bio-related grids. OBIGrid provides an environment for using the latest databases and applications required for bioinformatics. By using such programs, individual laboratories no longer need to purchase expensive calculation facilities. Our country must work on the further enrichment of such domestic efforts.

4.2.3 Protein structure analysis

Proteins are functional molecules fundamental to life activity, and are the final products synthesized in vivo based on the genomic information. One of the largest objectives in bioinformatics is to understand the relationship between the amino acid sequence and the three-dimensional structures of proteins. In the past few years, the amount of data concerning protein structure has increased drastically, showing a 10-fold increase from a decade ago.

The three-dimensional structure of a protein is uniquely determined by its amino acid sequence, which means that theoretically the structure of a protein can be predicted from its sequence. After a protein is synthesized in a cell, it is folded into the most energetically stable structure within a few milliseconds to seconds. This process is called folding. The precision of protein structure prediction based on molecular dynamics is still poor, due to the enormous amount of calculation required for this approach.

An example of a practical approach for structure prediction is homology modeling. Homology modeling takes advantage of the fact that proteins with similar sequences also have similar structures. When the target protein has 30% or higher homology with a known protein at the sequence level, the structure of the target protein can be predicted by partially altering the structure of its homologous protein.

Since the prediction of protein structure has already been realized, the focus of protein research has now shifted to functional analysis. Functional analysis for predicting the functional characteristics of proteins is an important step for drug discovery, and the role of bioinformatics should be extremely important in this area.

4.2.4 Gene Network

The determination of complete genome sequences or protein structures does not mean that we have fully understood life itself. Our next step is to figure out how genes interact with

each other. To elucidate such a network of genes, the vast amount of information that has been accumulated in bioscience needs to be systematized in terms of interaction, so that it can be handled with computers. Therefore, this area is being focused on as a new application area of bioinformatics.

In genome functional analyses, the whole set of genes expressed in a cell is called the transcriptome, and the whole set of proteins synthesized from genes is called the proteome. Identifying when and how a gene is expressed would provide an important clue to the identification of its function. Data obtained from proteome analyses are also useful for other researchers. The Swiss Institute of Bioinformatics is promoting the construction of a database of electrophoresis gel images representing the analysis results obtained for proteins contained in experimental samples. However, since proteome data are patentable and directly lead to industrial applications such as drug discovery, their public disclosure is moving towards restriction. Therefore, it is urgent that domestic proteome-related databases are established.

In addition, studying the cell as a dynamic system constructed by genes encoded in the genome is becoming a main research theme in bioinformatics. Kyoto University's KEGG (Kyoto Encyclopedia of Genes and Genomes) system discloses the results from their gene network research in the form of a database. The conventional methodologies for biological research based on the description of gene functions are inadequate for gene network studies, and must be integrated with informatics.

4.3 Issues in bioinformatics from the viewpoint of information technology

Most analysis tools (software) used in bioinformatics are dependent on technologies developed abroad. As a consequence, their functions are hidden in a black box, as in the case of unshared source code and online-limited distributions. Most commercial software programs are provided in combination with foreign software. Table 1 shows a list of well-established genome sequencing software programs that appear in standard American college textbooks. As can be seen, software developed outside the U.S. is extremely rare. Although there are some excellent domestic application software programs, such as those for protein structure prediction and gene annotation, basic software as widely used as the programs listed in Table 1 is rarely found. The following are the possible reasons mentioned by some researchers in their interviews.

• Japan is far behind the U.S. in terms of the number of bioinformatics researchers. In the U.S., researchers can promptly switch their careers along with the shift in research trends. Such mobility in human resources has made it possible to quickly secure sufficient researchers specialized in bioinformatics. Human resource development in the bioinformatics area is an urgent issue for Japan.
• The integration of informatics with bioscience has not progressed far. The developer of BLAST was originally specialized in mathematics. We need an environment where researchers from different areas can work together.
• The development of software necessary for bioinformatics is abandoned before reaching its distribution. Working programs within individual research activities are left without further development. Even when a highly original algorithm is developed, the work is abandoned after its publication and does not lead to software development. Furthermore, the distribution of software requires packaging of manuals, installation tools and distribution media, which cannot be afforded by individual researchers.

An effective solution to these problems is the establishment of an environment for developing human resources specialized in bioinformatics or a system for evaluating the applicability of software and distributing it. A structure for providing governmental support to such systems needs to be discussed.


Table 1: Major genome sequencing software

Software | Inventor/creator | Characteristics

Homology search
FASTA | Pearson 1988 (University of Virginia, U.S.) | Higher detection sensitivity than BLAST.
BLAST | Altschul 1990 (U.S. NCBI) | Higher speed than FASTA. Most commonly used.
PSI-BLAST | Altschul 1997 (U.S.) | Interactive, iterative version of BLAST for searching protein families. Higher detection sensitivity than SSEARCH.
SEG | Wootton, Federhen 1993 (U.S.) | Increases comparison precision by excluding low-complexity regions and repeats.
SSEARCH | Pearson 1991 (U.S.) | Provides optimal alignment by using dynamic programming. Extremely slow.
Bayes block aligner | Zhu 1998 (U.S.) | Employs Bayes statistics. Slower than SSEARCH, but detects distantly related sequences.
PROBE | Neuwald 1997 (U.S.) | Similar function to PSI-BLAST. Finds the most significant sequence set via a non-interactive process using Bayes statistics.

Multiple alignment (alignment of multiple sequences)
ClustalW | Higgins, Sharp 1988 (U.K.) | Provides alignment of multiple sequences using a progressive approach. Most commonly used for multiple alignment.
PILEUP | Feng, Doolittle 1987 (U.S.) | Provides alignment of multiple sequences using a progressive approach. Employs the Needleman-Wunsch approach for sequence comparison.
MSA | Lipman 1989 (U.S.) | Provides optimal alignment via multidimensional dynamic programming.
PRRP | Goto 1996 (National Institute of Advanced Industrial Science and Technology, CBRC, Japan) | Constructs dendrograms and improves alignment via iterative learning.
SAGA | Notredame, Higgins (France) | Selects highly scored alignments using a genetic algorithm.
HMMER | Eddy 1998 (U.S.) | Employs the Hidden Markov model.

Profile search (search for characteristic patterns)
ProfileSearch | Gribskov 1996 (U.S.) | Searches sequence patterns (motifs).
MAST | Bailey, Gribskov 1997 (U.S.) | Searches sequences matching gap-free sequence blocks.

Gene discovery
RepeatMasker | Smit (University of Washington, U.S.) | Detects and removes repeats to facilitate gene discovery.
TWINSCAN | Korf (University of Washington, U.S.) | Compares genomes between different species and finds genes from conserved sequence regions. A hybrid of the alignment approach and ab initio approach.

Source: Authors’ compilation based on reference [1]
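Several entries in Table 1 are described as using dynamic programming to obtain an optimal alignment. As a rough illustration of that technique, the following is a minimal sketch of local alignment scoring in the Smith-Waterman style. It is not the implementation of any program listed in the table; the function name and the scoring parameters (match, mismatch and a linear gap penalty) are assumptions chosen for readability, and real tools use substitution matrices and more elaborate gap models.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman style dynamic programming with assumed scoring values.
    Only two rows of the score table are kept, so memory grows with len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The running time is proportional to the product of the two sequence lengths, which is why exhaustive dynamic programming is described as extremely slow for database searches and why heuristic programs such as BLAST and FASTA are preferred for large-scale work.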

4.4 Domestic projects in bioinformatics

While slowness in the progress of domestic bioinformatics research has been pointed out, there are some domestic research projects that can match U.S. and European research.

4.4.1 Protein structure prediction program

With drug discovery in view, protein structure prediction is currently attracting great interest. In 2000, a team led by Professor Umeyama of Kitasato University developed FAMS and achieved high marks in CASP (the Critical Assessment of Techniques for Protein Structure Prediction), an international contest of protein structure prediction. FAMS outscored the programs developed in the U.S. and other countries that are commonly used in this area. Unlike other programs, which employ a bottom-up approach to construct an entire structure from partial structures, FAMS broadly grasps the entire structure before predicting the partial structures. This is how humans recognize structures, first focusing on the whole, and FAMS incorporated this human view of structure recognition in its algorithm. Participation in such international

contests should promote domestic research and development.

4.4.2 Biogrid

The grid is an information technology that has gained great interest in recent years. The concept is modeled on the electric power transmission network, based on the image of "a computer system that allows you to use calculation power or discs freely just by plugging in, similar to electricity" (cited from "Trends in Grid Technology" in the February 2003 issue of Science and Technology Trends Quarterly Review).

Biogrid aims at the sharing of the program path necessary for bioinformatics through grid technology.

OBIGrid (Open Bioinformatics Grid) is led by "Genome Information Science," supported by a Grant-in-Aid for Scientific Research on Priority Areas from the Ministry of Education, Culture, Sports, Science and Technology, and the Initiative for Parallel Bioinformatics (IPAB). It aims at establishing an environment that enables the use of the latest databases and applications just by accessing the grid. OBIGrid has the potential to become a hub of genome analysis, lowering the barriers for new researchers entering the bioinformatics area. It should attract many researchers who, by themselves, cannot afford to arrange an environment for using the various databases and applications required for bioinformatics research. Furthermore, the grid may provide the opportunity for disclosing experimental data that are usually hoarded. Since files can be accessed as easily as in a LAN environment, novel discoveries from such experimental data can be expected.

4.4.3 Commercialization by business-academia collaboration

There is a move afoot to commercialize and distribute the software developed in publicly funded bioinformatics research. The cDNA function annotation system developed in the FANTOM (Functional Annotation of Mouse) project of RIKEN was commercialized in 2002. The system is noteworthy as a successful case where the achievement of a cooperative research project between a public research institute and a private corporation led to a general-purpose package.

The cases mentioned above represent the potential of domestic bioinformatics research in Japan.

4.5 Efforts in human resource development

4.5.1 Talents sought for bioinformatics research

When researchers were interviewed about the causes of the slowness in domestic bioinformatics research, the predominant answer was "the lack of human resources." Bioinformatics is essential as a technology that strongly promotes diverse analyses ranging from genome and DNA analyses to structural and functional analyses of proteins. Since bioinformatics is an amalgam of informatics and bioscience, human resource development is an important task in our country. "Bilingual" talents who understand the languages of both biology/medicine and informatics are desired.

Furthermore, the lack of communication and exchange across divisions and departments was also blamed for the slowness in domestic informatics research. In the FANTOM project of RIKEN, researchers from bioscience, medicine and informatics worked in close cooperation, which led to a great achievement. It is difficult for the informatics and bioscience areas to exploit each other's research results merely by exchanging experimental and analytical data. The researchers in these two areas must not draw a distinction between their roles in such cooperative projects.

Regarding informatics specialists, those skilled in database (DB) structure and programming are preferred for dealing with the enormous amount of data. Meanwhile, bioscientists are not asked to have great knowledge of IT, but are required to have a sufficient understanding of the mechanisms of analytical programs and a good command of them. Whenever needed, they should be capable of modifying the programs according to the individual experiments.

An effective way to develop human resources seems to be the promotion of human resource flow from the IT area, with its abundant talents, into the bioscience area. However, a high level of bioinformatics research cannot be achieved

without a deep understanding of bioscience and medicine. Therefore, we must create a path for bioscience/medicine specialists to study IT and proceed into the bioinformatics area to secure high quality human resources.

In Western countries, especially in the U.S., researchers who have mastered informatics find their way into new research areas such as genome science. Their collaboration with genetics researchers led to advances in the bioinformatics area. Theoretically and mathematically supported analytical approaches have been applied to gene expression or protein structure/function experiments, and after repeated trial and error, they were finally established as practical analytical techniques.

4.5.2 Policy for human resource development

Bioinformatics, moreover, is greatly expected as a new academic area. Due to the radical advances in genome science and protein research, an immediate supply of human resources is demanded by many research institutes including private institutes. Human resource development in the bioinformatics area is an urgent task, which should be promoted by the establishment of an environment having the following conditions.

(1) Graduate schools should offer students the option of studying in both areas of informatics and bioscience/medicine, and approve research in the amalgam area.
(2) An environment where informatics researchers or technicians can collaborate with bioscience/medical researchers must be established. This should promote the integration of experiments with bioinformatics. For example, analytical algorithms may be developed according to the progress of DNA or protein experiments, leaving the actual analysis to computers. This should promote technical advancement and development of practical talents.
(3) A system for evaluating the achievements and technical contributions of inventors of various computer analysis algorithms and software tools used in bioinformatics should be established.

For developing human resources for informatics, flexible management or modification of graduate school curricula should be effective. Additionally, faculty positions must be established in the bioinformatics area to offer bioinformatics researchers a career path comparable to those in existing areas. Evaluation of research achievements and human resource allocation must be performed within such a framework. Furthermore, development of research systems accepting the participation of private institutes is desired. Such policies can also attract new entrants from the related academic areas. In particular, research areas such as mathematics, statistics and mathematical engineering can greatly contribute to the development of the theoretical foundation in bioinformatics. Governmental funding should be provided for systems meeting these requirements.

In 2001, Keio University established Advanced Life Science Institute Inc. in Tsuruoka City, Yamagata prefecture. With the slogan of "IT-driven bioscience," the institute provides an environment where faculty and young researchers in informatics and bioscience areas, together with the students, can study each other's areas and conduct amalgam research. In 2002, Osaka University established the Graduate School of Frontier Biosciences, which is an interdisciplinary department composed of life science-related laboratories from Osaka University, including laboratories of medical/bioscience, bioengineering, biology and physics.

Meanwhile, in 2001, the government established human resource education units, supported by Special Coordination Funds for Promoting Science and Technology, for the prompt education of bioinformatics professionals. By 2002, a total of 6 units were established in Tokyo University, Kyoto University, the National Institute of Advanced Industrial Science and Technology, Keio University and the Nara Institute of Science and Technology, and integrated human resource development is progressing in each unit. By spreading such movements to other universities and research institutes, activation of bioinformatics and other interdisciplinary research areas should be promoted.


4.6 Conclusion

Speed is an important factor in genome research, the area in which bioinformatics exerts its strength. Many countries are making rapid investments of research resources in this area, and Japan is also increasing its investment in genome research. Yet, the development of domestic human resources in bioinformatics, which supports genome research accelerating toward drug discovery and tailor-made medicine, has not reached an adequate level.

Genome and protein research is gradually shifting into a new stage, from relatively simple sequencing into functional analysis and, furthermore, into application. Along with this trend, requirements for bioinformatics will grow larger, as will its importance in research studies. To fulfill such requirements, it is urgent that talents having academic/technical knowledge of both informatics and bioscience are developed. As mentioned in Chapter 4.5, some actions have already been taken, but not quite enough on a national scale.

For human resource education, in the short term, it should be effective to establish an environment where researchers from informatics and bioscience/medical areas can conduct their research while mutually sharing their knowledge and know-how, to promote collaboration between informatics researchers/technicians and bioscience/medical researchers. Bioinformatics-related projects currently in progress should also put emphasis on this point. Meanwhile, we must promote mutual exchange of researchers and technicians between informatics and bioscience/medicine.

Acknowledgement

We would like to thank Project Director Dr. Yoshihide Hayashizaki, Team Leader Dr. Yasushi Okazaki and Project Director Dr. Akihiko Konagaya of RIKEN Genomic Sciences Center for kindly providing us with their information and resources in preparing this manuscript.

References

[1] Written by David W. Mount, translation supervised by Yasushi Okazaki and Hidemasa Bono, "Bioinformatics," Medical Science International (2002).

(Original Japanese version: published in January 2003)
