From Bioinformatics to Computational Biology

Downloaded from genome.cshlp.org on September 27, 2021 - Published by Cold Spring Harbor Laboratory Press Commentary From Bioinformatics to Computational Biology Jean-Michel Claverie Structural and Genetic Information Laboratory, CNRS-AVENTIS UMR 1889, Marseille cedex 20, France It is quite ironic that the uncertainty about the number butions of the events recorded in the databases. Once of human genes (28,000–120,000) (Ewing and Green known, these rules considerably simplify the descrip- 2000; Liang et al. 2000; Roest Crollius et al. 2000) ap- tion of the database content and, more important, pears to increase as the determination of the human have a predictive power: the realm of the theory may genome sequence is nearing completion. I shall con- encompass objects or events that have not been ob- tend here that this paradox reveals deep epistemologi- served previously. This part of theoretical endeavor is cal problems, and that “bioinformatics”—a term entirely missing in current bioinformatics. As a conse- coined in 1990 to define the use of computers in sequence, we are still not able to agree on the number of quence analysis—is no longer developing in directions human genes despite having the complete sequence of relevant to biology. the human genome at hand. After the pioneers who established the basic con- Identifying precisely the 5Ј and 3Ј boundaries of cepts of molecular sequence analysis (Fitch and Mar- genes (the transcription unit) in metazoan genomes, as goliash 1967; Needleman and Wunsch 1970; Chou well as the correct sequences of the resulting mRNA and Fasman 1974), most computational biologists of (“exon parsing”) has been a major challenge of bioin- my generation (the second one) embarked on their formatics for years. Yet, the current program perfor- journey into the emerging discipline with the ambi- mances are still totally insufficient for a reliable auto- tion to turn it into the bona fide theoretical branch of mated annotation (Claverie 1997; Ashburner 2000). It molecular biology. Having a physicist’s background, I is interesting to recapitulate quickly the research in suspect that many of us had the vision of establishing this area to illustrate the essential limitation plaguing bioinformatics in a leadership role over experimental modern bioinformatics. Encoding a protein imposes a biology, similar to the supremacy that theoretical variety of constraints on nucleotide sequences, which physics enjoys over experimental physics. Somewhere do not apply to noncoding regions of the genome. along the line, it seems that bioinformatics lost this These constraints induce statistical biases of various ambition and became sidetracked onto what physicists kinds, the most discriminant of which was soon recog- would call a “phenomenological” pathway. nized to be the distribution of six nucleotide-long Let us follow the example of particle physics for a “words” or hexamers (Claverie and Bougueleret 1986; little longer. There, theoretical research has two phases Fickett and Tung 1992). Initial gene parsing methods (which, in fact, run in parallel). In the first phase (so- were then simply based on word frequency computa- called phenomenological), a large number of physical tion, eventually combined with the detection of splic- events are recorded in huge raw databases, classified ing consensus motifs. The next generation of software into separate groups based on statistical regularities, implemented the same basic principles into a simu- and then utilized to identify the most recurrent ob- lated neural network architecture (Uberbacher and jects. Optimal database design, fast classification/ Mural 1991). Finally, the last generation of software, clustering algorithms, and data mining software are based on hidden Markov models, added an additional the main area of development here. The level of knowl- refinement by computing the likelihood of the pre- edge gained from this phase is, for instance, that ob- dicted gene architectures (e.g., favoring human genes jects A and B often appear together except when C is with an average of seven coding exons, each 150 around, or when parameter X is lower than a certain nucleotides long) is added (Kulp et al. 1996; Burge and threshold; it is mostly statistical in nature. The parallel Karlin, 1997)). These ab initio methods are used in con- with the current state of bioinformatics is clear. junction with a search for sequence similarity with pre- However, theoretical physics also has a subse- viously characterized genes or expressed sequence tags quent, totally different phase, aiming at discovering (EST). Sadly, it is often claimed that matching back the basic (few) rules (e.g., E = mc2) underlying the re- cDNA to genomic sequences is the best gene identifi- lationships between the objects, their individual prop- cation protocol; hence, admitting that the best way to erties, and thus finally explaining the statistical distri- find genes is to look them up in a previously established catalog! E-MAIL [email protected]; FAX +33-4-91-16- Thus, the two main principles behind state-of-the- 45-49. Article and publication are at www.genesdev.org/cgi/doi/10.1101/ art gene prediction software are (1) common statistical gad.155500. regularities and (2) plain sequence similarity. From an 10:1277–1279 ©2000 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/00 $5.00; www.genome.org Genome Research 1277 www.genome.org Downloaded from genome.cshlp.org on September 27, 2021 - Published by Cold Spring Harbor Laboratory Press Claverie epistemological point of view, those concepts are quite 2241qnbuNjbnjCfbdi, leading to its decoding to: primitive. For instance, the concept of analyzing the YellowDuckwillmeetTarzanat1130pmatMiamiBeach. frequency of groups of letters was actually introduced However, this is not truly useful if we do not know the by Arab scholars around A.D. 700 to break substitution meaning of the predefined code words: YellowDuck, ciphering. Thus, the legendary cryptanalyst al-Kindi Tarzan, and MiamiBeach (e.g., a given boat, a certain (Singh 1999) could still grab the essence of modern admiral, and a precise geographical location). bioinformatics without much difficulty. Moreover, the Similarly, for whatever DNA sequence we deci- above concepts are intrinsically conservative and in- pher, we have to determine its meaning from the cell’s troduce a bias in favor of the detection of genes similar point of view. Thus, I believe that the days of abstract to those already known. But the most fundamental DNA “numerology” are over, and theoretical biologists limitation of the current approaches is that they bear with a strong interest in the intricacies of the cell ma- absolutely no relationship to the actual molecular chinery should now reinvest the field. We now need to mechanisms of gene expression; when a human cell make educated guesses on the meaning of the code triggers the transcription of a given region of its ge- words, for example, by concentrating on the way they nome, it is not because an homologous region exists in might interact with each other. The statistical features yeast, or because the transcripts (once translated) will recognized at the level of DNA sequence have now to lead to a meaningful (three-dimensional folding) be related to chromatin structures, kinetic properties, amino acid sequence. Thus, current approaches are not and physicochemical principles of macromolecular in- on the pathway of a theoretical understanding of the teractions. The huge amount of information acquired genome, and have no predictive power beyond the recently (Szentirmay and Sawadogoo 2000) on the spa- realm of immediate analogy. This limitation is well il- tial organization of large molecular edifices such as lustrated by the complete failure of current programs transcriptional complexes and the spliceosome have in locating nonprotein coding genes (such as Xist and now to be incorporated into a radically new type of H19), which might have essential regulatory roles. The bioinformatic approach. Similarly, the careful classifi- number of nonprotein coding genes is unknown, and cation (requiring a detailed biological knowledge) of they might constitute a significant fraction of yet genes in well-defined subsets is certainly the key in anonymous EST clusters. The same fundamental prob- identifying the multiplicity of specific regulatory mo- lem is also attested by the near-zero performance of tifs (Wasserman and Fickett 1998; Fickett and Wasser- current methods to locate core promoter regions, as man 2000) resulting in biologically significant pro- well as all other regulatory segments (Fickett and Hatzi- moter regions, while the quest of a generic definition georgiou 1997; Stormo 2000). Yet, textbooks persist in has remained elusive (Fickett and Hatzigeorgiou 1997). misusing the words “signal” or “motif” to qualify ge- Clearly, taking advantage of such complex and special- nomic subsequence such as TATAAA or CCAAT, the ized knowledge in the design of new gene prediction occurrence of which are only exceptionally related to methods requires different skills than refining the now transcription. Students are left to discover the hard routine statistical “deciphering” approaches. way that the promoter theory they were taught is not For those wishing to remain at a more abstract at all predictive. Finally, the presence of repeats or low level, designing new algorithms at least logically con- complexity regions, induce intractable situations for sistent with known biochemical

From Bioinformatics to Computational Biology

Gene Prediction and Genome Annotation

Gene Structure Prediction

A Curated Benchmark of Enhancer-Gene Interactions for Evaluating Enhancer-Target Gene Prediction Methods

There Is a Lot of Research on Gene Prediction Methods

Gene Prediction Using Deep Learning

Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration

"An Overview of Gene Identification: Approaches, Strategies, and Considerations"

A Benchmark Study of Ab Initio Gene Prediction Methods in Diverse Eukaryotic Organisms

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Progress in Gene Prediction: Principles and Challenges Srabanti Maji and Deepak Garg*

PATTERNS of DIPEPTIDE USAGE for GENE PREDICTION a Thesis

Bioinformatics Is a New Discipline That Addresses the Need to Manage and Interpret the Data That in the Past Decade Was Massively Generated by Genomic Research