Sequences and Topology Editorial Overview Mark Gerstein* and Barry Honig†

Addresses
*Department of Molecular Biophysics and Biochemistry, 266 Whitney Avenue, Yale University, PO Box 208114, New Haven, CT 06520, USA; e-mail: [email protected]
†Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street, New York, NY 10032, USA; e-mail: [email protected]

Current Opinion in Structural Biology 2001, 11:327–329
0959-440X/01/$ (see front matter)
© 2001 Elsevier Science Ltd. All rights reserved.

The reviews appearing in this section of Current Opinion in Structural Biology deal with many of the data sources currently used in bioinformatics: genome sequences, three-dimensional structures of proteins and expression data sets. They also include a wide variety of computational approaches: sequence and structure alignment and analysis, gene-expression clustering and biophysical analysis. Broadly, the section follows the molecular biological ‘data flow’ from raw genome sequences to detailed structural understanding and, in the process, touches on genome annotation, integration of expression information, fold assignments, structural alignment and the understanding of protein–protein interactions.

Bioinformatics is a new field and is still in the process of being defined. It is focused on analysis of genomic and, more recently, proteomic data. The range of tools that has been brought to bear on these data is enormous and, as reflected in the reviews in this section, has its sources in different disciplines: computer science, mathematics and statistics, physics, chemistry, biochemistry and biophysics. A remarkable aspect of the growth of bioinformatics has been the speed at which the various disciplines have been integrated into the research programs of individual laboratories. For example, biophysicists have learned and even developed dynamic programming methods, whereas computer scientists have made important contributions to the analysis of three-dimensional structure. A common language is emerging and some common goals exist, while others are in the process of being defined.

As is necessarily the case, different perspectives are evident in the articles in this section and in the literature that is reviewed. A common thread that runs through many of the reviews is the use of clustering to define biological ‘parts’ and then the use of these parts as frameworks for data integration and analysis. The clustering and classification of data is, of course, an essential element in the history of biology and the vast quantity of new data that has become available opens up a seemingly limitless set of possibilities for the derivation of new relationships and groupings. An alternate viewpoint emphasizes continuous aspects of biological data. We will return to the apparent ‘conflict’ between these two perspectives below. A second issue that arises from the emphasis on clustering involves the actual goals of the various analyses being carried out. On the one hand, the clustering of data and the derivation of new relationships is a valuable goal in its own right. On the other hand, it is not yet clear how to maximize the impact of bioinformatics on mainstream biology, which, for many years, has focused on the detailed characterization of individual systems in specific organisms. It is worth reading the reviews in this section with these questions and perspectives in mind.

Califano (pp 330–333) discusses recent trends in sequence analysis and describes how profile-based methods have, in most cases, replaced pairwise analysis methods. Profile methods of course rely either on the existence of multiple alignments or on the ability to generate them on-the-fly, and reflect major advances that have taken place in building consensus models for sequence families. A second theme discussed in this review is the combination and integration of different methods towards various goals. Examples include the identification of regulatory motifs in eukaryotes, protein structure prediction and the combined use of expression clustering and sequence data in promoter detection.

The review by Kriventseva, Biswas and Apweiler (pp 334–339) addresses the clustering and analysis of protein families in greater detail. They discuss the construction and use of protein family databases, and how a number of these are integrated in the new InterPro resource. The article also discusses clustering in structural databases and phylogenetic classification. The authors emphasize the challenge associated with the prediction of protein function and highlight the Gene Ontology consortium, whose goal is to produce a vocabulary for biological processes. It will be interesting to see the extent to which modern biology, and biologists, will be amenable to the construction of a controlled vocabulary.

Altman and Raychaudhuri (pp 340–347) discuss various applications of whole-genome expression analysis. A variety of clustering methods has been applied to microarray data and others have been developed. They describe how the clusters found from analyzing expression data can be used as a starting point for the prediction of regulatory elements, protein function, interactions and localization. In fact, their review artfully illustrates yet another application of clustering, that of clustering literature databases. They base their entire exposition on a clustering of expression analysis literature in terms of word counts.
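As a rough illustration of what clustering literature ‘in terms of word counts’ can involve, the Python sketch below groups a few short titles by the similarity of their word-count vectors. It is not the procedure used by Altman and Raychaudhuri: the titles are invented, and the cosine measure, the single-linkage grouping, the 0.3 threshold and the function names are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Single-linkage grouping: two texts share a cluster label whenever
    their similarity reaches the (assumed) threshold, directly or via a chain."""
    vectors = [word_counts(t) for t in texts]
    labels = list(range(len(texts)))  # every text starts in its own cluster
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[j], labels[i]
                # merge the two clusters by relabelling every member of `old`
                labels = [new if lab == old else lab for lab in labels]
    return labels


# Hypothetical titles, invented for illustration only.
titles = [
    "clustering of yeast expression profiles",
    "hierarchical clustering of expression data",
    "promoter motifs upstream of regulated genes",
    "regulatory motifs in promoter regions",
]

print(cluster(titles))  # [0, 0, 2, 2]
```

The first two titles share three of their five words and fall into one group, the last two share ‘promoter’ and ‘motifs’ and form another. A real analysis would normally remove common words such as ‘of’ and weight the rest, which this sketch does not attempt.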
Figure 1
[Plot from Current Opinion in Structural Biology. Horizontal axis: Year (85 to 01*). Vertical axes: Number of articles (0 to 800) and Fraction related to clustering (%) (0 to 25). Legend: Number of bioinformatics articles; Fraction of bioinformatics articles relating to clustering.]
The graph shows how the number of publications covering bioinformatics has increased over the past two decades and how the fraction of these publications devoted to ‘clustering’ has increased as well. These bibliometric statistics were obtained from the NCBI’s PubMed database. For the first graph, showing the total number of bioinformatics publications (thin line), a query looking for either a ‘computational biology’ MeSH subheading or a small set of bioinformatics-only forums was used (i.e. Bioinformatics [previously CABIOS], the Journal of Computational Biology or the conference proceedings from ISMB or Pacific symposia). These journals and conference proceedings represent ‘bioinformatics-only’ forums, so one doesn’t have the problem of the general increase in the number of papers in PubMed inflating the results. The results of the first query obviously understate the number of bioinformatics publications, but we believe do track some of the increase in the field. Note that the numbers for 2001 reflect that only about a third of the year had elapsed at the time of the search. For the second graph (gray columns), a query looking for the subset of matches to the first query that also contained words such as ‘clustering’ or the MeSH subheading ‘cluster analysis’ was used. The gray bars show the fraction of the second query in the first query per year. Notice how this also intelligently extrapolates a fraction for 2001.

The review by Gaasterland and Oprea (pp 377–381) discusses a fundamental problem in genomics, how to identify proteins in raw genome sequences. They provide a thoughtful discussion of the current state of the annotation of some of the larger eukaryotes, in particular, human and fly. They also discuss how the experimental evidence for verifying proposed new genes, for example, ESTs, cDNAs, microarrays and homology matches, integrates a number of different data sources.

Koehl (pp 348–353) discusses various methods that have been developed to compare proteins in structural terms and highlights the difficulties in arriving at a quantitative measure of protein structure similarity. As discussed below, this is an area in which alternate viewpoints exist, in that structural classifications into discrete groupings are widely used, while there appears to be a continuous aspect to structural space as well. Structural alignments can, in many cases, produce sequence alignments that are superior to those obtained from pure sequence methods, and there is clearly much work to be done in integrating the two approaches, particularly as the amount of three-dimensional structural

[…] also focuses on the use of different methods to deduce functional information. These range from the physical properties of active sites to the use of phylogeny to predict protein–protein interactions. The authors describe the assortment of methods that is being used to predict which proteins interact with one another, a problem that cannot be solved directly from structural genomics projects, which tend to focus on individual domains. An important lesson emphasized in the review involves the limitations, at least at present, of automatic methods. For example, the SCOP database relies on human decisions about evolutionary relationships.

Protein–protein and protein–small molecule interactions are the focus of the review by Ma, Wolfson and Nussinov (pp 364–369), which also discusses how structure comparison can be used to study protein flexibility and plasticity. This article highlights the fact that the fold of the polypeptide chain is only one way of finding relationships between proteins; the nature of the protein surface, where interactions actually occur, is another. Far less attention has been
